Intrinsic Dimensional Correlation Discretization for Mining Task

Abstract:

Discretization is a necessary preprocessing step for many mining tasks and a means of improving the performance of many machine learning algorithms. Existing techniques focus mainly on one-dimensional discretization in low-dimensional data spaces. In this paper, we present an intrinsic dimensional correlation discretization technique for high-dimensional data. The approach first estimates the intrinsic dimensionality (ID) of the data by maximum likelihood estimation (MLE). It then projects the data onto the eigenspace of the estimated (lower) ID by principal component analysis (PCA), which uncovers the correlation structure latent in the multivariate data. All dimensions of the data are thus transformed into a new, independent eigenspace of the ID, where each dimension can be discretized separately under the Bayes discretization model using the MODL discretization method. We also design a heuristic framework to search for a better discretization scheme. Experiments demonstrate that our approach yields a significant improvement in the mean learning accuracy of classifiers over traditional discretization methods.
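The first two stages of the pipeline sketched in the abstract can be illustrated in a few lines of NumPy: the Levina-Bickel MLE estimator of intrinsic dimensionality, followed by a PCA projection onto that many eigen-dimensions. This is a minimal sketch under the assumptions of the cited methods ([9], [10]); the function names `mle_intrinsic_dim` and `pca_project` are illustrative, and the final MODL discretization step is omitted.

```python
import numpy as np

def mle_intrinsic_dim(X, k=10):
    """Levina-Bickel MLE estimate of intrinsic dimensionality.

    X: (n, d) data matrix; k: number of nearest neighbours used per point.
    """
    # Pairwise Euclidean distances between all points.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
    # Sort each row; column 0 is the point itself (distance 0), so skip it.
    knn = np.sort(dists, axis=1)[:, 1:k + 1]          # (n, k) NN distances
    # Per-point estimate: m_k(x) = [(1/(k-1)) * sum_j log(T_k / T_j)]^{-1}
    logs = np.log(knn[:, -1:] / knn[:, :-1])          # (n, k-1) log ratios
    m = (k - 1) / np.sum(logs, axis=1)
    # Average the per-point estimates over the sample.
    return float(np.mean(m))

def pca_project(X, dim):
    """Project centred data onto its top `dim` principal components."""
    Xc = X - X.mean(axis=0)
    # Eigen-decomposition of the sample covariance matrix.
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1][:dim]              # largest eigenvalues first
    return Xc @ vecs[:, order]
```

For data lying on a 2-dimensional linear subspace embedded in a higher-dimensional ambient space, `mle_intrinsic_dim` returns a value close to 2, and each column of the projected matrix can then be discretized independently.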

Pages:

548-554

Online since:

September 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved

References:

[1] H. Liu, F. Hussain, C. L. Tan, and M. Dash, Discretization: an enabling technique, Data Mining and Knowledge Discovery, vol. 6, no. 4, p.393–423, (2002).

[2] C. T. Su and J. H. Hsu, An extended chi2 algorithm for discretization of real value attributes, IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, p.437–441, (2005).

DOI: 10.1109/tkde.2005.39

[3] C. J. Tsai, C. I. Lee, and W. P. Yang, A discretization algorithm based on class-attribute contingency coefficient, Information Sciences, vol. 178, p.714–731, (2008).

DOI: 10.1016/j.ins.2007.09.004

[4] U. Fayyad and K. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, In Proc. Thirteenth International Joint Conference on Artificial Intelligence. San Mateo, CA: Morgan Kaufmann, p.1022–1027, (1993).

[5] M. Boulle, MODL: A Bayes optimal discretization method for continuous attributes, Machine Learning, vol. 65, p.131–165, (2006).

DOI: 10.1007/s10994-006-8364-x

[6] R. M. Jin, Y. Breitbart, and C. Muoh, Data discretization unification, In Proc. Seventh IEEE International Conference on Data Mining (ICDM Best Paper), p.183–192, (2007).

DOI: 10.1109/icdm.2007.35

[7] S. D. Bay, Multivariate discretization for set mining, Knowledge and Information Systems, vol. 3, no. 4, p.491–512, (2001).

DOI: 10.1007/pl00011680

[8] M. Mehta, S. Parthasarathy, and H. Yang, Toward unsupervised correlation preserving discretization, IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, p.1–14, (2005).

DOI: 10.1109/tkde.2005.153

[9] I. T. Jolliffe, Principal component analysis, Springer-Verlag, New York, (1986).

[10] E. Levina and P. J. Bickel, Maximum likelihood estimation of intrinsic dimension, Advances in Neural Information Processing Systems, vol. 17, (2005).

[11] J. Ramirez and F. G. Meyer, Machine learning for seismic signal processing: Seismic phase classification on a manifold, Proceedings of 10th International Conference on Machine Learning and Applications, p.382–388, (2011).

DOI: 10.1109/icmla.2011.91

[12] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences, vol. 11, no. 5, p.341–356, (1982).

[13] S. Hettich and S. D. Bay, The UCI KDD Archive, http://kdd.ics.uci.edu/, (1999).