The Impact of Sample Size on Imbalance Learning

Abstract:

Classification of imbalanced data sets arises in many real-life applications, yet most state-of-the-art classification methods assume the data are relatively balanced and lose effectiveness when they are not. This paper discusses the factors that influence building a classifier capable of identifying rare events, focusing in particular on sample size. Carefully designed experiments in Weka, using Rotation Forest as the base classifier on three data sets from the UCI Machine Learning Repository, show that at a fixed imbalance ratio, enlarging the training set by unsupervised resampling reduces the large error rate caused by the imbalanced class distribution, so that a common classification algorithm can achieve good results.
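The procedure the abstract describes is straightforward to sketch. Below is a minimal illustration in Python with scikit-learn, under stated assumptions: the data set is synthetic (a placeholder for the UCI data sets used in the paper), Rotation Forest is not available in scikit-learn so RandomForestClassifier stands in for the base classifier, and sklearn.utils.resample plays the role of Weka's unsupervised Resample filter, drawing instances with replacement without consulting class labels so the imbalance ratio stays roughly fixed while the training set grows.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 19:1), a stand-in for the UCI data sets.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for factor in (1, 2, 4):
    # Unsupervised resampling: draw instances with replacement, blind to labels,
    # so the class distribution stays approximately fixed as the set grows.
    X_res, y_res = resample(X_tr, y_tr, replace=True,
                            n_samples=factor * len(X_tr), random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    acc = balanced_accuracy_score(y_te, model.predict(X_te))
    print(f"training set x{factor}: balanced accuracy = {acc:.3f}")

Whether and how much the error rate drops depends on the classifier and the imbalance ratio; this sketch only mirrors the shape of the experiment and does not reproduce the paper's Rotation Forest results.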

Info:

Periodical:

Advanced Materials Research (Volumes 756-759)

Pages:

2547-2551

Online since:

September 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved

References:

[1] P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Addison-Wesley, pp. 127-187, (2005).

[2] T. Fawcett and F. Provost, Combining Data Mining and Machine Learning for Effective User Profiling, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 8-13, (1996).

[3] D. Lewis and J. Catlett, Heterogeneous Uncertainty Sampling for Supervised Learning, Proceedings of the 11th International Conference on Machine Learning (ICML'94), pp. 148-156, (1994).

[4] Y. Sun, M. S. Kamel and A. K. C. Wong, Cost-sensitive boosting for classification of imbalanced data, Pattern Recognition, vol. 40, pp. 3358-3378, (2007).

DOI: 10.1016/j.patcog.2007.04.009

[5] S. Visa and A. Ralescu, Issues in Mining Imbalanced Data Sets - A Review Paper, Proc. of the Midwest Artificial Intelligence and Cognitive Science Conference, pp. 67-73, (2005).

[6] G. Weiss and F. Provost, Learning when training data are costly: the effect of class distribution on tree induction, J. Artif. Intell. Res., vol. 19, pp. 315-354, (2003).

DOI: 10.1613/jair.1199

[7] K. Ezawa, M. Singh and S. W. Norton, Learning goal-oriented Bayesian networks for telecommunications risk management, Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, pp. 139-147, (1996).

[8] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, SMOTE: Synthetic Minority Over-Sampling Technique, J. Artificial Intelligence Research, vol. 16, pp. 321-357, (2002).

DOI: 10.1613/jair.953

[9] R. Agarwal and M. V. Joshi, PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-Study in Network Intrusion Detection), Technical Report TR 00-01, Department of Computer Science, University of Minnesota, USA, (2000).

DOI: 10.1137/1.9781611972719.29

[10] D. A. Cieslak and N. V. Chawla, Learning Decision Trees for Unbalanced Data, Proc. of ECML PKDD 2008, pp. 241-256, (2008).
