A Comparison Study of Cost-Sensitive Learning and Sampling Methods on Imbalanced Data Sets

Abstract:

A classifier built from a data set with a highly skewed class distribution generally predicts an unknown sample as the majority class much more frequently than as the minority class, because the classifier is designed to maximize overall classification accuracy. In this paper, we compare three methods for classifying data sets whose class distribution is imbalanced and whose misclassification costs are non-uniform: cost-sensitive learning, in which the misclassification cost is embedded in the algorithm, over-sampling, and under-sampling. The aim is to determine which of the three produces the best overall classification under any given circumstance. We reach the following conclusions: 1. Cost-sensitive learning is suitable for classifying imbalanced data sets. It outperforms the sampling methods overall and is more stable than them, except when the data set is quite small. 2. If the data set is highly skewed or quite small, over-sampling methods may be better.
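To make the three approaches named in the abstract concrete, the following minimal Python sketch shows one common way each can rebalance a binary data set. It is an illustration under stated assumptions, not the procedure used in the paper: the function names, the toy 90/10 label split, and the cost values are hypothetical, and the cost-sensitive variant is shown only as per-sample weights that a weight-aware learner could consume.

```python
import random
from collections import Counter


def split_indices(labels, minority):
    """Return (minority indices, majority indices) for binary labels."""
    idx_min = [i for i, y in enumerate(labels) if y == minority]
    idx_maj = [i for i, y in enumerate(labels) if y != minority]
    return idx_min, idx_maj


def random_oversample(labels, minority, rng):
    """Duplicate randomly chosen minority samples until both classes match in size."""
    idx_min, idx_maj = split_indices(labels, minority)
    extra = [rng.choice(idx_min) for _ in range(len(idx_maj) - len(idx_min))]
    return idx_maj + idx_min + extra


def random_undersample(labels, minority, rng):
    """Keep a random subset of majority samples equal in size to the minority class."""
    idx_min, idx_maj = split_indices(labels, minority)
    return rng.sample(idx_maj, len(idx_min)) + idx_min


def cost_weights(labels, minority, cost_fn, cost_fp):
    """Leave the data untouched; weight each sample by the cost of misclassifying it."""
    return [cost_fn if y == minority else cost_fp for y in labels]


# Toy data (assumption): 90 majority samples (0) and 10 minority samples (1).
labels = [0] * 90 + [1] * 10
rng = random.Random(0)

over = random_oversample(labels, 1, rng)
under = random_undersample(labels, 1, rng)
# Hypothetical costs: missing a minority sample is 9 times as costly.
weights = cost_weights(labels, 1, cost_fn=9.0, cost_fp=1.0)

over_counts = Counter(labels[i] for i in over)     # 90 samples per class
under_counts = Counter(labels[i] for i in under)   # 10 samples per class
minority_weight = sum(w for w, y in zip(weights, labels) if y == 1)
majority_weight = sum(w for w, y in zip(weights, labels) if y == 0)
print(over_counts, under_counts, minority_weight, majority_weight)
```

In this sketch, over-sampling keeps every original sample but repeats minority samples, under-sampling discards 80 of the 90 majority samples, and cost weighting changes no data at all. This difference in how much information each method retains is one reason the size of the data set can matter when choosing among them.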

Info:

Periodical:

Advanced Materials Research (Volumes 271-273)

Pages:

1291-1296

Online since:

July 2011

Copyright:

© 2011 Trans Tech Publications Ltd. All Rights Reserved
