A Comparison Study of Cost-Sensitive Learning and Sampling Methods on Imbalanced Data Sets

Abstract:

A classifier built from a data set with a highly skewed class distribution generally predicts an unknown sample as the majority class much more often than as the minority class. This happens because the classifier is designed to maximize overall classification accuracy. We compare three classification methods for data sets in which the class distribution is imbalanced and the misclassification costs are non-uniform: a cost-sensitive learning method in which the misclassification cost is embedded in the algorithm, an over-sampling method, and an under-sampling method. The aim is to determine which of the three produces the best overall classification under any circumstance. We reach the following conclusions: 1. Cost-sensitive learning is suitable for classifying imbalanced data sets. It outperforms the sampling methods overall and is more stable than they are, except when the data set is quite small. 2. If the data set is highly skewed or quite small, over-sampling methods may be better.
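The three families of methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, the sampling is plain random duplication and removal, and the cost-sensitive step is a stand-in: instead of embedding the cost in the learning algorithm, as the paper does, it applies the standard minimum-expected-cost threshold to a predicted minority-class probability.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every
    class is as large as the largest one (illustrative, not the paper's method)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class so that every class is as small
    as the smallest one (illustrative, not the paper's method)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_samples, out_labels = [], []
    for cls in counts:
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for s in rng.sample(pool, target):
            out_samples.append(s)
            out_labels.append(cls)
    return out_samples, out_labels


def cost_sensitive_predict(p_minority, cost_fp, cost_fn):
    """Predict the minority class (1) when the expected cost of missing it
    exceeds the expected cost of a false alarm; otherwise predict 0.
    Equivalent to thresholding p_minority at cost_fp / (cost_fp + cost_fn)."""
    return 1 if p_minority * cost_fn > (1 - p_minority) * cost_fp else 0


# Hypothetical data: 5 minority samples (label 1) and 95 majority samples (label 0).
samples = list(range(100))
labels = [1] * 5 + [0] * 95

_, over_labels = random_oversample(samples, labels)
_, under_labels = random_undersample(samples, labels)
print(Counter(over_labels), Counter(under_labels))

# With a false negative assumed nine times as costly as a false positive,
# a minority probability of only 0.2 is already enough to predict class 1.
print(cost_sensitive_predict(0.2, cost_fp=1, cost_fn=9))
```

On this hypothetical 95:5 data set, a classifier that always predicts the majority class would reach 95% accuracy while never detecting a minority sample, which is the bias the abstract describes. Over-sampling keeps all the data but repeats minority samples, under-sampling discards majority samples, and the cost-based rule leaves the data untouched and shifts the decision boundary instead.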

Info:

Periodical:

Advanced Materials Research (Volumes 271-273)

Pages:

1291-1296

DOI:

https://doi.org/10.4028/www.scientific.net/AMR.271-273.1291

Online since:

July 2011

Authors:

Jin Wei Zhang, Hui Juan Lu, Wu Tao Chen, Yi Lu

Keywords:

Cost-Sensitive Learning, Misclassification Cost, Over-Sampling, Under-Sampling

Copyright:

© 2011 Trans Tech Publications Ltd. All Rights Reserved
