Hybrid Balancing Technique Using GRSOM and Bootstrap Algorithms for Classifiers with Imbalanced Data

Abstract:

To deal with imbalanced data, this paper proposes a hybrid data balancing technique that combines over-sampling and under-sampling. The technique determines how much the minority data should be grown and how much the majority data should be reduced. In this manner, the noise introduced by excessive over-sampling can be avoided. In addition, the technique helps determine an appropriate size for the balanced data, which reduces the computation time required to construct classifiers. The technique over-samples the minority data using the GRSOM method and then under-samples the majority data using bootstrap sampling. GRSOM is used in this study because it grows new samples in a non-linear fashion and preserves the original data structure. The performance of the proposed method is tested on four data sets from the UCI Machine Learning Repository. Once the data sets are balanced, a committee of classifiers is constructed from the balanced data. The experimental results show that the proposed data balancing method provides the best performance.
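The two-step flow described in the abstract (grow the minority class, then shrink the majority class by bootstrap sampling) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract gives no GRSOM details, so `grow_minority` is a hypothetical stand-in that interpolates linearly between random pairs of minority points, which is not GRSOM and does not reproduce its non-linear growth. The function names, the target sizes, and the toy data are likewise assumptions; the abstract does not say how the target sizes are determined.

```python
import random


def grow_minority(minority, n_new, rng):
    """Stand-in over-sampler (NOT GRSOM): create n_new synthetic points
    by linear interpolation between random pairs of minority points."""
    new_points = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two distinct minority samples
        t = rng.random()                 # interpolation weight in [0, 1)
        new_points.append([x + t * (y - x) for x, y in zip(a, b)])
    return new_points


def bootstrap_reduce(majority, n_keep, rng):
    """Under-sample the majority class by drawing n_keep samples
    with replacement (bootstrap sampling)."""
    return [rng.choice(majority) for _ in range(n_keep)]


def hybrid_balance(minority, majority, n_minority_target,
                   n_majority_target, seed=0):
    """Over-sample the minority class up to n_minority_target, then
    under-sample the majority class down to n_majority_target.
    Both target sizes are hypothetical inputs here."""
    rng = random.Random(seed)
    n_new = max(0, n_minority_target - len(minority))
    grown = minority + grow_minority(minority, n_new, rng)
    reduced = bootstrap_reduce(majority, n_majority_target, rng)
    return grown, reduced


# Toy two-dimensional data: 10 minority points and 200 majority points.
data_rng = random.Random(1)
minority = [[data_rng.gauss(2.0, 0.3), data_rng.gauss(2.0, 0.3)]
            for _ in range(10)]
majority = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)]
            for _ in range(200)]

balanced_min, balanced_maj = hybrid_balance(minority, majority, 40, 40)
print(len(balanced_min), len(balanced_maj))  # 40 40
```

Calling `hybrid_balance` with different `seed` values gives different bootstrap draws of the majority class. That may be one way to supply varied training sets to a committee of classifiers, although the abstract does not state how the committee is built.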

Info:

Periodical:

Advanced Materials Research (Volumes 931-932)

Pages:

1375-1381

Online since:

May 2014

Copyright:

© 2014 Trans Tech Publications Ltd. All Rights Reserved
