A Comparative Study on Feature Selection in Chinese Text Classification Problem

Abstract:

The information explosion poses many challenges for text classification. The curse of dimensionality leads to a sharp increase in computational complexity and to lower classification accuracy, so it is critical to apply feature selection before the actual classification step. Automatic classification of English text has been studied for many years, but Chinese text has received far less attention. In this paper, several classic feature selection methods, namely term frequency (TF), information gain (IG), and the chi-square statistic (CHI), are compared on the task of classifying Chinese text. The comparison also takes imbalanced data into consideration. Experimental results show that CHI performs better than IG and TF when the dataset is imbalanced, while there is no obvious difference among the three methods on balanced data.
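
The three scores compared in the abstract can be illustrated with a minimal sketch. The preview does not give the paper's exact formulas, so the code below assumes the common textbook definitions: TF as the total number of occurrences of a term, CHI as the 2x2 chi-square statistic between a term and a class, and IG as the reduction in class entropy from knowing whether a term is present. The function names, the toy corpus, and the assumption that documents are already segmented into word tokens are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def term_frequency(docs, term):
    """Total number of occurrences of `term` across all documents."""
    return sum(doc.count(term) for doc in docs)

def chi_square(docs, labels, term, cls):
    """Chi-square statistic between presence of `term` and membership in `cls`."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        if label == cls:
            if present:
                a += 1  # in class, contains term
            else:
                c += 1  # in class, lacks term
        else:
            if present:
                b += 1  # outside class, contains term
            else:
                d += 1  # outside class, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((k / total) * math.log2(k / total) for k in counts if k)

def information_gain(docs, labels, term):
    """Class entropy minus the entropy conditioned on presence of `term`."""
    with_term = [l for doc, l in zip(docs, labels) if term in doc]
    without_term = [l for doc, l in zip(docs, labels) if term not in doc]
    n = len(labels)
    conditional = 0.0
    for subset in (with_term, without_term):
        if subset:
            conditional += len(subset) / n * entropy(Counter(subset).values())
    return entropy(Counter(labels).values()) - conditional

# Hypothetical toy corpus: each document is a list of pre-segmented tokens.
docs = [["足球", "比赛"], ["足球", "球员"], ["股票", "市场"], ["股票", "比赛"]]
labels = ["sports", "sports", "finance", "finance"]

tf_score = term_frequency(docs, "足球")
chi_score = chi_square(docs, labels, "足球", "sports")
ig_score = information_gain(docs, labels, "足球")
```

In this toy corpus, "足球" occurs only in the sports documents, so it receives high CHI and IG scores, whereas "比赛" occurs once in each class and scores zero on both. A feature selector would rank all terms by one of these scores and keep the top-ranked ones.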

Info:

Pages:

2854-2857

Online since:

August 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved
