A Comparative Study on Feature Selection in Chinese Text Classification Problem

Hu Li; Peng Zou; Wei Hong Han

doi:10.4028/www.scientific.net/AMM.380-384.2854


Abstract:

The information explosion poses many challenges for text classification. The curse of dimensionality leads to a sharp increase in computational complexity and to lower classification accuracy, so it is critical to apply feature selection before the actual classification. Automatic classification of English text has been studied for many years, but Chinese text has received little attention. In this paper, several classic feature selection methods, namely term frequency (TF), information gain (IG) and the chi-square statistic (CHI), are compared on the task of classifying Chinese text. We also take imbalanced data into consideration. Experimental results show that CHI performs better than IG and TF when the dataset is imbalanced, but that there is no obvious difference among the methods on balanced data.
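For readers unfamiliar with the two statistical criteria named above, the following is a minimal sketch of how CHI and IG are commonly computed for one term and one class from document counts. The abstract does not give the paper's exact formulas, so the textbook definitions are assumed here; the function names and the four-cell count layout (a, b, c, d) are illustrative and not taken from the paper.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square score of a term for a class (textbook form, assumed).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence of dependence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Reduction in class entropy from knowing whether the term is present."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    prior = entropy([a + c, b + d])                 # H(class)
    with_term = (a + b) / n * entropy([a, b])       # P(t) * H(class | t)
    without_term = (c + d) / n * entropy([c, d])    # P(not t) * H(class | not t)
    return prior - with_term - without_term
```

Under these definitions, a term that appears in every document of a class and in no other document receives the maximum score from both criteria, while a term spread evenly across classes scores zero. TF, by contrast, simply ranks terms by how often they occur and ignores class labels.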


Info:

Periodical: Applied Mechanics and Materials (Volumes 380-384)

Pages: 2854-2857

DOI: https://doi.org/10.4028/www.scientific.net/AMM.380-384.2854


Online since: August 2013

Authors: Hu Li, Peng Zou, Wei Hong Han

Keywords: Feature Selection, Imbalanced Data, Text Classification

