Improved Relative Term Frequency Probability Feature Selection for Document Categorization

Abstract:

Feature selection is an important step in document classification: it chooses the subset of features relevant to a particular application. Terms that occur unevenly across categories carry strong discriminative information for categorization. First, based on the categorical document frequency probability (CTFP), a CTFP_VM feature selection algorithm is designed. Second, a maximum term frequency conditional distribution factor is proposed to further improve the CTFP_VM criterion. Document categorization experiments are performed with SVM classifiers on the well-known Reuters-21578 and 20news-18828 corpora, used as the unbalanced and the balanced corpus respectively. The experiments compare the proposed methods with conventional feature selection algorithms, and the proposed method produces an excellent feature set for document categorization.
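
The idea that unevenly distributed terms are the most discriminative can be illustrated with a minimal sketch. It is not the CTFP_VM criterion, whose formula the abstract does not give. As an assumption, it scores each term by the variance of its relative term frequency across categories; the function names and the toy corpus are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def category_term_probabilities(docs, labels):
    """Return {category: {term: relative term frequency in that category}}."""
    counts = {}
    for tokens, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(tokens)
    probs = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        probs[label] = {term: n / total for term, n in counter.items()}
    return probs


def unevenness_scores(docs, labels):
    """Score each term by the variance of its per-category relative frequency.

    Assumption: variance stands in for "occurs unevenly across categories";
    the paper's actual criterion may be defined differently.
    """
    probs = category_term_probabilities(docs, labels)
    vocabulary = {term for dist in probs.values() for term in dist}
    return {
        term: pvariance([dist.get(term, 0.0) for dist in probs.values()])
        for term in vocabulary
    }


def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = unevenness_scores(docs, labels)
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


# Toy corpus (hypothetical): two categories, two tokenized documents each.
docs = [
    ["wheat", "price", "export"],
    ["wheat", "harvest", "price"],
    ["stock", "price", "market"],
    ["stock", "market", "shares"],
]
labels = ["grain", "grain", "finance", "finance"]
top_terms = select_features(docs, labels, k=3)
```

In this toy corpus, "price" appears in both categories and therefore scores lower than "wheat", "stock", or "market", each of which is confined to one category. The selected terms would then be used as the feature set for a classifier such as an SVM.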

Info:

Pages: 1102-1109

Online since: April 2014

Copyright: © 2014 Trans Tech Publications Ltd. All Rights Reserved
