Improved Relative Term Frequency Probability Feature Selection for Document Categorization

Qiang Li; Liang He; Xin Lin

doi:10.4028/www.scientific.net/AMM.548-549.1102



Abstract:

Feature selection is an important step in document classification: it chooses the subset of features relevant to the task. Terms that occur unevenly across categories carry strong discriminative information for categorization. First, based on the categorical term frequency probability (CTFP), we design a feature selection algorithm, CTFP_VM. Second, we propose a maximum term frequency conditional distribution factor to further improve the CTFP_VM criterion. We run document categorization experiments with SVM classifiers on the well-known Reuters-21578 and 20news-18828 corpora, which serve as the unbalanced and balanced corpus respectively. The experiments compare the proposed methods with conventional feature selection algorithms, and the proposed method yields an excellent feature set for document categorization.
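The idea that unevenly distributed terms are the discriminative ones can be illustrated with a minimal sketch. The abstract does not give the CTFP_VM formula, so the score below is an assumption: it takes each term's relative frequency within every category and ranks terms by the ratio of the variance to the mean of those per-category probabilities (the "Variance Mean" keyword suggests something along these lines, but this is not the authors' criterion). The function names and the toy data are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def category_term_probabilities(docs):
    """Return {category: {term: relative term frequency within that category}}.

    docs is a list of (category, tokens) pairs.
    """
    counts = {}
    for category, tokens in docs:
        counts.setdefault(category, Counter()).update(tokens)
    probs = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        probs[category] = {term: n / total for term, n in counter.items()}
    return probs


def variance_mean_scores(docs):
    """Score each term by variance / mean of its per-category probabilities.

    Assumed stand-in for CTFP_VM: a term spread evenly over all categories
    scores 0, a term concentrated in one category scores high.
    """
    probs = category_term_probabilities(docs)
    vocabulary = set().union(*(p.keys() for p in probs.values()))
    scores = {}
    for term in vocabulary:
        per_category = [probs[c].get(term, 0.0) for c in probs]
        m = mean(per_category)
        scores[term] = pvariance(per_category) / m if m > 0 else 0.0
    return scores


def select_features(docs, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = variance_mean_scores(docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ("sport", ["goal", "match", "the"]),
    ("sport", ["goal", "the"]),
    ("finance", ["stock", "the", "market"]),
    ("finance", ["stock", "the"]),
]
top_terms = select_features(docs, 2)
```

On this toy corpus, "the" occurs with the same relative frequency in both categories and so receives a score of zero, while category-specific terms such as "goal" and "stock" rank first. The selected terms would then be used as the feature set for a classifier such as an SVM.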


Info:

Periodical: Applied Mechanics and Materials (Volumes 548-549)
Pages: 1102-1109
DOI: https://doi.org/10.4028/www.scientific.net/AMM.548-549.1102
Online since: April 2014
Authors: Qiang Li, Liang He, Xin Lin
Keywords: Categorical Distribution, Category Tendency, Distribution Probability, Term Frequency, Variance Mean

