Study and Analyze on Feature Selection in Text Categorization for Engineering Domain

Abstract:

This paper first briefly introduces document frequency (DF), expected cross entropy, mutual information (MI), information gain (IG), and the statistic method. Then, using the KNN classification algorithm, it evaluates the four feature selection methods by recall, precision, and F1. Finally, it proposes and discusses a method for improving MI.

Info:

Pages: 383-386

Online since: March 2012

Copyright: © 2012 Trans Tech Publications Ltd. All Rights Reserved

Table 2. Feature extraction results

Class topics      Test indicator    Document     Mutual        Information   Expected        Statistics
(language                           frequency    information   gain          cross entropy
material)

Education         Precision rate    80.37%       73.71%        75.65%        79.23%          78.32%
                  Recall rate       87.83%       81.97%        86.28%        80.28%          83.74%
                  F1 value          83.93%       77.62%        80.62%        79.75%          80.94%

Computer          Precision rate    77.32%       72.14%        73.48%        86.14%          81.21%
                  Recall rate       79.32%       81.38%        85.31%        66.14%          71.21%
                  F1 value          78.30%       76.48%        78.95%        75.78%          78.00%

Environment       Precision rate    81.03%       79.26%        76.38%        84.13%          83.18%
                  Recall rate       78.57%       78.91%        81.84%        77.56%          75.79%
                  F1 value          79.87%       79.14%        78.86%        82.27%          79.57%

Traffic           Precision rate    77.69%       75.97%        77.46%        76.16%          76.42%
                  Recall rate       81.56%       81.34%        83.23%        85.75%          78.37%
                  F1 value          79.56%       78.62%        80.45%        81.23%          77.58%

Military          Precision rate    75.47%       69.21%        73.15%        74.54%          71.26%
                  Recall rate       80.14%       80.52%        84.71%        86.23%          81.76%
                  F1 value          77.73%       74.44%        78.51%        79.96%          76.15%

Economic          Precision rate    78.26%       72.17%        74.42%        78.39%          75.36%
                  Recall rate       83.47%       81.47%        86.54%        76.47%          85.84%
                  F1 value          80.78%       76.54%        80.23%        77.42%          80.26%

Real estate       Precision rate    77.38%       71.87%        74.84%        73.62%          72.26%
                  Recall rate       82.73%       75.13%        82.47%        89.38%          81.46%
                  F1 value          79.97%       73.46%        78.47%        80.74%          76.58%
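The F1 values in Table 2 appear consistent with the standard definition of F1 as the harmonic mean of precision and recall. The short Python sketch below illustrates that relationship using the Education row under document frequency. The function name `f1_score` is chosen here for illustration and is not taken from the paper, and the formula is the usual textbook one rather than something the preview states explicitly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Both arguments must use the same scale (both fractions or both
    percentages); the result is on that same scale.
    """
    if precision + recall == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Education class, document frequency column of Table 2 (percentages).
education_df_f1 = f1_score(80.37, 87.83)
print(round(education_df_f1, 2))  # 83.93
```

The same calculation applied to the Computer row under mutual information (precision 72.14%, recall 81.38%) gives about 76.48%, which matches the table entry.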