A Text Categorization Method Based on Features Clustering

Abstract:

Feature selection is an important part of text categorization, and its result affects both the quality and the efficiency of the text categorizer. Because a text usually has thousands of features, the dimension of the feature space generally needs to be reduced. Considering the semantic relationships among words, this paper proposes a new text categorization method based on feature clustering. The method first uses word segmentation to split texts into words, then removes stop words and words with low information content, and then calculates the distribution of words across the texts to construct a word co-occurrence matrix. Clustering algorithms are then applied to reduce the dimension of the feature space. Finally, experiments are carried out on two corpora using several text categorization algorithms. The results demonstrate that the new method not only improves the precision and recall of text categorization but also increases its efficiency.
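The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer stands in for a real word segmenter, the stop-word list and the `min_df` and `threshold` parameters are hypothetical, and the greedy cosine-similarity grouping is a simple substitute for whichever clustering algorithms the paper actually uses.

```python
import math
from collections import Counter

# Illustrative stop-word list (assumption, not from the paper).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Whitespace split; a stand-in for a real word segmenter."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_vocab(docs_tokens, min_df=2):
    """Drop low-information words: keep those found in at least min_df documents."""
    df = Counter(w for toks in docs_tokens for w in set(toks))
    return sorted(w for w, n in df.items() if n >= min_df)


def cooccurrence(docs_tokens, vocab):
    """C[i][j] = number of documents that contain both word i and word j."""
    index = {w: i for i, w in enumerate(vocab)}
    C = [[0] * len(vocab) for _ in vocab]
    for toks in docs_tokens:
        present = {index[w] for w in toks if w in index}
        for i in present:
            for j in present:
                C[i][j] += 1
    return C


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_words(C, threshold=0.8):
    """Greedy single pass: a word joins the first cluster whose seed row is similar enough."""
    seeds, labels = [], []
    for row in C:
        for k, s in enumerate(seeds):
            if cosine(row, C[s]) >= threshold:
                labels.append(k)
                break
        else:
            seeds.append(len(labels))      # index of the current word becomes a new seed
            labels.append(len(seeds) - 1)
    return labels


def doc_features(tokens, vocab, labels):
    """Represent a document by word counts per cluster instead of per word."""
    index = {w: i for i, w in enumerate(vocab)}
    feats = [0] * (max(labels) + 1 if labels else 0)
    for w in tokens:
        if w in index:
            feats[labels[index[w]]] += 1
    return feats


docs = [
    "the stock market rises",
    "the stock market falls",
    "football match tonight",
    "football match result",
]
tokens = [tokenize(d) for d in docs]
vocab = build_vocab(tokens)            # ['football', 'market', 'match', 'stock']
labels = cluster_words(cooccurrence(tokens, vocab))
print(labels)                          # [0, 1, 0, 1]
print(doc_features(tokens[0], vocab, labels))  # [0, 2]
```

In this toy corpus four word features collapse into two cluster features, and the resulting vectors could then be passed to any text categorization algorithm.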

Info:

Periodical:

Advanced Materials Research (Volumes 532-533)

Pages:

1090-1094

Online since:

June 2012

Copyright:

© 2012 Trans Tech Publications Ltd. All Rights Reserved
