A Text Categorization Method Based on Features Clustering

Zhi Bin Feng; Ping Jian Zhang; Juan Juan Zhao

doi:10.4028/www.scientific.net/AMR.532-533.1090

Abstract:

Feature selection is an important part of text categorization, and its result affects both the quality and the efficiency of the text categorizer. Because a text usually has thousands of features, the dimension of the feature space almost always needs to be reduced. Considering the semantic relationships among words, this paper proposes a new text categorization method based on features clustering. The method first uses word segmentation to split texts into words, then removes stop words and words with low information, and then calculates the distribution of words across the texts to construct a matrix of co-occurrence words. After that, clustering algorithms are employed to reduce the dimension of the feature space. Finally, experiments are carried out on two corpora using several text categorization algorithms. The results demonstrate that the new method not only improves the precision and recall of text categorization but also increases its efficiency.
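The pipeline outlined in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the clustering algorithm, the similarity measure, or the criterion for low-information words, so the choices below are assumptions made only for illustration. Whitespace splitting stands in for word segmentation, a minimum document frequency (`min_df`) stands in for the low-information filter, and a greedy single-pass clustering over cosine similarity stands in for the unspecified clustering step. All function names, the stop-word list, and the sample texts are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical stop-word list (assumption); a real system would use a
# language-specific list.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}

def preprocess(texts, min_df=2):
    """Split texts into words, drop stop words, drop words seen in fewer than min_df texts."""
    # Whitespace splitting stands in for the word segmentation step (assumption).
    docs = [[w for w in t.lower().split() if w not in STOP_WORDS] for t in texts]
    df = Counter(w for d in docs for w in set(d))  # document frequency per word
    return [[w for w in d if df[w] >= min_df] for d in docs]

def cooccurrence(docs):
    """Return (vocab, m) where m[i][j] counts the texts that contain both word i and word j."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    m = [[0] * len(vocab) for _ in vocab]
    for d in docs:
        present = sorted(index[w] for w in set(d))
        for i in present:
            m[i][i] += 1  # diagonal holds the word's own document frequency
        for i, j in combinations(present, 2):
            m[i][j] += 1
            m[j][i] += 1
    return vocab, m

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_words(vocab, m, threshold=0.8):
    """Greedy single-pass clustering (assumption): a word joins the first cluster
    whose seed word has a sufficiently similar co-occurrence row."""
    clusters = []  # lists of word indices; element 0 is the cluster seed
    for i in range(len(vocab)):
        for c in clusters:
            if cosine(m[i], m[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return [[vocab[i] for i in c] for c in clusters]

def to_cluster_features(doc, clusters):
    """Represent a text by the number of its words that fall in each cluster."""
    counts = Counter(doc)
    return [sum(counts[w] for w in c) for c in clusters]

texts = [
    "the cat chased a mouse",
    "a cat and a mouse",
    "stock market prices fell",
    "the stock market rose",
]
docs = preprocess(texts)
vocab, matrix = cooccurrence(docs)
clusters = cluster_words(vocab, matrix)
features = [to_cluster_features(d, clusters) for d in docs]
print(clusters)
print(features)
```

In this toy example the four surviving words collapse into two cluster features, so each text is described by a two-dimensional vector instead of one dimension per word. Those vectors could then be passed to any text categorization algorithm.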

Info:

Periodical:

Advanced Materials Research (Volumes 532-533)

Pages:

1090-1094

DOI:

https://doi.org/10.4028/www.scientific.net/AMR.532-533.1090

Online since:

June 2012

Authors:

Zhi Bin Feng, Ping Jian Zhang, Juan Juan Zhao

Keywords:

Feature Selection, Features Clustering, Matrix of Co-Occurrence Words, Text Categorization
