A Text Hybrid Clustering Algorithm Based on HowNet Semantics

Abstract:

Many existing text clustering algorithms ignore the semantic relationships between words and therefore compute text similarity with low accuracy. This paper proposes a new text hybrid clustering algorithm (HCA) based on HowNet semantics. It computes the semantic similarity of words from their semantic concept descriptions in HowNet, then applies maximum weight matching on a bipartite graph to derive a semantic-based text similarity. HCA is built on this new text similarity and combines an improved genetic algorithm with the k-medoids algorithm. Comparative experiments show that (1) HCA achieves better clustering quality than two existing traditional clustering algorithms, and (2) replacing text cosine similarity with the new semantic-based text similarity significantly improves the quality of all three clustering algorithms.
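The matching step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes each text is a short list of keywords, uses a hypothetical lookup table (TOY_WORD_SIM, with invented values) in place of a real HowNet-based word similarity, finds the maximum weight bipartite matching with an exact dynamic program that is only practical for short word lists, and normalizes by the length of the longer text, which is one plausible choice and not necessarily the one used in the paper.

```python
from functools import lru_cache

# Hypothetical stand-in for a HowNet-based word similarity in [0, 1].
# The pairs and values below are invented for illustration only.
TOY_WORD_SIM = {
    frozenset(["car", "automobile"]): 0.9,
    frozenset(["fast", "quick"]): 0.8,
    frozenset(["car", "quick"]): 0.1,
}


def word_sim(w1, w2):
    """Return the toy similarity of two words (1.0 if identical, 0.0 if unknown)."""
    if w1 == w2:
        return 1.0
    return TOY_WORD_SIM.get(frozenset([w1, w2]), 0.0)


def max_weight_matching(weights):
    """Return the largest total weight of any matching between rows and columns.

    weights[i][j] is the weight of the edge between row i and column j.
    Exact dynamic program over subsets of used columns; exponential in the
    number of columns, so suitable only for short word lists.
    """
    n_rows = len(weights)
    n_cols = len(weights[0]) if weights else 0

    @lru_cache(maxsize=None)
    def best(i, used):
        if i == n_rows:
            return 0.0
        # Option 1: leave row i unmatched.
        result = best(i + 1, used)
        # Option 2: match row i with any column that is still free.
        for j in range(n_cols):
            if not used & (1 << j):
                candidate = weights[i][j] + best(i + 1, used | (1 << j))
                result = max(result, candidate)
        return result

    return best(0, 0)


def text_similarity(words_a, words_b):
    """Semantic text similarity: matching weight divided by the longer length.

    The normalization is an assumption made for this sketch.
    """
    if not words_a or not words_b:
        return 0.0
    weights = [[word_sim(a, b) for b in words_b] for a in words_a]
    return max_weight_matching(weights) / max(len(words_a), len(words_b))
```

For example, text_similarity(["fast", "car"], ["quick", "automobile"]) pairs "fast" with "quick" and "car" with "automobile", giving about 0.85 under the toy table, whereas a plain cosine similarity over these word sets would be 0 because the texts share no words. For longer texts, a polynomial-time assignment solver would be the natural replacement for the dynamic program.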

Info:

Periodical:

Key Engineering Materials (Volumes 474-476)

Pages:

2071-2078

Online since:

April 2011

Copyright:

© 2011 Trans Tech Publications Ltd. All Rights Reserved
