New Words Identification Based on Ensemble Methods

Abstract:

To identify new words in large Chinese corpora efficiently, this paper proposes an algorithm based on ensemble methods. First, we segment the Chinese text with a trie and build a segment tree. Next, we select four sub-models for the ensemble (word-pattern extraction, frequency filtering, independent word probability, and a naive Bayes model) and train them independently. Finally, we integrate the results of the sub-models with a multi-layer model. In experiments, the algorithm proves to be fast and produces precise, high-coverage results.
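The segmentation and frequency-filtering steps can be illustrated with a minimal sketch. The abstract does not say which matching strategy or candidate rule the authors use, so forward maximum matching and the "runs of unmatched single characters" heuristic below are assumptions, and the names `Trie`, `segment`, and `candidate_counts` are hypothetical.

```python
from collections import Counter


class Trie:
    """Minimal dictionary trie keyed by single characters (illustrative only)."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True  # multi-character key, so it cannot clash with a character

    def longest_match(self, text, start):
        """Length of the longest dictionary word starting at `start` (0 if none)."""
        node, best = self.root, 0
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "$end" in node:
                best = i - start + 1
        return best


def segment(text, trie):
    """Forward maximum matching (an assumed strategy, not confirmed by the paper).

    Characters that start no dictionary word become single-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        length = trie.longest_match(text, i) or 1
        tokens.append(text[i:i + length])
        i += length
    return tokens


def candidate_counts(sentences, trie):
    """Count runs of adjacent unmatched single characters as new-word candidates.

    This heuristic is an assumption used to illustrate frequency filtering.
    """
    counts = Counter()
    for sentence in sentences:
        run = ""
        for token in segment(sentence, trie) + [""]:  # empty sentinel flushes the last run
            if len(token) == 1 and trie.longest_match(token, 0) == 0:
                run += token
            else:
                if len(run) > 1:
                    counts[run] += 1
                run = ""
    return counts


trie = Trie(["我们", "喜欢", "自然", "语言", "自然语言"])
print(segment("我们喜欢自然语言处理", trie))  # ['我们', '喜欢', '自然语言', '处', '理']
print(candidate_counts(["我们喜欢自然语言处理", "处理语言"], trie))
```

Candidates whose count falls below a chosen threshold would be discarded at this stage; the remaining sub-models and the multi-layer integration described in the abstract are not sketched here because the preview gives no detail on them.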

Info:

Pages: 1626-1629

Online since: August 2014

Copyright: © 2014 Trans Tech Publications Ltd. All Rights Reserved
