Vari-Gram Language Model Based on Category

Abstract:

Category-based statistical language models are an important way to address the problem of sparse data. However, they face two bottlenecks: (1) word clustering, where it is hard to find a method that performs well without a large amount of computation; and (2) class-based methods always lose some predictive ability when adapting to text from different domains. This paper tries to address both problems. It presents a novel definition of word similarity and, based on it, defines the similarity between word sets. Experiments show that the word clustering algorithm based on this similarity is better than the conventional greedy clustering method in both speed and performance. The paper also presents a new method for constructing the vari-gram model.
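To make the sparse-data motivation concrete, the sketch below shows the standard class-based bigram factorization, P(w | w_prev) ≈ P(c(w) | c(w_prev)) · P(w | c(w)), on a toy corpus. It is not the paper's clustering algorithm or its vari-gram model, whose details are not given in the abstract. The function name `train_class_bigram`, the toy corpus, and the hand-assigned word-to-class map are all hypothetical and used only for illustration.

```python
from collections import Counter

def train_class_bigram(tokens, word2class):
    """Maximum-likelihood class-based bigram model (illustrative only)."""
    classes = [word2class[w] for w in tokens]
    word_counts = Counter(tokens)
    class_counts = Counter(classes)
    class_bigrams = Counter(zip(classes, classes[1:]))
    # A class counts as a history only if another token follows it.
    history_counts = Counter(classes[:-1])

    def prob(word, prev):
        c, c_prev = word2class[word], word2class[prev]
        if history_counts[c_prev] == 0:
            return 0.0
        p_class = class_bigrams[(c_prev, c)] / history_counts[c_prev]
        p_word = word_counts[word] / class_counts[c]
        return p_class * p_word

    return prob

# Hypothetical toy data: the classes are assigned by hand, not learned.
tokens = "the cat sat the dog ran".split()
word2class = {"the": "DET", "cat": "NOUN", "dog": "NOUN",
              "sat": "VERB", "ran": "VERB"}
prob = train_class_bigram(tokens, word2class)

# The word bigram "cat ran" never occurs in the corpus, yet the class
# model still assigns it probability mass via the NOUN -> VERB transition.
print(prob("ran", "cat"))
```

A plain word bigram model would give the unseen pair "cat ran" zero probability. Sharing statistics across a class avoids this, at the cost of treating all words in the class alike, which is the loss of predictive ability the abstract refers to.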

Info:

Pages: 995-1000

Online since: June 2011

Copyright: © 2011 Trans Tech Publications Ltd. All Rights Reserved
