iTagger: Part-of-Speech Tagging Based on SBCB Learning Algorithm

Abstract:

Part-of-speech (POS) tagging, or disambiguation, is a practical problem in the natural language processing (NLP) community, especially in the development of machine translation systems. The performance of a POS tagging system may interfere with subsequent analytical tasks in the translation process and thereby affect overall translation quality. This paper presents a novel POS tagging system, iTagger, based on the Selecting Base Classifiers on Bagging (SBCB) learning algorithm. In this work, POS tagging is treated as a classification problem. Features such as the surrounding context of ambiguous candidates, n-gram information, lexical items and linguistic clues are extracted automatically from the annotated corpus. The proposed system is compared against two state-of-the-art tagging methods, the Hidden Markov Model (HMM) and Maximum Entropy. Empirical results on the English Brown corpus, the Portuguese Tycho Brahe corpus and the Chinese Tree Bank corpus show that iTagger is competitive. Moreover, iTagger has been released to the public as a library and tool for various development and application purposes.
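
To make the idea of tagging as classification more concrete, the following is a minimal Python sketch of context-window feature extraction feeding a bagged ensemble from which only some base classifiers are kept. It is not the iTagger implementation and not the SBCB algorithm itself: the feature set, the toy `FeatureVoteClassifier`, the rule of keeping the base classifiers with the best held-out accuracy, and all function names are assumptions made purely for illustration.

```python
import random
from collections import Counter, defaultdict


def context_features(tokens, i):
    """Hypothetical feature set: the word, its neighbours, a bigram, and lexical clues."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
    return {
        "word": word.lower(),
        "prev": prev_word.lower(),
        "next": next_word.lower(),
        "bigram": prev_word.lower() + "_" + word.lower(),
        "suffix2": word[-2:].lower(),
        "is_capitalized": word[:1].isupper(),
    }


class FeatureVoteClassifier:
    """Toy base classifier: each feature value votes for the tags it was seen with."""

    def fit(self, examples):
        self.counts = defaultdict(Counter)
        self.default = Counter(tag for _, tag in examples).most_common(1)[0][0]
        for feats, tag in examples:
            for key, value in feats.items():
                self.counts[(key, value)][tag] += 1
        return self

    def predict(self, feats):
        votes = Counter()
        for key, value in feats.items():
            votes.update(self.counts.get((key, value), {}))
        # Fall back to the most frequent training tag if no feature was seen.
        return votes.most_common(1)[0][0] if votes else self.default


def train_bagged(examples, held_out, n_estimators=10, keep=5, seed=0):
    """Train base classifiers on bootstrap samples, then keep the `keep` best.

    Ranking by held-out accuracy is an assumed stand-in for a selection step;
    the actual SBCB selection criterion is not described in the abstract.
    """
    rng = random.Random(seed)
    scored = []
    for _ in range(n_estimators):
        sample = [rng.choice(examples) for _ in examples]  # bootstrap sample
        clf = FeatureVoteClassifier().fit(sample)
        acc = sum(clf.predict(f) == t for f, t in held_out) / len(held_out)
        scored.append((acc, clf))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [clf for _, clf in scored[:keep]]


def predict_tag(ensemble, feats):
    """Majority vote over the selected base classifiers."""
    return Counter(clf.predict(feats) for clf in ensemble).most_common(1)[0][0]


# Tiny made-up corpus, for illustration only.
sentences = [
    (["The", "dog", "barks"], ["DET", "NOUN", "VERB"]),
    (["A", "cat", "sleeps"], ["DET", "NOUN", "VERB"]),
    (["The", "cat", "barks"], ["DET", "NOUN", "VERB"]),
]
examples = [
    (context_features(toks, i), tag)
    for toks, tags in sentences
    for i, tag in enumerate(tags)
]
ensemble = train_bagged(examples, held_out=examples)
print(predict_tag(ensemble, context_features(["The", "dog", "sleeps"], 1)))
```

In practice the held-out set would be separate from the training data, and the base learner would be a real classifier rather than this voting toy; the sketch only shows how the pieces fit together.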

Info:

Pages:

3449-3453

Online since:

January 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved
