Multi Queries Methods of the Chinese-English Bilingual Plagiarism Detection

Abstract:

Cross-language plagiarism detection identifies and extracts plagiarized text in a multilingual environment. In recent years, a significant amount of work has addressed English and other European languages, but somewhat less attention has been paid to Asian languages. We compare a number of strategies for Chinese-English bilingual plagiarism detection. We present methods for candidate document retrieval and compare four query formulation methods: (i) document-keyword-based, (ii) intrinsic-plagiarism-based, (iii) header-based, and (iv) machine-translation-based queries. Our evaluation indicates that keyword-based queries, the simplest and most efficient approach, give acceptable results for newspaper articles. We also compare queries built from different percentages of a document's keywords; the results indicate that putting 50% of the keywords into the queries yields a satisfactory set of candidate documents.
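To make the keyword-percentage idea concrete, the following is a minimal sketch of how a keyword-based query might be built from a fixed fraction of a document's ranked keywords. It is not the authors' implementation: the abstract does not say how keywords are extracted or weighted, so the TF-IDF ranking, the English-only tokenizer, and the names `tokenize`, `rank_keywords`, and `build_query` are all assumptions made for illustration. Chinese word segmentation and the translation of keywords into the target language are omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (hypothetical, English-only)."""
    return re.findall(r"[a-z]+", text.lower())


def rank_keywords(document, corpus):
    """Rank the distinct terms of `document` by a smoothed TF-IDF score.

    TF-IDF is an assumed weighting scheme; the source does not name one.
    """
    tf = Counter(tokenize(document))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(doc_sets)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for terms in doc_sets if term in terms)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = count * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))


def build_query(document, corpus, fraction=0.5):
    """Return the top `fraction` of ranked keywords as a list of query terms."""
    ranked = rank_keywords(document, corpus)
    k = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:k]


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "plagiarism detection across languages",
    ]
    suspicious = "plagiarism detection the plagiarism"
    print(build_query(suspicious, corpus, fraction=0.5))
```

Varying `fraction` (for example 0.25, 0.5, 0.75) and measuring how many true source documents each query retrieves would mirror, in spirit, the comparison of keyword percentages described above.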

Info:

Pages:

1158-1162

Online since:

November 2013

Copyright:

© 2014 Trans Tech Publications Ltd. All Rights Reserved
