Apply Language Nature Rhythm to Large Scale Duplicated Text Detection

Abstract:

Detecting duplication in large-scale Web text is an urgent problem. This paper proposes an algorithm for text duplication detection based on language rhythm. It extracts the natural rhythm marked by the punctuation in a text and builds a rhythm comparison matrix to carry out duplication detection for each paragraph. Unlike other algorithms, which are based on word analysis, this one relies on punctuation rhythm, and it has high accuracy and low complexity.
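The abstract does not give the details of the rhythm extraction or of the comparison matrix, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that a paragraph's rhythm is the list of character counts between punctuation marks, that two rhythms are scored with the similarity ratio from Python's `difflib.SequenceMatcher`, and that the matrix holds one score per pair of paragraphs. The names `PUNCTUATION`, `rhythm`, `rhythm_similarity`, and `compare_matrix` are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumed punctuation set (ASCII plus common Chinese marks);
# the abstract does not specify which marks are used.
PUNCTUATION = r"[,.;:!?，。；：！？、]"


def rhythm(paragraph):
    """Return the lengths of the text runs between punctuation marks."""
    segments = re.split(PUNCTUATION, paragraph)
    return [len(s.strip()) for s in segments if s.strip()]


def rhythm_similarity(a, b):
    """Score two rhythm sequences in [0, 1]; an assumed stand-in metric."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def compare_matrix(paragraphs_a, paragraphs_b):
    """Build a matrix of rhythm similarities, one row per paragraph of A."""
    rhythms_a = [rhythm(p) for p in paragraphs_a]
    rhythms_b = [rhythm(p) for p in paragraphs_b]
    return [[rhythm_similarity(x, y) for y in rhythms_b] for x in rhythms_a]


doc_a = ["Hello, world. This is a test!"]
doc_b = ["Howdy, earth. That is a pest!", "No punctuation here"]
print(rhythm(doc_a[0]))              # [5, 5, 14]
print(compare_matrix(doc_a, doc_b))  # [[1.0, 0.0]]
```

In this sketch the first pair scores 1.0 even though the words differ, because only the punctuation rhythm is compared. A practical detector would presumably need a threshold and longer paragraphs to avoid such coincidental matches.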

Info:

Periodical:

Advanced Materials Research (Volumes 457-458)

Pages:

635-640

Online since:

January 2012

Copyright:

© 2012 Trans Tech Publications Ltd. All Rights Reserved
