Near Duplicated Text Detection Based on MapReduce

Abstract:

As emerging data-intensive applications receive more and more attention from researchers, near-duplicate text detection on large-scale data has become a severe challenge. This paper presents an algorithm, based on MapReduce and an ontology, that detects near-duplicate text by computing pairwise document similarity in large-scale document collections. We map the words in each document to their synonyms and then calculate the similarity between documents. MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. In large-scale tests, experimental results show that this approach outperforms other state-of-the-art solutions. Advantages such as linear running time and accuracy make the algorithm valuable in practice.
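The map/reduce contract and the synonym-mapping step described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' algorithm: the `SYNONYMS` table is a hypothetical stand-in for the ontology, `run_mapreduce` is a toy substitute for a real MapReduce runtime, and Jaccard similarity over term sets is an assumed measure, since the abstract does not name one. The two-job layout (inverted index, then pair counting) is one common pattern for pairwise similarity, chosen here only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical synonym table standing in for the paper's ontology (assumption).
SYNONYMS = {"car": "automobile", "auto": "automobile",
            "quick": "fast", "rapid": "fast"}


def normalize(text):
    """Lowercase, split on whitespace, and map each word to a canonical synonym."""
    return {SYNONYMS.get(word, word) for word in text.lower().split()}


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory runtime: apply the mapper, group by key, apply the reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results


# Job 1: build an inverted index, term -> sorted list of document ids.
def index_map(doc_id, text):
    for term in normalize(text):
        yield term, doc_id


def index_reduce(term, doc_ids):
    yield term, sorted(doc_ids)


# Job 2: emit 1 for every document pair sharing a term, then sum per pair.
def pair_map(term, doc_ids):
    for a, b in combinations(doc_ids, 2):
        yield (a, b), 1


def pair_reduce(pair, counts):
    yield pair, sum(counts)


def jaccard_pairs(docs):
    """Return Jaccard similarity for every document pair sharing at least one term."""
    sizes = {doc_id: len(normalize(text)) for doc_id, text in docs.items()}
    postings = run_mapreduce(docs.items(), index_map, index_reduce)
    shared = run_mapreduce(postings, pair_map, pair_reduce)
    return {(a, b): n / (sizes[a] + sizes[b] - n) for (a, b), n in shared}


if __name__ == "__main__":
    docs = {
        "d1": "the quick car",
        "d2": "the rapid auto",
        "d3": "a slow train",
    }
    for pair, score in jaccard_pairs(docs).items():
        print(pair, score)
```

In this sketch, "d1" and "d2" share no surface words beyond "the", yet they normalize to the same term set once synonyms are mapped, so they would be flagged as near duplicates under any reasonable threshold. Pairs with no shared term never appear in the output, because the second job only emits pairs that co-occur in some posting list.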

Info:

Pages:

2618-2621

Online since:

September 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved

