Near Duplicated Text Detection Based on MapReduce

Ling Shen; Qing Xi Peng

doi:10.4028/www.scientific.net/AMM.427-429.2618

Abstract:

As emerging data-intensive applications receive growing attention from researchers, near-duplicated text detection on large-scale data has become a severe challenge. This paper presents an algorithm based on MapReduce and ontology for near-duplicated text detection that computes pairwise document similarity over large-scale document collections. We map the words in each document to their synonyms and then calculate the similarity between documents. MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. In large-scale tests, experimental results demonstrate that this approach outperforms other state-of-the-art solutions. Advantages such as linear running time and accuracy make the algorithm valuable in practice.
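
The abstract does not give the algorithm's details, so the following is only a minimal sketch of how the described pipeline could look, not the authors' implementation. It assumes a commonly used inverted-index formulation of pairwise similarity: one map/reduce pass builds term postings with length-normalized weights, and a second pass sums per-term products for every document pair, which yields cosine similarity. The `SYNONYMS` table is a hypothetical stand-in for an ontology lookup, `run_mapreduce` is a single-process imitation of a MapReduce runtime, and the `threshold` value is an arbitrary illustration.

```python
from collections import defaultdict
from itertools import combinations
import math

# Hypothetical synonym table standing in for an ontology lookup (assumption).
SYNONYMS = {"car": "automobile", "auto": "automobile",
            "quick": "fast", "rapid": "fast"}


def normalize(word):
    """Map a word to its canonical synonym, or keep it unchanged."""
    w = word.lower()
    return SYNONYMS.get(w, w)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output


# Job 1: build an inverted index of (term -> [(doc_id, weight), ...]).
def index_mapper(doc_id, text):
    counts = defaultdict(int)
    for word in text.split():
        counts[normalize(word)] += 1
    # L2-normalize term frequencies so that dot products are cosine similarities.
    norm = math.sqrt(sum(c * c for c in counts.values()))
    for term, c in counts.items():
        yield term, (doc_id, c / norm)


def index_reducer(term, postings):
    # Sorting by doc_id gives every document pair a single, consistent key.
    yield term, sorted(postings)


# Job 2: emit per-term partial products for each pair, then sum them.
def pair_mapper(term, postings):
    for (d1, w1), (d2, w2) in combinations(postings, 2):
        yield (d1, d2), w1 * w2


def pair_reducer(pair, partials):
    yield pair, sum(partials)


def near_duplicates(docs, threshold=0.8):
    """Return (pair, similarity) for document pairs at or above the threshold."""
    index = run_mapreduce(docs.items(), index_mapper, index_reducer)
    sims = run_mapreduce(index, pair_mapper, pair_reducer)
    return [(pair, s) for pair, s in sims if s >= threshold]
```

Calling `near_duplicates({"a": "the quick car", "b": "the rapid auto", "c": "completely different words here"})` would report only the pair `("a", "b")`, because synonym mapping makes those two documents identical term for term while `"c"` shares no terms with either. Note that the pair-generation step is quadratic in the length of each posting list, so a practical job would likely prune very frequent terms; that choice is outside what the abstract states.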

Info:

Periodical:

Applied Mechanics and Materials (Volumes 427-429)

Pages:

2618-2621

DOI:

https://doi.org/10.4028/www.scientific.net/AMM.427-429.2618

Online since:

September 2013

Authors:

Ling Shen*, Qing Xi Peng

Keywords:

MapReduce, Near Duplicated Text, Ontology

* - Corresponding Author
