Near Duplicated Text Detection Based on MapReduce
Abstract:
As emerging data-intensive applications attract growing attention from researchers, near-duplicated text detection over large-scale data has become a severe challenge. This paper presents an algorithm based on MapReduce and ontology for near-duplicated text detection that computes pairwise document similarity in large-scale document collections. We map the words in each document to their synonyms and then calculate the similarity between documents. MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. In large-scale tests, experimental results demonstrate that this approach outperforms other state-of-the-art solutions. Advantages such as linear time and accuracy make the algorithm valuable in practice.
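The map and reduce roles described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' algorithm: the abstract does not name the similarity measure or the ontology, so the sketch assumes Jaccard similarity over synonym-normalized word sets, and the `SYNONYMS` table, `run_mapreduce` helper, and all function names are hypothetical stand-ins.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical synonym table standing in for the ontology lookup (assumption).
SYNONYMS = {"automobile": "car", "vehicle": "car", "quick": "fast", "rapid": "fast"}


def normalize(text):
    """Lowercase, split on whitespace, and map each word to a canonical synonym."""
    return {SYNONYMS.get(word, word) for word in text.lower().split()}


def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for a MapReduce job: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results


# Job 1: build an inverted index, then emit 1 for every document pair sharing a term.
def map_terms(doc_id, text):
    for term in normalize(text):
        yield term, doc_id


def reduce_pairs(term, doc_ids):
    for a, b in combinations(sorted(doc_ids), 2):
        yield (a, b), 1


# Job 2: sum the shared-term counts for each document pair.
def map_identity(pair, count):
    yield pair, count


def reduce_sum(pair, counts):
    yield pair, sum(counts)


def pairwise_jaccard(docs):
    """Return {(doc_a, doc_b): Jaccard similarity} for pairs sharing at least one term."""
    sizes = {doc_id: len(normalize(text)) for doc_id, text in docs.items()}
    pairs = run_mapreduce(docs.items(), map_terms, reduce_pairs)
    shared = run_mapreduce(pairs, map_identity, reduce_sum)
    return {(a, b): n / (sizes[a] + sizes[b] - n) for (a, b), n in shared}


docs = {
    "d1": "the quick automobile",
    "d2": "the rapid car",
    "d3": "the slow train",
}
similarities = pairwise_jaccard(docs)
```

Here `d1` and `d2` share no surface words other than "the", yet become identical sets after synonym mapping, which is the effect the abstract attributes to the ontology step. Note that emitting every pair per term can grow quadratically for very common terms, so this sketch makes no claim about the running time reported in the paper.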
Pages:
2618-2621
Online since:
September 2013
Copyright:
© 2013 Trans Tech Publications Ltd. All Rights Reserved