A Research on MapReduce-Based Redundancy Pruning for Record Linkage

Abstract:

To improve the efficiency of record linkage while keeping recall high, multiple-signature techniques, which assign an object to several clusters, have been applied in many domains. This leads to redundant comparisons of the same pair of source and target objects. Based on the MapReduce model, we propose a redundancy pruning approach that prunes redundant pairs before the final similarity computation. Our approach is implemented as two consecutive MapReduce phases and is evaluated on two practical datasets, where it shows good pruning ability.
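The abstract does not spell out the two phases, so the following is only a minimal single-process sketch of how redundancy across signature clusters could be pruned in two map/reduce-style steps. It is not the authors' implementation: the `signatures` function (one signature per name token), the record layout, and the helper names `phase1_candidate_pairs` and `phase2_prune` are all hypothetical, and no Hadoop API is used.

```python
from collections import defaultdict
from itertools import product


def signatures(record):
    # Hypothetical signature function (an assumption for illustration):
    # one signature per lowercase token of the record's name.
    return set(record["name"].lower().split())


def phase1_candidate_pairs(sources, targets):
    # Map step: assign every record to one cluster per signature.
    # Reduce step: inside each cluster, emit all source-target pairs.
    clusters = defaultdict(lambda: ([], []))
    for rid, rec in sources.items():
        for sig in signatures(rec):
            clusters[sig][0].append(rid)
    for rid, rec in targets.items():
        for sig in signatures(rec):
            clusters[sig][1].append(rid)
    for sig in sorted(clusters):
        src_ids, tgt_ids = clusters[sig]
        for pair in product(src_ids, tgt_ids):
            yield pair, sig


def phase2_prune(candidates):
    # Map step: key every candidate by its (source id, target id) pair.
    # Reduce step: keep each pair once, however many clusters produced it.
    grouped = defaultdict(list)
    for pair, sig in candidates:
        grouped[pair].append(sig)
    return sorted(grouped)


# Toy data, invented for this sketch.
sources = {"s1": {"name": "John Smith"}, "s2": {"name": "Mary Jones"}}
targets = {"t1": {"name": "john smith"}, "t2": {"name": "Mary Smith"}}

candidates = list(phase1_candidate_pairs(sources, targets))
pruned = phase2_prune(candidates)
print(len(candidates), "candidate pairs before pruning")
print(len(pruned), "pairs left for similarity computation:", pruned)
```

In this toy input the pair (s1, t1) shares two signatures and would otherwise be compared twice; after the second step only distinct pairs remain for the final similarity computation.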

Info:

Periodical:

Advanced Materials Research (Volumes 753-755)

Pages:

3009-3013

Online since:

August 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved
