A Practical Approach for Scalable Record Linkage on Hadoop

Abstract:

As ever more data are collected across many applications, record linkage must now handle millions of records. Traditional methods face a major performance challenge when dealing with data at this scale. Parallel computing frameworks such as MapReduce have become an efficient and practical way to address this problem. In this paper, we propose a practical three-phase MapReduce approach that performs blocking, filtering, and linking in three consecutive processes on a Hadoop cluster. Experiments show that our approach runs efficiently and effectively while maintaining high recall compared with the traditional method.
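
The abstract gives no implementation detail, so the following is only a minimal single-process sketch of what a blocking, filtering, and linking pipeline can look like. The blocking key (first two characters of the name), the token-count filter, the Jaccard similarity, and the 0.5 threshold are all illustrative assumptions rather than the authors' method, and plain Python functions stand in for the MapReduce jobs that would run on a Hadoop cluster.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercased whitespace tokens of a record's name field."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two names (assumed similarity measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def blocking_phase(records):
    """Phase 1: group records by a blocking key (map + shuffle analogue)."""
    blocks = defaultdict(list)
    for rec_id, name in records:
        key = name.lower()[:2]  # hypothetical blocking key
        blocks[key].append((rec_id, name))
    return blocks


def filtering_phase(blocks, max_len_diff=1):
    """Phase 2: enumerate pairs inside each block and drop unlikely ones
    with a cheap test (here: token counts must differ by at most one)."""
    for block in blocks.values():
        for (id_a, a), (id_b, b) in combinations(block, 2):
            if abs(len(tokens(a)) - len(tokens(b))) <= max_len_diff:
                yield (id_a, a), (id_b, b)


def linking_phase(candidates, threshold=0.5):
    """Phase 3: run the more expensive comparison on surviving pairs."""
    links = []
    for (id_a, a), (id_b, b) in candidates:
        if jaccard(a, b) >= threshold:
            links.append((id_a, id_b))
    return sorted(links)


def link_records(records):
    """Chain the three phases, as three consecutive jobs would be chained."""
    return linking_phase(filtering_phase(blocking_phase(records)))


# Made-up sample data, for illustration only.
records = [
    (1, "John Smith"),
    (2, "john smith jr"),
    (3, "Joan Baez"),
    (4, "Mary Jones"),
    (5, "mary jones"),
]

print(link_records(records))  # [(1, 2), (4, 5)]
```

Blocking confines comparisons to records that share a key, so only 4 of the 10 possible pairs are generated here; the filter is meant to discard cheap-to-reject pairs before the costlier similarity computation, although in this tiny sample all four pairs pass it.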

Info:

Periodical:

Advanced Materials Research (Volumes 753-755)

Pages:

3018-3024

DOI:

https://doi.org/10.4028/www.scientific.net/AMR.753-755.3018

Online since:

August 2013

Authors:

Fen Gyu Yang*, Ying Chen, Ye Zhang

Keywords:

Blocking, Filtering, Hadoop, MapReduce, Record Linkage

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved

* Corresponding author
