A Practical Approach for Scalable Record Linkage on Hadoop

Abstract:

As more and more data are collected in many applications, record linkage must now handle millions of records. Traditional methods face a serious performance challenge when dealing with data at this scale. Parallel computing frameworks such as MapReduce have become an efficient and practical way to address this problem. In this paper, we propose a practical 3-phase MapReduce approach that performs blocking, filtering, and linking in three consecutive processes on a Hadoop cluster. Experiments show that our approach runs efficiently and effectively while maintaining high recall compared with the traditional method.
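To make the three phases concrete, the following is a minimal in-memory sketch of a blocking, filtering, and linking pipeline written as three chained map/reduce steps. It is not the authors' implementation: the abstract does not specify the blocking key, the filter, or the matcher, so the name-prefix key, the length-difference filter, the `difflib` similarity, the `THRESHOLD` value, and all function names below are assumptions chosen only for illustration. The `run_mapreduce` helper merely imitates the map, shuffle, and reduce stages of a Hadoop job on a single machine.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Phase 1: blocking. Assumed key: first three letters of the name.
def block_map(record):
    rec_id, name = record
    yield name.lower().replace(" ", "")[:3], (rec_id, name)


def block_reduce(key, members):
    # Emit every candidate pair of records that share a block.
    for a, b in combinations(sorted(members), 2):
        yield (a, b)


# Phase 2: filtering. Assumed cheap test: the names have similar length.
def filter_map(pair):
    (id_a, name_a), (id_b, name_b) = pair
    if abs(len(name_a) - len(name_b)) <= 3:
        yield (id_a, id_b), (name_a, name_b)


def filter_reduce(key, values):
    # The same pair could arrive more than once; keep a single copy.
    yield key, values[0]


# Phase 3: linking. Assumed matcher: difflib ratio above a threshold.
THRESHOLD = 0.85


def link_map(item):
    (id_a, id_b), (name_a, name_b) = item
    score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    if score >= THRESHOLD:
        yield (id_a, id_b), score


def link_reduce(key, scores):
    yield key, max(scores)


def link_records(records):
    """Chain the three phases; each consumes the previous phase's output."""
    candidates = run_mapreduce(records, block_map, block_reduce)
    survivors = run_mapreduce(candidates, filter_map, filter_reduce)
    return run_mapreduce(survivors, link_map, link_reduce)


records = [
    (1, "Jonathan Smith"),
    (2, "Jonathon Smith"),
    (3, "Jon Smyth"),
    (4, "Maria Garcia"),
    (5, "Maria Garcia Lopez"),
]
matches = link_records(records)
print(matches)  # only the pair (1, 2) is linked
```

In this toy run, blocking produces four candidate pairs instead of the ten an all-pairs comparison would need, and the length filter discards three of them before the more expensive similarity computation. The sketch also shows the usual trade-off: a crude filter such as the one assumed here can discard true matches (for example records 1 and 3), which is why recall has to be measured against a traditional baseline.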

Info:

Periodical:

Advanced Materials Research (Volumes 753-755)

Pages:

3018-3024

Online since:

August 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved
