Feedback Clustering Algorithm for Detecting Approximately Duplicate Records

Abstract:

Detecting and merging approximately duplicate records is not a new problem in data cleansing, and most duplicate-detection methods are based on the "sort-merge" approach. Although clustering methods have also been applied to data cleaning, the resulting clusters contain a large number of non-duplicate records as the number of records grows. To address this shortcoming, this paper presents a data cleansing method based on a Clustering Feedback Pattern: the comparison results obtained from clustering are fed back into the clustering process so that recall and precision improve.
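The abstract does not give the algorithm's details, so the following is only a minimal sketch of what a clustering-with-feedback loop could look like, not the authors' method. It assumes records are plain strings, uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure, and introduces hypothetical names and parameters (`similarity`, `cluster`, `feedback_cluster`, `loose`, `strict`, `step`). Records are first clustered with a loose threshold; each cluster member is then compared with its representative under a strict threshold, and if any non-duplicate is found, that result is fed back by tightening the clustering threshold and clustering again.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for any record matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(records, threshold):
    """Greedy single-pass clustering: a record joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one."""
    clusters = []
    for rec in records:
        for members in clusters:
            if similarity(rec, members[0]) >= threshold:
                members.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


def feedback_cluster(records, loose=0.5, strict=0.8, step=0.1):
    """Hypothetical feedback loop (an assumption, not the paper's algorithm):
    cluster loosely, verify members strictly, and tighten the clustering
    threshold whenever verification finds non-duplicates inside a cluster."""
    threshold = loose
    while True:
        clusters = cluster(records, threshold)
        # Detailed comparison inside each cluster: count members that are
        # not strict matches of the cluster representative.
        false_members = sum(
            1
            for members in clusters
            for rec in members[1:]
            if similarity(rec, members[0]) < strict
        )
        if false_members == 0 or threshold >= strict:
            return clusters
        # Feedback: the comparison result adjusts the next clustering pass.
        threshold = min(strict, threshold + step)


if __name__ == "__main__":
    sample = ["abcdefgh", "abcdexyz", "abcdefgh"]
    for group in feedback_cluster(sample):
        print(group)
```

Re-clustering everything after each tightening is the simplest design choice and keeps the sketch short; a production version would more likely re-cluster only the rejected records and use attribute-aware comparison instead of raw string similarity.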

Info:

Pages:

2138-2141

Online since:

August 2014

Copyright:

© 2014 Trans Tech Publications Ltd. All Rights Reserved

