Web Data Extraction with Hierarchical Clustering and Rich Features

Abstract:

A novel approach is proposed for automatically extracting data records from detail pages using hierarchical clustering. The approach uses information from the listing pages to identify the content blocks in detail pages, which narrows the scope of Web data extraction. It then uses both structure and content features to cluster content feature vectors. Finally, it aligns the data elements of multiple detail pages to extract the data records. Experimental results on test beds of real Web pages show that the approach achieves high extraction accuracy and substantially outperforms existing techniques.
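The clustering step can be illustrated with a minimal sketch. The abstract does not specify the feature set, distance measure, linkage rule, or stopping criterion, so everything below is an assumption made for illustration: text nodes are described by a hypothetical tag path (structure) plus their words (content), compared with cosine distance, and merged bottom-up with average linkage until the closest pair of clusters is farther apart than an arbitrary threshold. The names features, cosine_distance, average_linkage, and cluster are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def features(tag_path, text):
    """Build one sparse vector from structure (tags) and content (words)."""
    vec = Counter()
    for tag in tag_path.split("/"):
        if tag:
            vec["tag:" + tag] += 1
    for word in text.lower().split():
        vec["word:" + word] += 1
    return vec

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def cluster(vectors, threshold):
    """Agglomerative clustering: merge the closest pair until it exceeds threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], vectors)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > threshold:  # assumed stopping rule, not from the paper
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical text nodes taken from two detail pages of the same site.
nodes = [
    ("html/body/div/h1", "Canon EOS 600D"),
    ("html/body/div/h1", "Nikon D5100"),
    ("html/body/div/span", "Price 599 USD"),
    ("html/body/div/span", "Price 649 USD"),
]
vectors = [features(path, text) for path, text in nodes]
result = cluster(vectors, threshold=0.45)  # threshold chosen arbitrarily
print(result)
```

With these four sample nodes, the two title nodes end up in one cluster and the two price nodes in another, which is the kind of grouping that a later alignment step across detail pages could use. A real system would likely need richer features and a tuned threshold.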

Pages:

1003-1008

Online since:

May 2011

Copyright:

© 2011 Trans Tech Publications Ltd. All Rights Reserved
