Web Data Extraction with Hierarchical Clustering and Rich Features

Yong Quan Dong; Xiang Jun Zhao; Gong Jie Zhang

doi:10.4028/www.scientific.net/AMM.55-57.1003

Abstract:

A novel approach is proposed to automatically extract data records from detail pages using hierarchical clustering techniques. The approach uses information from the listing pages to identify the content blocks in detail pages, which narrows the scope of Web data extraction. It then uses both structure and content features to cluster content feature vectors. Finally, it aligns the data elements of multiple detail pages to extract the data records. Experimental results on test beds of real web pages show that the approach achieves high extraction accuracy and substantially outperforms existing techniques.
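The clustering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the feature set, the similarity measure, or the linkage criterion, so everything below is an assumption chosen for illustration. Here the structure feature is a hypothetical tag path, the content feature is the set of lowercase tokens, the two are mixed with an assumed weight `w_structure`, and clusters are merged by average linkage until the best similarity falls below an assumed `threshold`.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(x, y, w_structure=0.5):
    """Assumed similarity: weighted mix of structure and content features.

    Structure: 1.0 if the two elements share the same tag path, else 0.0.
    Content: Jaccard overlap of lowercase whitespace-separated tokens.
    """
    structure = 1.0 if x["path"] == y["path"] else 0.0
    content = jaccard(set(x["text"].lower().split()),
                      set(y["text"].lower().split()))
    return w_structure * structure + (1 - w_structure) * content


def average_link(c1, c2):
    """Average pairwise similarity between the members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def cluster(elements, threshold=0.5):
    """Agglomerative clustering: merge the most similar pair of clusters
    until no pair reaches the threshold."""
    clusters = [[e] for e in elements]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), average_link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical data elements taken from two detail pages of the same site.
elements = [
    {"path": "html/body/div/h1", "text": "Title: Data Mining"},
    {"path": "html/body/div/h1", "text": "Title: Web Search"},
    {"path": "html/body/div/span", "text": "Price: $30"},
    {"path": "html/body/div/span", "text": "Price: $45"},
]
clusters = cluster(elements)
```

With these made-up inputs, the two title elements end up in one cluster and the two price elements in another, which is the grouping an alignment step across detail pages would then work from. A production system would likely use richer features and a library routine for the clustering itself.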

Info:

Periodical:

Applied Mechanics and Materials (Volumes 55-57)

Pages:

1003-1008

Online since:

May 2011

Keywords:

Data Extraction, Deep Web, Feature, Hierarchical Clustering
