Web Data Extraction with Hierarchical Clustering and Rich Features

Abstract:

A novel approach is proposed to automatically extract data records from detail pages using hierarchical clustering. The approach uses information from the listing pages to identify the content blocks in detail pages, which narrows the scope of Web data extraction. It also exploits both structure and content features to cluster content feature vectors. Finally, it aligns the data elements of multiple detail pages to extract the data records. Experimental results on test beds of real Web pages show that the approach achieves high extraction accuracy and substantially outperforms existing techniques.
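
The clustering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's feature set, distance measure, or linkage criterion, so everything below is an assumption: the three-component feature vectors (tag depth, relative text length, digit ratio), the Euclidean distance, the average-linkage rule, the stopping threshold, and the function names are all hypothetical and chosen only to show how agglomerative clustering could group similar content blocks.

```python
from itertools import combinations
from math import sqrt


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_cluster(vectors, threshold):
    """Agglomerative clustering: repeatedly merge the two closest clusters
    and stop once the closest pair is farther apart than `threshold`.
    Returns clusters as lists of indices into `vectors`."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average-linkage distance.
        (a, b), best = min(
            (((a, b), average_linkage(clusters[a], clusters[b], vectors))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        # a < b always holds, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical feature vectors: [tag depth, relative text length, digit ratio].
example_vectors = [
    [3, 0.10, 0.0],   # short label-like block
    [3, 0.12, 0.0],   # another label-like block
    [5, 0.80, 0.9],   # long, mostly numeric block
    [5, 0.78, 1.0],   # another numeric block
]
example_clusters = hierarchical_cluster(example_vectors, threshold=1.0)
```

In this toy input the two label-like blocks and the two numeric blocks end up in separate clusters. A real system would need to scale the features so that no single component (here, tag depth) dominates the distance.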

Info:

Periodical:

Applied Mechanics and Materials, Vols. 55-57

Edited by:

Qi Luo

Pages:

1003-1008

DOI:

10.4028/www.scientific.net/AMM.55-57.1003

Citation:

Y. Q. Dong et al., "Web Data Extraction with Hierarchical Clustering and Rich Features", Applied Mechanics and Materials, Vols. 55-57, pp. 1003-1008, 2011

Online since:

May 2011
