Web Data Extraction with Hierarchical Clustering and Rich Features
A novel approach is proposed to automatically extract data records from detail pages using hierarchical clustering. The approach uses information from the listing pages to identify the content blocks in detail pages, which narrows the scope of Web data extraction. It also uses both structure and content features to cluster content feature vectors. Finally, it aligns the data elements of multiple detail pages to extract the data records. Experimental results on test beds of real web pages show that the approach achieves high extraction accuracy and substantially outperforms existing techniques.
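The clustering step can be illustrated with a minimal sketch. The abstract does not specify the feature set, the distance measure, or the linkage criterion, so everything below is an assumption chosen for illustration: the three features (DOM depth, normalized text length, digit ratio), Euclidean distance, average linkage, and the names `agglomerative_cluster` and `example_vectors` are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_cluster(vectors, threshold):
    """Bottom-up clustering: repeatedly merge the two closest clusters
    until the closest pair is farther apart than `threshold`.
    Returns a list of clusters, each a list of indices into `vectors`."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        # Find the closest pair of clusters (a < b by construction).
        (a, b), dist = min(
            (
                ((i, j), average_linkage(clusters[i], clusters[j], vectors))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical feature vectors: [DOM depth, normalized text length, digit ratio].
example_vectors = [
    [3, 0.10, 0.0],  # title-like element on detail page 1
    [3, 0.12, 0.0],  # title-like element on detail page 2
    [5, 0.02, 0.9],  # price-like element on detail page 1
    [5, 0.03, 0.8],  # price-like element on detail page 2
]

clusters = agglomerative_cluster(example_vectors, threshold=0.5)
print(clusters)  # [[0, 1], [2, 3]]
```

In this toy input, elements that play the same role on different detail pages end up in the same cluster, which is the precondition for aligning data elements across pages. The threshold of 0.5 is arbitrary and would need tuning on real pages.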
Y. Q. Dong et al., "Web Data Extraction with Hierarchical Clustering and Rich Features", Applied Mechanics and Materials, Vols. 55-57, pp. 1003-1008, 2011