Web Data Extraction with Hierarchical Clustering and Rich Features
Abstract:
A novel approach is proposed to automatically extract data records from detail pages using hierarchical clustering techniques. The approach uses information from the listing pages to identify the content blocks in detail pages, which narrows the scope of Web data extraction. It also makes full use of structure and content features to cluster content feature vectors. Finally, it aligns the data elements of multiple detail pages to extract the data records. Experimental results on test beds of real Web pages show that the approach achieves high extraction accuracy and substantially outperforms existing techniques.
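The clustering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature set (tag-path depth, text length, digit ratio), the average-linkage rule, the distance threshold, and the sample elements are all assumptions chosen only to show how structure and content features might be combined into vectors and grouped hierarchically.

```python
from itertools import combinations


def features(tag_path, text):
    """Hypothetical feature vector: one structure feature (tag-path depth)
    and two content features (text length, share of digit characters)."""
    digits = sum(ch.isdigit() for ch in text)
    return (
        float(tag_path.count("/")),
        float(len(text)),
        digits / len(text) if text else 0.0,
    )


def distance(a, b):
    """Euclidean distance between two feature vectors (features are unscaled)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(vectors, threshold):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold`. Returns sorted lists of indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


# Made-up data elements taken from two imaginary detail pages.
elements = [
    ("/html/body/div/h1", "Canon EOS 600D"),
    ("/html/body/div/h1", "Nikon D5100"),
    ("/html/body/div/span", "$549.00"),
    ("/html/body/div/span", "$499.00"),
]
clusters = agglomerate([features(path, text) for path, text in elements], threshold=3.5)
print(clusters)
```

With these sample elements, the two product titles and the two prices end up in separate clusters, which is the kind of grouping an alignment step across detail pages could then build on. A real system would likely normalize the features and use a richer set than the three assumed here.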
Info:
Periodical:
Pages: 1003-1008
Online since: May 2011
Authors:
Keywords:
Copyright: © 2011 Trans Tech Publications Ltd. All Rights Reserved