Web Data Extraction Based on Tag Path Clustering
Abstract:
Fully automatic methods for extracting structured data from the Web have been studied extensively. Existing methods suffice for simple extraction but often fail on more complicated Web pages. This paper introduces a method based on tag path clustering to extract structured data. The method obtains the complete collection of tag paths by parsing the DOM tree of the Web document. The tag paths are then clustered using an introduced similarity measure, which locates the data area. Tag position features are then used to separate and filter the records, completing the data extraction. Experiments show that this method achieves higher accuracy than previous methods.
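The first two steps described in the abstract (collecting tag paths from the parsed document, then clustering them by similarity) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the similarity measure, so `path_similarity` below substitutes a generic sequence-matching ratio from the standard library, and the greedy clustering, the `threshold` value, and all class and function names are assumptions made for illustration. Record separation and filtering by tag position are not covered.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathCollector(HTMLParser):
    """Record the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tags that are currently open
        self.paths = []  # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append((tuple(self.stack), text))


def path_similarity(a, b):
    """Stand-in similarity (assumption): share of tags the two paths have in common, in order."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_paths(paths, threshold=0.9):
    """Greedy one-pass clustering: join the first cluster whose
    representative path is similar enough, else start a new cluster."""
    clusters = []  # list of (representative path, [texts])
    for path, text in paths:
        for representative, members in clusters:
            if path_similarity(representative, path) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((path, [text]))
    return clusters
```

To use the sketch, feed an HTML string to a `TagPathCollector` instance with `feed()` and pass its `paths` attribute to `cluster_paths`. Clusters with many members are candidates for the repeated data area, while clusters with a single member tend to be page furniture such as headers and footers.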
Info:
Pages:
1590-1594
Online since:
September 2013
Copyright:
© 2013 Trans Tech Publications Ltd. All Rights Reserved