Web Data Extraction Based on Tag Path Clustering

Abstract:

Fully automatic methods for extracting structured data from the Web have been studied extensively. Existing methods suffice for simple extraction, but they often fail on more complicated Web pages. This paper introduces a method for extracting structured data based on tag path clustering. The method obtains the complete collection of tag paths by parsing the DOM tree of the Web document. The tag paths are then clustered using a newly introduced similarity measure, which locates the data area. Features of tag position are then used to separate and filter the records, completing the data extraction. Experiments show that this method achieves higher accuracy than previous methods.
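The first step described in the abstract, collecting tag paths from the parsed document, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not define its similarity measure, so the sketch simply groups text nodes that share an identical tag path as a stand-in for clustering. The names `TagPathCollector`, `group_by_tag_path`, and `SAMPLE_HTML` are hypothetical, and the sketch uses Python's standard `html.parser` module rather than a full DOM library.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Record the root-to-node tag path of every text node, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.paths = []   # (tag path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def group_by_tag_path(html):
    """Group text nodes by identical tag path.

    Assumption: exact path equality replaces the paper's similarity measure,
    which the abstract does not specify.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    groups = defaultdict(list)
    for position, (path, text) in enumerate(collector.paths):
        groups[path].append((position, text))
    return dict(groups)


# Hypothetical sample page with three repeated records.
SAMPLE_HTML = """
<html><body>
  <h1>Books</h1>
  <ul>
    <li><b>Title A</b><i>10.00</i></li>
    <li><b>Title B</b><i>12.50</i></li>
    <li><b>Title C</b><i>8.75</i></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    for path, occurrences in group_by_tag_path(SAMPLE_HTML).items():
        print(path, occurrences)
```

In this sample, paths that occur repeatedly (such as `html/body/ul/li/b`) hint at a data area, while a path that occurs once (such as `html/body/h1`) does not. The sketch keeps each occurrence's document position because the abstract says record separation relies on tag position; how the paper actually uses those positions is not stated there.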

Info:

Periodical:

Advanced Materials Research (Volumes 756-759)

Pages:

1590-1594

Online since:

September 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved
