Web Data Extraction Based on Tag Path Clustering

Gui Li; Cheng Chen; Zheng Yu Li; Zi Yang Han; Ping Sun

doi:10.4028/www.scientific.net/AMR.756-759.1590



Abstract:

Fully automatic methods for extracting structured data from the Web have been studied extensively. Existing methods suffice for simple extraction but often fail on more complicated Web pages. This paper introduces a method based on tag path clustering to extract structured data. The method first obtains the complete collection of tag paths by parsing the DOM tree of the Web document. It then clusters the tag paths using a newly introduced similarity measure, which locates the data area. Finally, it uses tag position features to separate and filter the records, completing the extraction. Experiments show that this method achieves higher accuracy than previous methods.
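
The first two steps described in the abstract (collecting tag paths from the parsed document, then grouping them to find repeated structure) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, so grouping by exact path equality is used here as a stand-in assumption, and the names `TagPathCollector`, `collect_tag_paths`, and `repeated_paths` are hypothetical. Record separation and filtering by tag position are not covered.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records the root-to-node tag path of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.paths = []   # one tag path per element encountered

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def collect_tag_paths(html):
    """Return the tag path of every element in the document, in order."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def repeated_paths(paths, min_count=2):
    """Group identical paths and keep those seen at least min_count times.

    Exact-match grouping is an assumption standing in for the paper's
    similarity measure, which the abstract does not specify.
    """
    positions = defaultdict(list)
    for index, path in enumerate(paths):
        positions[path].append(index)
    return {p: idx for p, idx in positions.items() if len(idx) >= min_count}
```

Calling `repeated_paths(collect_tag_paths(page))` on a page containing a list of similar items returns the paths that recur, together with the positions at which they occur; those recurring paths are the candidates for a data area in this simplified setting.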


Info:

Periodical:

Advanced Materials Research (Volumes 756-759)

Pages:

1590-1594

DOI:

https://doi.org/10.4028/www.scientific.net/AMR.756-759.1590


Online since:

September 2013

Authors:

Gui Li, Cheng Chen, Zheng Yu Li, Zi Yang Han, Ping Sun

Keywords:

Clustering, Extracting Structured Data, Tag Path

