Website Information Extraction Based on DOM-Model

Abstract:

With the rapid development of network technology and the spread of its applications, the web has become the main platform for publishing and accessing information. How to obtain the information a user needs from this vast source is a current research focus. This paper presents a DOM-based method for extracting website information that improves search efficiency by preserving only the theme information and filtering out the noise information that users are not interested in.
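The abstract does not spell out the extraction algorithm, so the following is only a minimal sketch of the general idea of DOM-based noise filtering, not the paper's method. It builds a small DOM-like tree with Python's standard `html.parser`, drops subtrees rooted at tags that usually hold noise, and drops container blocks whose text is mostly link text. The tag sets, the `MAX_LINK_DENSITY` threshold, and all class and function names are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical settings for this sketch; none of them come from the paper.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "form"}
CONTAINER_TAGS = {"div", "ul", "ol", "table", "section"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
MAX_LINK_DENSITY = 0.5  # share of characters allowed inside <a> elements


class Node:
    """One element of the simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # mix of Node objects and text strings, in order


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def text_stats(node, inside_link=False):
    """Return (total characters, characters inside <a>) for a subtree."""
    inside_link = inside_link or node.tag == "a"
    total = linked = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            linked += len(child) if inside_link else 0
        else:
            sub_total, sub_linked = text_stats(child, inside_link)
            total += sub_total
            linked += sub_linked
    return total, linked


def collect_text(node, out):
    """Append the text of non-noise nodes to out, in document order."""
    if node.tag in NOISE_TAGS:
        return
    if node.tag in CONTAINER_TAGS:
        total, linked = text_stats(node)
        if total and linked / total > MAX_LINK_DENSITY:
            return  # mostly links: treated as navigation or advertising
    for child in node.children:
        if isinstance(child, str):
            out.append(child)
        else:
            collect_text(child, out)


def extract_main_text(html):
    """Parse html and return the text that survives noise filtering."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    out = []
    collect_text(builder.root, out)
    return " ".join(out)
```

Calling `extract_main_text(page_source)` on a page returns the remaining text as one string. A tag list plus a link-density threshold is a common baseline heuristic; a real system would likely need site-specific tuning or the richer DOM features the paper refers to.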

Info:

Pages:

2889-2893

Online since:

August 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved
