Automatic Batch Extraction of Specific Content of HTML Based on Tag Locations

Abstract:

HTML is widely used to describe and present information on the web. Although new technologies have continued to appear throughout the history of HTML, its basic structure and principles remain the same, and HTML is still an important part of tasks such as web development and even dynamic page presentation. There are currently two main types of HTML parser, SAX and DOM. The former is driven by parsing events but can only access nodes sequentially, which is slow; the latter must load the whole document into memory and therefore consumes a lot of space. To address this problem, we propose an automatic batch extraction method for specific HTML content based on tag locations. The extraction process is divided into two main steps: the first locates the start and end positions of the HTML tags, and the second finds the desired content based on the tag locations and the corresponding attribute information. The first step is the core of the whole process. An example that extracts specific content from a search result page verifies the proposed algorithm. The proposed algorithm can be further used for advanced tasks such as data mining and knowledge base construction.
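As a rough illustration of the two steps named in the abstract, the Python sketch below first records the start and end offsets of every paired tag in a single pass, then slices out content by tag name and attribute values. It is not the authors' implementation, which the abstract does not detail. The names `TagSpan`, `locate_tags`, and `extract_content`, the regular expressions, and the sample page are all assumptions introduced here. The sketch only handles double-quoted attributes and assumes no `<` or `>` inside attribute values, comments, or scripts.

```python
import re
from typing import Dict, List, NamedTuple


class TagSpan(NamedTuple):
    """Offsets of one paired element (hypothetical structure for this sketch)."""
    name: str
    attrs: Dict[str, str]
    open_start: int   # offset of '<' in the opening tag
    open_end: int     # offset just past '>' of the opening tag
    close_start: int  # offset of '<' in the matching closing tag
    close_end: int    # offset just past '>' of the closing tag


TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>")
ATTR_RE = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')  # double-quoted values only
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def locate_tags(html: str) -> List[TagSpan]:
    """Step 1: find the start and end positions of every paired tag."""
    spans: List[TagSpan] = []
    stack = []  # opening tags still waiting for their closing tag
    for m in TAG_RE.finditer(html):
        is_close = m.group(1) == "/"
        name = m.group(2).lower()
        attr_text = m.group(3)
        if is_close:
            # Walk down the stack to the nearest opening tag of the same name;
            # unclosed tags above it are discarded, stray closers are ignored.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == name:
                    _, attrs, start, end = stack[i]
                    del stack[i:]
                    spans.append(TagSpan(name, attrs, start, end,
                                         m.start(), m.end()))
                    break
        elif name not in VOID_TAGS and not attr_text.rstrip().endswith("/"):
            attrs = dict(ATTR_RE.findall(attr_text))
            stack.append((name, attrs, m.start(), m.end()))
    spans.sort(key=lambda t: t.open_start)  # document order
    return spans


def extract_content(html: str, spans: List[TagSpan], name: str,
                    required: Dict[str, str]) -> List[str]:
    """Step 2: slice the inner content of tags matching name and attributes."""
    return [html[t.open_end:t.close_start]
            for t in spans
            if t.name == name
            and all(t.attrs.get(k) == v for k, v in required.items())]


# Made-up stand-in for a search result page.
page = (
    '<html><body>'
    '<div class="result"><a href="http://example.com/1">First hit</a>'
    '<br/>snippet one</div>'
    '<div class="ad"><a href="http://example.com/ad">Sponsored</a></div>'
    '<div class="result"><a href="http://example.com/2">Second hit</a></div>'
    '</body></html>'
)

spans = locate_tags(page)
results = [t for t in spans
           if t.name == "div" and t.attrs.get("class") == "result"]
# Keep only the links whose offsets fall inside a result block.
links = [(t.attrs["href"], page[t.open_end:t.close_start])
         for r in results
         for t in spans
         if t.name == "a"
         and r.open_end <= t.open_start and t.close_end <= r.close_start]
print(links)
```

Because the offsets are computed once, many different queries (by tag name, attribute, or containment within another tag's range) can be answered by slicing the original string, without building a node tree. That is one plausible reading of why the abstract calls the location step the core of the process.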

Info:

Pages:

3826-3830

Online since:

August 2014

Copyright:

© 2014 Trans Tech Publications Ltd. All Rights Reserved
