An Adaptive Web Information Extraction Approach Based on STU-DOM Tree

Abstract:

An adaptive web information extraction approach is presented in this paper. Most traditional web information extraction approaches depend on the templates of web sites: if a template changes, the extraction rules must be redesigned. To reduce maintenance costs and improve the adaptability of information extractors, an adaptive approach based on the STU-DOM tree is proposed. Each webpage is parsed into a DOM tree using HTML Parser. The DOM trees are then filtered into STU-DOM trees to identify the blocks that contain keywords of a given topic. The approach is applied to webpages, and the results show that it extracts information efficiently and is independent of site structure.
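
The abstract does not define how a DOM tree is filtered into an STU-DOM tree, so the following is only a minimal sketch of the general idea it describes: parse a page into a tree, then keep the blocks whose text mentions a topic keyword. It assumes Python's standard `html.parser` module in place of the HTML Parser library named above, and the names `Node`, `TreeBuilder`, `parse_html`, and `keyword_blocks` are hypothetical, introduced here for illustration rather than taken from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def full_text(self):
        # Text of this node followed by the text of all descendants.
        return " ".join([self.text] + [c.full_text() for c in self.children])


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML, tolerating unbalanced markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def keyword_blocks(node, keywords):
    """Return the deepest nodes whose subtree text mentions any keyword.

    This is an assumed filtering rule, not the paper's STU-DOM definition.
    """
    text = node.full_text().lower()
    if not any(k.lower() in text for k in keywords):
        return []  # prune: nothing under this node is on topic
    hits = []
    for child in node.children:
        hits.extend(keyword_blocks(child, keywords))
    return hits or [node]


page = (
    "<html><body><div>Home | About</div>"
    "<div><h2>Forecast</h2><p>Rain expected in Beijing today.</p></div>"
    "</body></html>"
)
for block in keyword_blocks(parse_html(page), ["rain"]):
    print(block.tag, "|", block.full_text().strip())
```

Because the filter inspects only text content and tree shape, not site-specific tag paths or class names, a sketch like this does not need to be rewritten when a site changes its template, which is the kind of adaptability the abstract claims for the proposed approach.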

Pages: 1972-1978

Online since: September 2013

Copyright: © 2013 Trans Tech Publications Ltd. All Rights Reserved
