Deep Web Data Extraction Based on Regular Expression

Abstract:

Data extraction is an important issue in Deep Web data integration. To extract the query results of the Deep Web, the target data block must first be located correctly. Because the HTML source code of a web page can be parsed into a well-structured DOM tree, we propose an effective algorithm for discerning the common path based on the hierarchical DOM. Using the common path and our predefined regular expression, the target data of the Deep Web can be extracted effectively. Experimental results on real websites show that the proposed algorithm is highly effective.
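The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that the "common path" can be approximated by the root-to-node tag path shared by the largest number of text nodes, and that each record matches a single predefined pattern. The sample page, the `PathCollector` class, the `extract_records` function, and the `RECORD_RE` pattern are all hypothetical names introduced for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical sample standing in for a Deep Web result page.
SAMPLE_HTML = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <table>
    <tr><td>Book A - $12.50</td></tr>
    <tr><td>Book B - $8.99</td></tr>
    <tr><td>Book C - $20.00</td></tr>
  </table>
  <div id="footer"><p>Contact us</p></div>
</body></html>
"""

# Elements that never have a closing tag; they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.texts = []   # list of (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(("/".join(self.stack), text))


def extract_records(html, pattern):
    """Find the most frequent text path, then apply the regex along it."""
    collector = PathCollector()
    collector.feed(html)
    counts = Counter(path for path, _ in collector.texts)
    if not counts:
        return None, []
    # Assumption: the path shared by the most text nodes is the data block.
    common_path, _ = counts.most_common(1)[0]
    records = []
    for path, text in collector.texts:
        if path == common_path:
            match = pattern.search(text)
            if match:
                records.append(match.groupdict())
    return common_path, records


# Hypothetical predefined pattern: "<title> - $<price>".
RECORD_RE = re.compile(r"(?P<title>.+?)\s*-\s*\$(?P<price>\d+\.\d{2})")

common_path, records = extract_records(SAMPLE_HTML, RECORD_RE)
print(common_path)
for record in records:
    print(record)
```

Counting identical tag paths is a deliberately crude stand-in for locating the target data block: navigation and footer text sit on paths that occur once, while repeated result rows share one path. A real result page would likely need attribute-aware paths and a pattern per field.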

Info:

Periodical:

Advanced Materials Research (Volumes 718-720)

Pages:

2242-2247

Online since:

July 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved
