A Method of Web Information Extraction Based on Building Different Sub Trees

Abstract:

When extracting Web information, most existing approaches mix the structural labels of the DOM tree with the text content. To address this problem, we propose a method for automatic Web information extraction. First, we partition the DOM tree of a Web page to obtain a set of DOM sub trees. Second, we assign a weight to every node of each DOM sub tree using the method proposed in this paper. Using these weights, we obtain the set of different sub trees by comparing DOM sub trees that come from two pages of the same data source and belong to the same category. Third, we locate the data zone, which contains the information to be extracted, by computing the similarity of every pair of DOM sub trees in the set of different sub trees. Finally, the node path of every DOM sub tree in the data zone is taken as an extraction rule, and these rules are used to automatically extract information from new Web pages of the same category. Experiments show that the method achieves higher precision and recall rates. The method also reduces the time users spend filtering information.
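The pipeline in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the preview does not give the node weighting or the sub tree similarity measure, so the sketch replaces both with a plain text comparison between two pages that share a template. The names `Node`, `TreeBuilder`, `learn_rules`, and `extract`, as well as the sample pages, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def path(self):
        """Return a node path such as document/html[1]/body[1]/div[2]."""
        if self.parent is None:
            return self.tag
        same = [c for c in self.parent.children if c.tag == self.tag]
        index = 1 + next(i for i, c in enumerate(same) if c is self)
        return f"{self.parent.path()}/{self.tag}[{index}]"


class TreeBuilder(HTMLParser):
    """Build a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def leaf_texts(root):
    """Map node path -> text for every node that directly holds text."""
    found = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.text:
            found[node.path()] = node.text
        stack.extend(node.children)
    return found


def learn_rules(page_a, page_b):
    """Assumption: paths whose text differs across two same-template pages
    mark the data zone; identical text is treated as template content."""
    a, b = leaf_texts(parse(page_a)), leaf_texts(parse(page_b))
    return sorted(p for p in a if p in b and a[p] != b[p])


def extract(page, rules):
    """Apply the learned node paths to a new page of the same category."""
    texts = leaf_texts(parse(page))
    return {p: texts[p] for p in rules if p in texts}


TEMPLATE = ("<html><body><div>Site Menu</div>"
            "<div><h1>{title}</h1><p>{price}</p></div></body></html>")
page_a = TEMPLATE.format(title="Title A", price="Price 10")
page_b = TEMPLATE.format(title="Title B", price="Price 20")
page_c = TEMPLATE.format(title="Title C", price="Price 30")

rules = learn_rules(page_a, page_b)
print(rules)
print(extract(page_c, rules))
```

In this sketch the shared "Site Menu" block is discarded because its text is identical on both training pages, while the heading and price paths survive as extraction rules. A faithful implementation would compare weighted sub trees rather than individual text nodes.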

Info:

Periodical:

Advanced Materials Research (Volumes 694-697)

Pages:

2513-2521

Online since:

May 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved
