A Method of Web Information Extraction Based on Building Different Sub Trees

Yuan Long Wang; Hong Jiang; Zhao Hong Bing; Li Zhang

doi:10.4028/www.scientific.net/AMR.694-697.2513

Abstract:

When extracting Web information, most existing methods mix the structural tags of the DOM Tree with the text content. To solve this problem, we propose a method for automatic Web information extraction. First, we obtain a set of DOM sub trees by partitioning the DOM Tree of the Web page. Second, we assign weights to the nodes of every DOM sub tree using the method proposed in this paper. On this basis, we obtain the set of different sub trees by comparing the DOM sub trees of two pages that come from the same data source and belong to the same category. Third, we locate the data zone, which contains the information to be extracted, by computing the similarity of every pair of DOM sub trees in the set of different sub trees. Finally, the node path of every DOM sub tree in the data zone is taken as an extraction rule, which is then used to automatically extract information from new Web pages of the same category. The experiment demonstrates that the method achieves high precision and recall rates. The method also saves the time that users spend filtering information.
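The pipeline in the abstract can be illustrated with a small sketch. The abstract does not give the partitioning rule, the node weighting scheme, or the similarity measure, so everything concrete here is an assumption: each child of `<body>` is treated as one DOM sub tree, two sub trees count as "different" when their text differs, node weights are omitted entirely, and `difflib.SequenceMatcher` over tag sequences stands in for the paper's similarity. The helper names (`partition`, `different_subtrees`, `data_zone`, `make_page`) and the sample pages are hypothetical, and the pages are well-formed so that `xml.etree.ElementTree` can parse them.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations

def partition(page_xml):
    """Assumed partitioning: one sub tree per child of <body>, keyed by node path."""
    body = ET.fromstring(page_xml).find("body")
    seen, subtrees = {}, {}
    for child in body:
        seen[child.tag] = seen.get(child.tag, 0) + 1
        subtrees[f"/html/body/{child.tag}[{seen[child.tag]}]"] = child
    return subtrees

def text_of(node):
    """Text content only, with structural tags stripped."""
    return " ".join(t.strip() for t in node.itertext() if t.strip())

def tags_of(node):
    """Structure only: tag names in document order."""
    return [el.tag for el in node.iter()]

def different_subtrees(a, b):
    """Paths present in both pages whose text content differs."""
    return [p for p in a if p in b and text_of(a[p]) != text_of(b[p])]

def data_zone(subtrees, paths, threshold=0.8):
    """Keep paths whose tag structure resembles at least one other differing sub tree."""
    zone = set()
    for p, q in combinations(paths, 2):
        sim = SequenceMatcher(None, tags_of(subtrees[p]), tags_of(subtrees[q])).ratio()
        if sim >= threshold:  # threshold is an arbitrary illustrative value
            zone.update((p, q))
    return [p for p in paths if p in zone]

def make_page(day, items):
    """Hypothetical template: navigation, date stamp, two records, footer."""
    records = "".join(f"<div><h2>{n}</h2><span>{p}</span></div>" for n, p in items)
    return (f"<html><body><ul><li>Home</li><li>About</li></ul>"
            f"<p>Updated: {day}</p>{records}<p>Contact us</p></body></html>")

page_a = partition(make_page("Monday", [("Red kettle", "19.90"), ("Blue mug", "4.50")]))
page_b = partition(make_page("Tuesday", [("Black pan", "29.00"), ("White cup", "3.10")]))

# Node paths of the data zone serve as the extraction rules.
rules = data_zone(page_a, different_subtrees(page_a, page_b))

# Apply the rules to a new page of the same category.
new_page = partition(make_page("Friday", [("Green lamp", "12.00"), ("Grey bowl", "7.25")]))
extracted = [text_of(new_page[path]) for path in rules]
print(rules)
print(extracted)
```

In this toy example the date stamp also differs between the two pages, but its structure matches none of the other differing sub trees, so the similarity step excludes it from the data zone and only the two record paths become rules.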

Info:

Periodical: Advanced Materials Research (Volumes 694-697)

Pages: 2513-2521

DOI: https://doi.org/10.4028/www.scientific.net/AMR.694-697.2513

Online since: May 2013

Authors: Yuan Long Wang, Hong Jiang, Zhao Hong Bing, Li Zhang

Keywords: Different Sub Tree, DOM Tree, Extraction Rule, Information Extraction, Similarity
