A Method of Web Information Extraction Based on Building Different Sub Trees
Abstract:
When extracting Web information, most existing approaches mix the structural tags of the DOM tree with the text content. To solve this problem, we propose a method for automatic Web information extraction. First, we obtain a set of DOM sub trees by partitioning the DOM tree of the Web page. Second, we assign a weight to every node of each DOM sub tree using the weighting method proposed in this paper. Using these weights, we obtain the set of different sub trees by comparing the DOM sub trees of two pages that come from the same data source and belong to the same category. Third, we locate the data zone, which contains the information to be extracted, by computing the similarity of every pair of DOM sub trees in the set of different sub trees. Finally, the node path of every DOM sub tree in the data zone is taken as an extraction rule, and these rules are used to automatically extract information from new Web pages of the same category. The experiment demonstrates that the method achieves higher precision and recall rates. The method also reduces the time users spend filtering information.
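The core idea of the pipeline above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it skips the node weighting and the sub tree similarity computation, whose details the abstract does not give, and simply assumes that any node whose text differs between two pages of the same category belongs to the data zone. All names (`Node`, `TreeBuilder`, `text_paths`, `learn_rules`, `extract`) are hypothetical, and the parser assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they never become the "current" parent.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree (assumes well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_paths(node, prefix=""):
    """Map each text-bearing node's path (tag plus sibling index) to its text."""
    found = {}
    for index, child in enumerate(node.children):
        path = f"{prefix}/{child.tag}[{index}]"
        if child.text:
            found[path] = child.text
        found.update(text_paths(child, path))
    return found


def learn_rules(page_a, page_b):
    """Assumption: paths whose text differs across two same-category pages
    mark the data zone; identical text is treated as template content."""
    a, b = text_paths(parse(page_a)), text_paths(parse(page_b))
    return sorted(p for p in a.keys() & b.keys() if a[p] != b[p])


def extract(page, rules):
    """Apply the learned node paths to a new page of the same category."""
    texts = text_paths(parse(page))
    return [texts[p] for p in rules if p in texts]
```

Given two pages that share a navigation block but differ in a headline and a paragraph, `learn_rules` returns only the paths of the headline and the paragraph, and `extract` then pulls the corresponding text from a third page with the same layout.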
Info:
Pages:
2513-2521
Online since:
May 2013
Copyright:
© 2013 Trans Tech Publications Ltd. All Rights Reserved