Web Object Mining Using Entropy Increasing Rate

Abstract:

In this paper, we propose a new method for web object extraction based on entropy theory that considers both tag structure and content patterns when detecting objects. First, it calculates the content entropy of each node in the HTML tag tree. Then, it uses the entropy increasing rate to capture the characteristics of the object region and identify the minimal subtree that contains the objects. Finally, a set of heuristics is applied to make the extraction more accurate. Experimental evaluation shows that the method can improve the overall effectiveness of object mining.
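
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the first two steps under stated assumptions. It assumes that the content entropy of a node is the Shannon entropy of the word distribution in the node's subtree, and that the entropy increasing rate is the relative growth in entropy from a node to its parent. The names `Node`, `TreeBuilder`, `content_entropy`, and `entropy_increase_rate` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class Node:
    """One element of the tag tree, with the words found directly inside it."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.words = []


class TreeBuilder(HTMLParser):
    """Builds a tag tree. Assumes properly nested HTML with no unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.words.extend(data.split())


def subtree_words(node):
    """Collect every word in the subtree rooted at node."""
    words = list(node.words)
    for child in node.children:
        words.extend(subtree_words(child))
    return words


def content_entropy(node):
    """Assumed definition: Shannon entropy (in bits) of the subtree's words."""
    words = subtree_words(node)
    total = len(words)
    if total == 0:
        return 0.0
    counts = Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_increase_rate(node):
    """Assumed definition: relative entropy growth from node to its parent."""
    if node.parent is None:
        return 0.0
    h_node = content_entropy(node)
    if h_node == 0:
        return 0.0
    return (content_entropy(node.parent) - h_node) / h_node


builder = TreeBuilder()
builder.feed("<ul><li>a b</li><li>c d</li></ul>")
ul = builder.root.children[0]
for li in ul.children:
    print(li.tag, content_entropy(li), entropy_increase_rate(li))
```

Under these assumptions, a region of sibling nodes with similar structure but differing content would show a marked rise in entropy at their common parent, which is one plausible way to flag a candidate object region. The heuristics of the final step are not described in the abstract and are not sketched here.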

Info:

Periodical:

Advanced Materials Research (Volumes 403-408)

Pages:

2602-2606

Online since:

November 2011

Copyright:

© 2012 Trans Tech Publications Ltd. All Rights Reserved
