An Improved Crawler Algorithm Based on Hierarchical Structure Preservation

Abstract:

This paper proposes an improved web crawler algorithm that retrieves more useful information, since the basic web crawler algorithm is inefficient and prone to fetching useless, repeated content. In the proposed algorithm, website URLs are saved hierarchically to preserve the site's overall topology, which turns the crisscrossing URL system of a website from a graph structure into a tree structure. Experiments on a real BBS website show that the algorithm performs much better than the basic web crawler algorithm in crawling speed and in the usefulness of the downloaded information. Furthermore, it provides a structural model for incremental crawler algorithms.
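
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea it describes: walking a site's link graph and storing each URL once, under the page where it was first found, so that the stored structure is a tree. The breadth-first order, the function name `build_url_tree`, and the in-memory `toy_forum` link graph are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import deque


def build_url_tree(link_graph, root):
    """Breadth-first walk that stores each URL once, under the page that first linked to it.

    link_graph is a hypothetical stand-in for fetching a page and extracting its links.
    """
    parent = {root: None}    # URL -> URL of the page where it was first seen
    depth = {root: 0}        # URL -> level of the URL in the tree
    children = {root: []}    # URL -> URLs first discovered on that page
    queue = deque([root])
    while queue:
        url = queue.popleft()
        for link in link_graph.get(url, []):
            if link in parent:
                # Already stored: this is a cross or back link, so it is
                # not fetched again and adds no edge to the tree.
                continue
            parent[link] = url
            depth[link] = depth[url] + 1
            children[link] = []
            children[url].append(link)
            queue.append(link)
    return parent, depth, children


# Hypothetical forum: boards link to threads, threads link back and across.
toy_forum = {
    "/": ["/board/1", "/board/2"],
    "/board/1": ["/", "/thread/10", "/thread/11", "/board/2"],
    "/board/2": ["/", "/thread/11", "/thread/20"],
    "/thread/10": ["/board/1", "/thread/11"],
    "/thread/11": ["/board/1", "/"],
    "/thread/20": ["/board/2"],
}

parent, depth, children = build_url_tree(toy_forum, "/")
for url in parent:
    print("  " * depth[url] + url)
```

In this sketch the crisscrossing links of `toy_forum` collapse into one root, two boards, and three threads, and every repeated link is skipped instead of being downloaded again. A stored parent and depth per URL is also the kind of record an incremental crawler could revisit level by level, although how the paper does this is not stated in the abstract.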

Info:

Periodical:

Key Engineering Materials (Volumes 474-476)

Pages:

2120-2124

Online since:

April 2011

Copyright:

© 2011 Trans Tech Publications Ltd. All Rights Reserved
