Design of the Distributed Web Crawler

Abstract:

At the current scale of the Internet, a single web crawler cannot visit the entire web within an effective time frame, so we develop a distributed web crawler system. Our design considers two facets of parallelism: multi-threading within each node, and distributed parallelism among the nodes; we focus on the latter. We address two issues of the distributed web crawler: the crawl strategy and dynamic configuration. The experimental results show that a hash function based on the web site achieves the goals of the distributed web crawler. While pursuing load balance across the system, we also reduce communication and management overhead as much as possible.
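The site-based hash partitioning described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the choice of MD5, and the node count are assumptions for the example. Hashing the host name rather than the full URL keeps every page of one site on the same node, which localizes per-site politeness rules and DNS caching and keeps inter-node communication low.

```python
from hashlib import md5
from urllib.parse import urlparse

def assign_node(url: str, num_nodes: int) -> int:
    """Map a URL to a crawler node by hashing its site (host name).

    Because the hash key is the host, all URLs from the same
    site are assigned to the same node, so no coordination
    between nodes is needed for per-site crawl state.
    """
    host = urlparse(url).netloc.lower()
    digest = md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# All URLs of one site land on the same node:
a = assign_node("http://example.com/page1", 8)
b = assign_node("http://example.com/page2", 8)
assert a == b
```

A trade-off of site-based hashing is that very large sites can skew the load toward one node; a dynamic reconfiguration step, as the abstract suggests, can reassign sites when node counts change.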

Info:

Periodical:

Advanced Materials Research (Volumes 204-210)

Pages:

1454-1458

Online since:

February 2011

Copyright:

© 2011 Trans Tech Publications Ltd. All Rights Reserved
