Study and Application of Web Crawler Algorithm Based on Heritrix

Abstract:

This paper first introduces the role of the web crawler in a search engine. Based on a detailed analysis of the system architecture of the open-source web crawler Heritrix, it proposes the design of a custom parser that parses a particular web site in order to achieve a targeted crawl. It then eliminates the impact of the robots.txt file on individual processors and introduces the ELFHash algorithm to give multiple threads efficient access to the crawler's resources. Finally, by comparing the page-crawl speed before and after the improvement, and by analyzing the number of pages crawled over the same length of time, it verifies that the performance of the improved crawler increased noticeably.
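The abstract names the ELFHash algorithm but does not show how it is applied. As a rough illustration only, the sketch below implements the classic ELF string hash and uses it to map a host name to one of a fixed number of work queues, so that each crawler thread can draw from its own queue. The class name `ElfHashSketch`, the method `queueFor`, and the queue count are assumptions made for this example; they are not taken from the paper or from the Heritrix API.

```java
// Illustrative sketch; names here are hypothetical, not Heritrix classes.
class ElfHashSketch {

    // Classic ELF hash: shift in 4 bits per character and fold the
    // top 4 bits back into the lower bits whenever they become non-zero.
    static int elfHash(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            hash = (hash << 4) + key.charAt(i);
            int high = hash & 0xF0000000;
            if (high != 0) {
                hash ^= high >>> 24; // unsigned shift keeps the fold positive
            }
            hash &= ~high;           // clear the top 4 bits
        }
        return hash & 0x7FFFFFFF;    // guarantee a non-negative result
    }

    // Assumed usage: pick a queue index for a host so that URLs from the
    // same host always land in the same queue.
    static int queueFor(String host, int queueCount) {
        return elfHash(host) % queueCount;
    }

    public static void main(String[] args) {
        String[] hosts = {"example.com", "example.org", "example.net"};
        for (String host : hosts) {
            System.out.println(host + " -> queue " + queueFor(host, 8));
        }
    }
}
```

Because the hash depends only on the host string, every URL from one host is routed to the same queue, which is one plausible way to spread work across threads without two threads contending for the same host.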

Info:

Periodical:

Advanced Materials Research (Volumes 219-220)

Edited by:

Helen Zhang, Gang Shen and David Jin

Pages:

1069-1072

DOI:

10.4028/www.scientific.net/AMR.219-220.1069

Citation:

D. F. Liu and X. S. Fan, "Study and Application of Web Crawler Algorithm Based on Heritrix", Advanced Materials Research, Vols. 219-220, pp. 1069-1072, 2011

Online since:

March 2011
