Study and Application of Web Crawler Algorithm Based on Heritrix
This paper first introduces the role of the web crawler in a search engine. Based on a detailed analysis of the system architecture of the open-source web crawler Heritrix, it proposes a site-specific parser that parses a particular web site so that the crawl can be restricted to that site. It then removes the impact of the robots.txt file on individual processors and introduces the ELFHash algorithm so that multiple threads can access the crawler's resources efficiently. Finally, the crawl speed before and after the improvements, and the number of pages crawled in the same period of time, are compared; the results verify that the performance of the improved crawler increases noticeably.
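As an illustration of the hashing step, the following is a minimal Java sketch of the classic ELF (PJW-style) string hash. The abstract does not say how the hash is applied inside Heritrix, so the sketch assumes it maps a URL key to one of a fixed number of work queues that separate threads can drain. The names ElfHashSketch, elfHash and queueIndex are hypothetical and are not part of the Heritrix API.

```java
// Hypothetical helper, not Heritrix code: a sketch of the ELF string hash.
final class ElfHashSketch {

    // Classic ELF hash, kept in a long and masked to emulate unsigned 32-bit arithmetic.
    static long elfHash(String key) {
        long hash = 0L;
        for (int i = 0; i < key.length(); i++) {
            // Shift in the next character, discarding anything above 32 bits.
            hash = ((hash << 4) + key.charAt(i)) & 0xFFFFFFFFL;
            long high = hash & 0xF0000000L;
            if (high != 0) {
                // Fold the top four bits back into the lower bits.
                hash ^= high >>> 24;
            }
            // Clear the top four bits so the result always fits in 28 bits.
            hash &= ~high;
        }
        return hash;
    }

    // Assumption: URLs are spread over queueCount queues by hashing a key such as the URL string.
    static int queueIndex(String key, int queueCount) {
        return (int) (elfHash(key) % queueCount);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.org/a.html",
            "http://example.org/b.html",
            "http://example.org/news/c.html"
        };
        for (String url : urls) {
            System.out.println(url + " -> queue " + queueIndex(url, 8));
        }
    }
}
```

Under this assumption, URLs from a single site no longer need to share one queue, so several worker threads can fetch pages from that site in parallel.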
Helen Zhang, Gang Shen and David Jin
D. F. Liu and X. S. Fan, "Study and Application of Web Crawler Algorithm Based on Heritrix", Advanced Materials Research, Vols. 219-220, pp. 1069-1072, 2011