Paper Title:
Study and Application of Web Crawler Algorithm Based on Heritrix
  Abstract

This paper first introduces the role of the web crawler in a search engine. Based on a detailed analysis of the system architecture of the open-source web crawler Heritrix, it then proposes a custom parser that parses a particular web site so that the crawl can be restricted to that site. Next, it removes the impact of the robots.txt file on individual processors and introduces the ELFHash algorithm to give multiple threads efficient access to the crawler's resources. Finally, it compares the page-crawling speed before and after the improvements, and analyzes the number of pages crawled in the same period of time, to verify that the improved crawler performs noticeably better.
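
The abstract does not spell out how ELFHash is applied, so the following is only a minimal sketch of the widely known ELF hash function, written in Python for illustration. The helper `queue_index` and the idea of mapping each URL to one of a fixed number of work queues are assumptions made here to show how such a hash could spread URLs across crawler threads; they are not taken from the paper or from the Heritrix API.

```python
def elf_hash(key: str) -> int:
    """Classic ELF hash over the UTF-8 bytes of key, kept to 32 bits."""
    h = 0
    for byte in key.encode("utf-8"):
        # Shift in the next byte, emulating 32-bit unsigned arithmetic.
        h = ((h << 4) + byte) & 0xFFFFFFFF
        # Fold the top four bits back into the low bits, then clear them.
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
        h &= ~g & 0xFFFFFFFF
    return h


def queue_index(url: str, n_queues: int) -> int:
    """Hypothetical helper: pick a work queue for a URL (an assumption)."""
    if n_queues <= 0:
        raise ValueError("n_queues must be positive")
    return elf_hash(url) % n_queues


if __name__ == "__main__":
    # Illustrative URLs only; they are not from the paper.
    for url in ("http://example.com/a", "http://example.com/b"):
        print(url, "-> queue", queue_index(url, 8))
```

Under this assumption, URLs from a single site are spread over several queues instead of one per-host queue, so more than one thread can work on the same site at a time.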

  Info
Periodical
Advanced Materials Research (Volumes 219-220)
Edited by
Helen Zhang, Gang Shen and David Jin
Pages
1069-1072
DOI
10.4028/www.scientific.net/AMR.219-220.1069
Citation
D. F. Liu, X. S. Fan, "Study and Application of Web Crawler Algorithm Based on Heritrix", Advanced Materials Research, Vols. 219-220, pp. 1069-1072, 2011
Online since
March 2011