An Improved Crawler Algorithm Based on Hierarchical Structure Preservation
This paper proposes an improved web crawler algorithm that retrieves more useful information, since the basic web crawler algorithm is inefficient and prone to fetching useless, repeated content. In the proposed algorithm, website URLs are saved hierarchically so that the overall topology of the site is preserved, which converts the crisscrossing, complex URL system of the web from a graph structure into a tree structure. Experiments on a real BBS website show that the algorithm considerably outperforms the basic web crawler algorithm in both crawling speed and the usefulness of the downloaded information. Furthermore, it provides a suitable structural model for incremental crawler algorithms.
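The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of reducing a link graph to a tree. It assumes a breadth-first traversal in which each URL is stored once, under the page where it was first discovered; the paper's actual method may differ. The names build_url_tree, depth, and LINKS are hypothetical, and the link graph is a made-up in-memory stand-in for a small BBS site, so no network access is needed.

```python
from collections import deque


def build_url_tree(start_url, get_links):
    """Traverse links breadth-first, keeping one parent per URL.

    Assumption (not from the paper): a URL is attached to the page
    where it is first discovered; later links to it are skipped.
    """
    parent = {start_url: None}    # URL -> parent URL (None for the root)
    children = {start_url: []}    # URL -> child URLs in discovery order
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in get_links(url):
            if link in parent:
                continue          # repeated link: already in the tree
            parent[link] = url
            children[link] = []
            children[url].append(link)
            queue.append(link)
    return parent, children


def depth(url, parent):
    """Return the level of a URL in the tree (the root is level 0)."""
    level = 0
    while parent[url] is not None:
        url = parent[url]
        level += 1
    return level


# Hypothetical link graph with cycles and cross links.
LINKS = {
    "/": ["/board/1", "/board/2"],
    "/board/1": ["/thread/a", "/", "/board/2"],
    "/board/2": ["/thread/a", "/thread/b"],
    "/thread/a": ["/board/1", "/"],
    "/thread/b": ["/thread/a"],
}

parent, children = build_url_tree("/", lambda u: LINKS.get(u, []))
```

In this sketch every URL is visited exactly once, and the parent map records the level of each page, which is the kind of stored hierarchy an incremental crawl could later revisit selectively.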
Z. F. Hao et al., "An Improved Crawler Algorithm Based on Hierarchical Structure Preservation", Key Engineering Materials, Vols. 474-476, pp. 2120-2124, 2011