Research and Implementation of Topic Crawler Based on Hadoop


Abstract:

This article proposes a distributed topic-focused crawler built on HDFS and MapReduce, the two core technologies of Hadoop. The crawler offers distributed processing capability, scalability, and high reliability. The article analyses topic relevance using the method of conceptual analysis and uses MapReduce to implement web page crawling and updating. Finally, experiments verify the performance, scalability, and reliability of the system.
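As a rough illustration of the crawling step described in the abstract, the sketch below expresses one crawl round as a Hadoop MapReduce job: the mapper fetches each candidate URL listed in an HDFS input file and emits a topic-relevance score, and the reducer keeps the best score per URL for the next round. The class names, topic term list, and keyword-count scoring are illustrative assumptions only and stand in for the paper's conceptual-analysis method.

// Minimal sketch of one crawl round as a Hadoop MapReduce job (assumed layout:
// seed URLs stored one per line in HDFS; scoring is a toy keyword count).
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopicCrawlJob {

    // Mapper: each input line is a candidate URL; fetch the page and emit
    // (URL, relevance score) so relevant pages can be kept for the next round.
    public static class FetchMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        // Hypothetical topic terms; the paper derives relevance differently.
        private static final String[] TOPIC_TERMS = {"hadoop", "crawler", "mapreduce"};

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String url = value.toString().trim();
            if (url.isEmpty()) return;
            try (Scanner sc = new Scanner(new URL(url).openStream(), "UTF-8")) {
                String page = sc.useDelimiter("\\A").hasNext() ? sc.next() : "";
                context.write(new Text(url), new DoubleWritable(score(page)));
            } catch (Exception e) {
                // Unreachable pages are simply skipped in this sketch.
            }
        }

        // Toy relevance measure: fraction of topic terms present in the page.
        private double score(String page) {
            String text = page.toLowerCase();
            int hits = 0;
            for (String term : TOPIC_TERMS) {
                if (text.contains(term)) hits++;
            }
            return (double) hits / TOPIC_TERMS.length;
        }
    }

    // Reducer: the same URL may appear in several splits; keep its best score.
    public static class MaxScoreReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double best = 0.0;
            for (DoubleWritable v : values) best = Math.max(best, v.get());
            context.write(key, new DoubleWritable(best));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "topic-crawl-round");
        job.setJarByClass(TopicCrawlJob.class);
        job.setMapperClass(FetchMapper.class);
        job.setReducerClass(MaxScoreReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS dir of seed URLs
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS dir for scored URLs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running such a job repeatedly, with each round's high-scoring pages feeding the next round's input, gives the distributed, fault-tolerant crawl-and-update loop that the abstract attributes to HDFS and MapReduce.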


Info:

Pages: 1896-1900

Online since: September 2014

Copyright: © 2014 Trans Tech Publications Ltd. All Rights Reserved

