Research and Implementation of Topic Crawler Based on Hadoop


Abstract:

This article proposes a distributed topic-focused crawler built on HDFS and MapReduce, the two core technologies of Hadoop. The crawler offers distributed processing capability, scalability, and high reliability. The article analyses topic relevance using the method of conceptual analysis and uses MapReduce to implement web page crawling and updating. Finally, experiments verify the performance, scalability, and reliability of the system.
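As a rough illustration of the crawling step described in the abstract, the sketch below expresses one crawl round as a Hadoop MapReduce job: the mapper fetches each candidate URL listed in an HDFS input file and emits a topic-relevance score, and the reducer keeps the best score per URL for the next round. The class names, topic term list, and keyword-count scoring are illustrative assumptions only and stand in for the paper's conceptual-analysis method.

// Minimal sketch of one crawl round as a Hadoop MapReduce job (assumed layout:
// seed URLs stored one per line in HDFS; scoring is a toy keyword count).
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopicCrawlJob {

    // Mapper: each input line is a candidate URL; fetch the page and emit
    // (URL, relevance score) so relevant pages can be kept for the next round.
    public static class FetchMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        // Hypothetical topic terms; the paper derives relevance differently.
        private static final String[] TOPIC_TERMS = {"hadoop", "crawler", "mapreduce"};

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String url = value.toString().trim();
            if (url.isEmpty()) return;
            try (Scanner sc = new Scanner(new URL(url).openStream(), "UTF-8")) {
                String page = sc.useDelimiter("\\A").hasNext() ? sc.next() : "";
                context.write(new Text(url), new DoubleWritable(score(page)));
            } catch (Exception e) {
                // Unreachable pages are simply skipped in this sketch.
            }
        }

        // Toy relevance measure: fraction of topic terms present in the page.
        private double score(String page) {
            String text = page.toLowerCase();
            int hits = 0;
            for (String term : TOPIC_TERMS) {
                if (text.contains(term)) hits++;
            }
            return (double) hits / TOPIC_TERMS.length;
        }
    }

    // Reducer: the same URL may appear in several splits; keep its best score.
    public static class MaxScoreReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double best = 0.0;
            for (DoubleWritable v : values) best = Math.max(best, v.get());
            context.write(key, new DoubleWritable(best));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "topic-crawl-round");
        job.setJarByClass(TopicCrawlJob.class);
        job.setMapperClass(FetchMapper.class);
        job.setReducerClass(MaxScoreReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS dir of seed URLs
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS dir for scored URLs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running such a job repeatedly, with each round's high-scoring pages feeding the next round's input, gives the distributed, fault-tolerant crawl-and-update loop that the abstract attributes to HDFS and MapReduce.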


Info:

Pages: 1896-1900

Online since: September 2014

Copyright: © 2014 Trans Tech Publications Ltd. All Rights Reserved

