An Improved Shark Search Algorithm Based on Domain Ontology


Abstract:

In recent years, prevailing topical crawler algorithms have concentrated on the content of topic words. These approaches neglect the semantic relationships among textual concepts, which leads to low correlation between crawled webpages. To address this issue, this paper presents a detailed analysis of the Shark Search algorithm and optimizes it by incorporating the characteristics of semi-structured webpages. Furthermore, we use a domain ontology to enhance the vector space model used in the Shark Search algorithm, and propose a standardized method based on the vector space of the ontology model to improve the TF-IDF evaluation metric. The experimental results show that our algorithm significantly outperforms the state of the art in precision and recall.
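The abstract names its ingredients (TF-IDF, a vector space model, and Shark Search score propagation) without giving formulas. The following is a minimal sketch of how those pieces commonly fit together, not the paper's method. The smoothed IDF variant, the `concept_weight` mapping standing in for ontology-derived weights, and the `decay` and `gamma` defaults are all assumptions made for illustration. The inheritance rule follows the classic Shark Search formulation.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, doc_freq, n_docs, concept_weight=None):
    """Build a sparse TF-IDF vector as a dict of term -> weight.

    concept_weight is a hypothetical mapping from term to an
    ontology-derived multiplier; terms not listed keep weight 1.0.
    """
    concept_weight = concept_weight or {}
    counts = Counter(tokens)
    total = len(tokens) or 1
    vector = {}
    for term, count in counts.items():
        tf = count / total
        # Smoothed IDF (an assumed variant) avoids division by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        vector[term] = tf * idf * concept_weight.get(term, 1.0)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def inherited_score(parent_sim, parent_inherited, decay=0.5):
    """Classic Shark Search inheritance: a child link receives the
    decayed relevance of a relevant parent, otherwise the decayed
    score the parent itself inherited."""
    if parent_sim > 0:
        return decay * parent_sim
    return decay * parent_inherited


def potential_score(inherited, neighborhood, gamma=0.5):
    """Blend the inherited score with the anchor/context score."""
    return gamma * inherited + (1 - gamma) * neighborhood
```

In a crawler built along these lines, `cosine` would score each fetched page against the topic vector, and `potential_score` would order the unvisited links in the frontier queue.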


Pages: 2252-2257

Online since: September 2014


© 2014 Trans Tech Publications Ltd. All Rights Reserved

