Correlation Based Method to Detect and Remove Redundant Web Document

Abstract:

The growth of the internet has flooded the WWW with abundant information, much of it replicated. Because duplicated web pages increase indexing space and time complexity, finding and removing them is important for search engines and similar systems: doing so improves both the accuracy of search results and search speed. Web content mining plays a vital role in addressing these issues. Existing web content mining algorithms focus on applying weightage to structured documents, whereas this research work develops a mathematical approach based on linear correlation to detect and remove duplicates in both structured and unstructured web documents. In the proposed work, the linear correlation between two web documents is computed. If the correlation value is 1, the documents are considered exactly redundant and the duplicate should be eliminated; otherwise they are not redundant.
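The correlation test described in the abstract could be sketched roughly as follows. The abstract does not say how a web document is turned into numbers, so this sketch assumes each document is represented by its term-frequency vector over the combined vocabulary of the two documents. The function names (`term_counts`, `linear_correlation`, `is_redundant`), the tokenization, and the tolerance `tol` are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphanumeric tokens (the tokenization is an assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def linear_correlation(doc_a, doc_b):
    """Pearson correlation of two documents' term-frequency vectors.

    Returns None when the correlation is undefined.
    """
    counts_a, counts_b = term_counts(doc_a), term_counts(doc_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    if not vocab:
        return None
    xs = [counts_a[w] for w in vocab]
    ys = [counts_b[w] for w in vocab]
    if xs == ys:
        # Assumption: identical vectors count as perfectly correlated, even
        # when every term occurs equally often and the variance is zero.
        return 1.0
    n = len(vocab)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def is_redundant(doc_a, doc_b, tol=1e-9):
    """Apply the abstract's rule: a correlation of 1 means redundant."""
    r = linear_correlation(doc_a, doc_b)
    return r is not None and abs(r - 1.0) <= tol


page_a = "web mining finds duplicate web pages"
page_b = "Web mining finds duplicate web pages."
page_c = "search engines index pages quickly"
print(is_redundant(page_a, page_b))
print(is_redundant(page_a, page_c))
```

Under this representation, a correlation of 1 also arises when one term-frequency vector is a positive multiple of the other, for example a page whose text is repeated twice. Whether such pages should count as exactly redundant depends on the document representation used in the paper, which the abstract does not specify.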

Info:

Periodical:

Advanced Materials Research (Volumes 171-172)

Pages:

543-546

Online since:

December 2010

Copyright:

© 2011 Trans Tech Publications Ltd. All Rights Reserved
