Correlation Based Method to Detect and Remove Redundant Web Document

Abstract:


The growth of the internet has flooded the WWW with abundant information, much of it replicated. Because duplicated web pages increase indexing space and time complexity, finding and removing them is important for search engines and similar systems, improving both the accuracy of search results and search speed. Web content mining plays a vital role in addressing these issues. Existing algorithms for web content mining focus on applying weightage to structured documents, whereas this research work develops a mathematical approach based on linear correlation to detect and remove duplicates in both structured and unstructured web documents. In the proposed work, the linear correlation between two web documents is computed. If the correlation value is 1, the documents are said to be exactly redundant and the duplicate should be eliminated; otherwise they are not redundant.
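The correlation test described in the abstract can be illustrated with a minimal sketch. The abstract does not say how a document is turned into numbers, so this example assumes each document is represented by its term-frequency vector over the combined vocabulary and uses the Pearson coefficient as the linear correlation. The function names (term_frequencies, pearson_correlation, is_redundant), the tokenizer, and the tolerance are all hypothetical choices for illustration, not the authors' published method.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text and count word occurrences (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def pearson_correlation(doc_a, doc_b):
    """Pearson correlation of the two documents' term-frequency vectors.

    Returns None when the coefficient is undefined, i.e. when the
    vocabulary is empty or either vector has zero variance.
    """
    tf_a, tf_b = term_frequencies(doc_a), term_frequencies(doc_b)
    vocab = sorted(set(tf_a) | set(tf_b))
    if not vocab:
        return None
    x = [tf_a[w] for w in vocab]
    y = [tf_b[w] for w in vocab]
    n = len(vocab)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def is_redundant(doc_a, doc_b, tol=1e-9):
    """Treat a correlation of 1 (within a float tolerance) as redundant."""
    r = pearson_correlation(doc_a, doc_b)
    return r is not None and abs(r - 1.0) <= tol


page_1 = "the cat sat on the mat the end"
page_2 = "dogs bark loudly at night night"
print(is_redundant(page_1, page_1))  # same text compared with itself
print(is_redundant(page_1, page_2))  # unrelated text
```

Two properties of this particular representation are worth keeping in mind. The coefficient is undefined when every word in a document occurs equally often, so the sketch returns None for such pairs rather than guessing. It also equals 1 when one document's word counts are a constant multiple of the other's, so a page that repeats another page's text twice would be flagged as redundant under these assumptions.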

Info:

Periodical:

Advanced Materials Research (Volumes 171-172)

Edited by:

Zhihua Xu, Gang Shen and Sally Lin

Pages:

543-546

DOI:

10.4028/www.scientific.net/AMR.171-172.543

Citation:

G. Poonkuzhali et al., "Correlation Based Method to Detect and Remove Redundant Web Document", Advanced Materials Research, Vols. 171-172, pp. 543-546, 2011

Online since:

December 2010

