Correlation Based Method to Detect and Remove Redundant Web Document
The growth of the Internet has flooded the WWW with information, much of it replicated. Because duplicated web pages increase indexing space and time complexity, finding and removing them is important for search engines and similar systems, since doing so improves both the accuracy of search results and search speed. Web content mining plays a vital role in addressing these problems. Existing algorithms for web content mining focus on applying weightage to structured documents, whereas this research work develops a mathematical approach based on linear correlation to detect and remove duplicates in both structured and unstructured web documents. In the proposed work, the linear correlation between two web documents is computed. If the correlation value is 1, the documents are considered exactly redundant and the duplicate should be eliminated; otherwise, they are treated as not redundant.
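The correlation test described above can be illustrated with a minimal sketch. The abstract does not say how a web document is turned into numbers, so this example assumes each document is represented as a term-frequency vector over the combined vocabulary of the pair and uses the Pearson coefficient as the linear correlation. The function names (`term_frequencies`, `pearson`, `document_correlation`, `is_redundant`), the tokenizer, and the tolerance `tol` are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def pearson(xs, ys):
    """Pearson linear correlation of two equal-length sequences.

    Returns None when the coefficient is undefined (empty or constant input).
    """
    n = len(xs)
    if n == 0:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def document_correlation(doc_a, doc_b):
    """Correlate the term-frequency vectors of two documents (assumed representation)."""
    tf_a, tf_b = term_frequencies(doc_a), term_frequencies(doc_b)
    vocab = sorted(set(tf_a) | set(tf_b))
    xs = [tf_a[t] for t in vocab]
    ys = [tf_b[t] for t in vocab]
    return pearson(xs, ys)


def is_redundant(doc_a, doc_b, tol=1e-9):
    """Flag the pair as redundant when the correlation equals 1 (within tol)."""
    r = document_correlation(doc_a, doc_b)
    if r is None:
        # Correlation is undefined for constant vectors, e.g. when every term
        # occurs exactly once; fall back to comparing the term counts directly.
        return term_frequencies(doc_a) == term_frequencies(doc_b)
    return abs(r - 1.0) <= tol


page_1 = "the cat sat on the mat"
page_2 = "the cat sat on the mat"
page_3 = "dogs bark loudly at night"

print(is_redundant(page_1, page_2))
print(is_redundant(page_1, page_3))
```

Under this assumed representation, a correlation of 1 also arises when one frequency vector is a positive linear function of the other (for example, a page whose text is repeated twice), so a practical system would likely pair the test with an additional check before deleting a page.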
Zhihua Xu, Gang Shen and Sally Lin
G. Poonkuzhali et al., "Correlation Based Method to Detect and Remove Redundant Web Document", Advanced Materials Research, Vols. 171-172, pp. 543-546, 2011