Correlation Based Method to Detect and Remove Redundant Web Document

G. Poonkuzhali; R. Kishore Kumar; R. Kripa Keshav; P. Sudhakar; K. Sarukesi

doi:10.4028/www.scientific.net/AMR.171-172.543

Abstract:

The growth of the internet has flooded the WWW with information, much of it replicated. Because duplicated web pages increase indexing space and time complexity, finding and removing them is important for search engines and similar systems, since it improves both the accuracy of search results and search speed. Web content mining plays a vital role in addressing these issues. Existing web content mining algorithms focus on applying weightage to structured documents, whereas this work develops a mathematical approach based on linear correlation to detect and remove duplicates in both structured and unstructured web documents. In the proposed work, the linear correlation between two web documents is computed. If the correlation value is 1, the documents are considered exactly redundant and the duplicate should be eliminated; otherwise they are not redundant.
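
The decision rule in the abstract can be illustrated with a minimal sketch. The abstract does not say how a document is turned into numbers, so the sketch assumes each document is represented by its term-frequency vector over the joint vocabulary of the two documents (suggested by the paper's "Term Frequency" keyword) and that "linear correlation" means the Pearson coefficient. The function names `term_frequencies`, `linear_correlation` and `is_redundant`, the tokenization, and the tolerance `tol` are all hypothetical choices, not taken from the paper.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphanumeric tokens (an assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def linear_correlation(doc_a, doc_b):
    """Pearson correlation of two term-frequency vectors.

    Vectors are built over the joint vocabulary of both documents.
    Returns None when the coefficient is undefined (empty input or a
    vector with zero variance).
    """
    tf_a, tf_b = term_frequencies(doc_a), term_frequencies(doc_b)
    vocab = sorted(set(tf_a) | set(tf_b))
    n = len(vocab)
    if n == 0:
        return None
    x = [tf_a[t] for t in vocab]
    y = [tf_b[t] for t in vocab]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def is_redundant(doc_a, doc_b, tol=1e-9):
    """Apply the abstract's rule: redundant when the correlation is 1."""
    r = linear_correlation(doc_a, doc_b)
    return r is not None and abs(r - 1.0) <= tol


page_1 = "web mining web content mining web"
page_2 = "web mining web content mining web"
page_3 = "web search engine index search"

print(is_redundant(page_1, page_2))  # identical term frequencies
print(is_redundant(page_1, page_3))  # different term frequencies
```

Two properties of this sketch are worth keeping in mind. The Pearson coefficient is undefined when every term in a document occurs equally often, so such pairs are reported here as not redundant, and a coefficient of 1 also arises when one frequency vector is a positive linear function of the other, not only when the two are identical. How the paper handles these cases is not stated in the abstract.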

Info:

Periodical:

Advanced Materials Research (Volumes 171-172)

Pages:

543-546

DOI:

https://doi.org/10.4028/www.scientific.net/AMR.171-172.543

Online since:

December 2010

Authors:

G. Poonkuzhali, R. Kishore Kumar, R. Kripa Keshav, P. Sudhakar, K. Sarukesi

Keywords:

Duplicates, Linear Correlation, Term Frequency, Web Document
