Template-Based Delta Compression of Large Scale Web Pages

Abstract:

Delta compression techniques are commonly used in version control systems and on the World Wide Web. They compactly encode the differences between two files or strings in order to reduce communication or storage costs. In this paper, we study the use of delta compression for compressing large collections of web pages according to the similarity of their templates. We propose a framework for template-based delta compression that uses template-based clustering to find web pages with similar templates and then encodes their differences with delta compression to reduce storage cost. We also propose a filter-based optimization of the diff algorithm to improve the efficiency of the delta compression approach. To demonstrate the efficiency of our approach, we present experimental results on a large collection of web pages. Our experiments show that template-based delta compression achieves a significantly better compression ratio than compressing each web page individually.
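
The two-stage idea described in the abstract (group pages by template, then store each page as a delta against a reference page from its group) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's method: the tag-sequence signature, the greedy clustering, the 0.8 threshold, and the function names `template_signature`, `cluster_by_template`, `encode_delta`, and `decode_delta` are all hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for the diff algorithm. The paper's filter-based optimization is not reproduced here.

```python
import difflib
import re

# Matches the name of every opening or closing tag, e.g. "div" or "/div".
TAG_RE = re.compile(r"<\s*(/?\w+)")


def template_signature(html):
    """Return the page's tag-name sequence, a crude stand-in for its template."""
    return [m.group(1).lower() for m in TAG_RE.finditer(html)]


def cluster_by_template(pages, threshold=0.8):
    """Greedy clustering (illustrative only): a page joins the first cluster
    whose reference page (the cluster's first member) has a similar tag sequence."""
    clusters = []  # each cluster is a list of page indices; index 0 is the reference
    for i, page in enumerate(pages):
        sig = template_signature(page)
        for cluster in clusters:
            ref_sig = template_signature(pages[cluster[0]])
            matcher = difflib.SequenceMatcher(None, ref_sig, sig, autojunk=False)
            if matcher.ratio() >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def encode_delta(reference, target):
    """Encode target as copy/insert operations against reference."""
    ops = []
    matcher = difflib.SequenceMatcher(None, reference, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse reference[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # literal text only in target
        # "delete" needs no operation: the reference text is simply not copied
    return ops


def decode_delta(reference, ops):
    """Rebuild the target page from the reference page and its delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(reference[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


if __name__ == "__main__":
    pages = [
        "<html><body><div><h1>News</h1><p>First story</p></div></body></html>",
        "<html><body><div><h1>News</h1><p>Second story, longer</p></div></body></html>",
        "<table><tr><td>1</td><td>2</td></tr></table>",
    ]
    for cluster in cluster_by_template(pages):
        reference = pages[cluster[0]]
        for index in cluster[1:]:
            delta = encode_delta(reference, pages[index])
            assert decode_delta(reference, delta) == pages[index]
            print(index, delta)
```

In this sketch only the reference page of each cluster would be stored in full; every other page is kept as its list of copy and insert operations, which could then be serialized and passed through a general-purpose compressor.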

Pages: 2666-2672

Online since: August 2013

Copyright: © 2013 Trans Tech Publications Ltd. All Rights Reserved
