A New Document Representation Using a Unified Graph to Document Similarity Search

Abstract:

Document similarity search retrieves a ranked list of documents similar to a query document from a text corpus or from pages on the web. Most previous research on searching for similar documents, however, has focused on classifying documents based on their contents. To address this problem, we propose a novel retrieval approach that represents each document in the corpus as an undirected graph. In addition, this study considers a unified graph in conjunction with multiple graphs to improve the quality of similar-document search. Experimental results on the Reuters-21578 data set demonstrate that the proposed system performs better than the traditional approach.

Info:

Pages: 394-400

Online since: December 2012

Copyright: © 2013 Trans Tech Publications Ltd. All Rights Reserved
