Document Similarity Measure Based on Topic Model

Abstract:

Document similarity computation is an active research topic in information retrieval (IR) and a key component of automatic document categorization, clustering analysis, fuzzy query, and question answering. Topic modeling is an emerging field in natural language processing (NLP), IR, and machine learning (ML). In this paper, we apply a method based on the latent Dirichlet allocation (LDA) topic model to compute the similarity between documents. By mapping a document from its term-space representation into a topic space, a distribution over topics is derived and used to compute document similarity. An empirical study on a real data set demonstrates the efficiency of our method.
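The pipeline outlined in the abstract (term space to topic space, then similarity over topic distributions) can be illustrated with a minimal sketch. The abstract does not state how the LDA model is estimated or which similarity function is applied, so everything below is an assumption made for illustration: collapsed Gibbs sampling for inference, a similarity defined as one minus the Jensen-Shannon divergence, the hyperparameter values, the toy documents, and the function names `lda_topic_distributions` and `js_similarity`.

```python
import math
import random


def lda_topic_distributions(docs, num_topics, vocab_size,
                            alpha=0.1, beta=0.01, iters=200, seed=0):
    """Estimate per-document topic distributions with collapsed Gibbs sampling.

    docs is a list of documents, each a list of integer word ids.
    The inference method and hyperparameters are assumptions, not the paper's.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                             # tokens assigned to topic k
    assignments = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic proportions: each row is a distribution over topics.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


def js_similarity(p, q):
    """Return 1 minus the base-2 Jensen-Shannon divergence, a value in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [0.5 * (x + y) for x, y in zip(p, q)]
    return 1.0 - (0.5 * kl(p, m) + 0.5 * kl(q, m))


# Hypothetical toy corpus: term-space representation as lists of word ids.
texts = [
    "stock market shares trading price",
    "market price shares investors stock",
    "football match goal team player",
]
tokens = [t.split() for t in texts]
vocab = {w: i for i, w in enumerate(sorted({w for doc in tokens for w in doc}))}
docs = [[vocab[w] for w in doc] for doc in tokens]

theta = lda_topic_distributions(docs, num_topics=2, vocab_size=len(vocab))
print("sim(doc0, doc1):", js_similarity(theta[0], theta[1]))
print("sim(doc0, doc2):", js_similarity(theta[0], theta[2]))
```

Comparing documents through their topic distributions rather than raw term vectors means two documents can score as similar even when they share few exact words, as long as their words load on the same topics. On a corpus this small the sampled topics are unstable, so the printed values are only indicative.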

Pages: 1280-1284

Online since: February 2014

Copyright: © 2014 Trans Tech Publications Ltd. All Rights Reserved
