A Hierarchical Bayesian Model for Text Corpora

Abstract:

We propose a new generative probabilistic Dirichlet Author-Topic (DAT) model for extracting information about authors and topics from large text collections. DAT is a three-level hierarchical Bayesian model. It builds on the Author-Topic (AT) model, adding the key attribute that the distribution over authors is conditioned on a Dirichlet prior. The probability distribution over topics in a multi-author document is a mixture of the distributions associated with its authors. The three levels of distributions, document-author, author-topic, and topic-word, are learned from the data in an unsupervised manner using a Gibbs sampling algorithm. We give results on a large corpus containing 1740 papers from the Neural Information Processing Systems (NIPS) conference. Experiments based on perplexity scores for test documents illustrate systematic differences between the proposed model and a number of alternatives.
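Since the abstract only sketches the algorithm, a concrete illustration may help. Below is a minimal collapsed Gibbs sampler, in Python, for an author-topic model with a Dirichlet prior on each document's distribution over authors, in the spirit of the model described above. This is a sketch under stated assumptions: the hyperparameter names (alpha, beta, eta), their defaults, and the exact update are illustrative, not the paper's own notation or implementation.

import numpy as np

def gibbs_author_topic(docs, doc_authors, V, A, K,
                       alpha=0.1, beta=0.01, eta=0.1, n_iters=200, seed=0):
    # docs: list of documents, each a list of word ids in [0, V)
    # doc_authors: list of author-id lists (ids in [0, A)), one per document
    rng = np.random.default_rng(seed)
    n_ak = np.zeros((A, K))   # author-topic counts
    n_kw = np.zeros((K, V))   # topic-word counts
    n_k = np.zeros(K)         # tokens per topic
    n_a = np.zeros(A)         # tokens per author
    n_da = [np.zeros(len(au)) for au in doc_authors]  # document-author counts
    z = [[0] * len(doc) for doc in docs]  # topic assignment per token
    x = [[0] * len(doc) for doc in docs]  # author index into the doc's author list

    def move(d, i, w, delta):
        # add (delta=+1) or remove (delta=-1) token i of document d from all counts
        j, k = x[d][i], z[d][i]
        a = doc_authors[d][j]
        n_ak[a, k] += delta; n_kw[k, w] += delta
        n_k[k] += delta; n_a[a] += delta; n_da[d][j] += delta

    for d, doc in enumerate(docs):        # random initialisation
        for i, w in enumerate(doc):
            z[d][i] = rng.integers(K)
            x[d][i] = rng.integers(len(doc_authors[d]))
            move(d, i, w, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            au = np.asarray(doc_authors[d])
            for i, w in enumerate(doc):
                move(d, i, w, -1)         # leave the current token out
                # joint conditional over (author, topic) pairs:
                # document-author term * author-topic term * topic-word term
                p = ((n_da[d] + eta)[:, None]
                     * (n_ak[au] + alpha) / (n_a[au][:, None] + K * alpha)
                     * ((n_kw[:, w] + beta) / (n_k + V * beta))[None, :])
                p = (p / p.sum()).ravel()
                idx = rng.choice(p.size, p=p)
                x[d][i], z[d][i] = divmod(idx, K)  # row-major: j = idx // K, k = idx % K
                move(d, i, w, +1)

    # posterior-mean estimates of the author-topic and topic-word distributions
    theta = (n_ak + alpha) / (n_a[:, None] + K * alpha)
    phi = (n_kw + beta) / (n_k[:, None] + V * beta)
    return theta, phi

A toy call, on hypothetical data, looks like:

docs = [[0, 1, 2, 1, 0], [3, 4, 3, 4, 2]]   # two documents over a 5-word vocabulary
doc_authors = [[0, 1], [1, 2]]              # author lists for each document
theta, phi = gibbs_author_topic(docs, doc_authors, V=5, A=3, K=2, n_iters=100)

The document-author factor (n_da + eta) is what a Dirichlet prior over authors contributes relative to the AT model, which instead picks an author uniformly from the document's author list. Test-set perplexity, the score used in the experiments, is exp(-(sum_d log p(w_d)) / (sum_d N_d)), so lower is better.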

Info:

Pages: 1237-1240

Online since: November 2014

Copyright: © 2014 Trans Tech Publications Ltd. All Rights Reserved

References:

[1] D. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[2] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Banff, Canada, 2004.

[3] M. Rosen-Zvi, C. Chemudugunta, T. Griffiths, P. Smyth, and M. Steyvers. Learning author-topic models from text corpora. ACM Transactions on Information Systems, 28(1), Article 4, 2010.

DOI: 10.1145/1658377.1658381

[4] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50-57, 1999.

DOI: 10.1145/312624.312649

[5] D. Cohn and T. Hofmann. The missing link: a probabilistic model of document content and hypertext connectivity. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pp. 430-436, 2001.

[6] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences, pp. 5220-5227, 2004.

DOI: 10.1073/pnas.0307760101

[7] A. McCallum, X. Wang, and A. Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research, 2007.

DOI: 10.1613/jair.2229

[8] D. Mimno and A. McCallum. Expertise modeling for matching papers with reviewers. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2007.

DOI: 10.1145/1281192.1281247

[9] A. McCallum. Multi-label text classification with a mixture model trained by EM. In AAAI'99 Workshop on Text Learning, 1999.

[10] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, pp. 5228-5235, 2004.

DOI: 10.1073/pnas.0307752101

[11] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pp. 352-359, 2002.

[12] W. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall, New York, NY, 1996.

DOI: 10.1201/b14835