Topical Concept Based Text Clustering Method

Abstract:

Text clustering typically operates in a high-dimensional space, which is difficult in virtually all practical settings. In addition, given a particular clustering result, it is usually very hard to give a good explanation of why the text clusters were constructed the way they were. To address these problems, this paper proposes a method for Chinese document clustering based on topical concepts. The method, a novel topical document clustering approach called Document Features Indexing Clustering (DFIC), can identify topics accurately and cluster documents according to these topics. In DFIC, “topic elements” are defined and extracted to index base clusters, and document features are also investigated and exploited. Experimental results show that DFIC achieves higher precision (92.76%) than some widely used traditional clustering methods.
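
The abstract does not say how topic elements are defined or extracted, so the following is only a minimal sketch of the general idea of indexing base clusters by a shared element. It assumes each document has already been reduced to a set of candidate terms, and it treats every term shared by at least two documents as a stand-in for a topic element. The function name `build_base_clusters`, the `min_docs` parameter, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_base_clusters(doc_terms, min_docs=2):
    """Index document ids by term; each shared term keys one base cluster.

    Assumption: a plain term stands in for a "topic element". The paper's
    actual definition of a topic element is not given in the abstract.
    """
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    # Keep only terms that occur in at least min_docs documents.
    return {term: ids for term, ids in index.items() if len(ids) >= min_docs}


# Hypothetical toy data: document id -> set of extracted terms.
docs = {
    "d1": {"earthquake", "rescue", "sichuan"},
    "d2": {"earthquake", "sichuan", "aid"},
    "d3": {"olympics", "beijing"},
    "d4": {"olympics", "beijing", "medal"},
}
base_clusters = build_base_clusters(docs)
```

Because each base cluster is keyed by the term that produced it, the key can double as a human-readable label, which is one way a clustering result can be made easier to explain.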

Info:

Periodical:

Advanced Materials Research (Volumes 532-533)

Pages:

939-943

Online since:

June 2012

Copyright:

© 2012 Trans Tech Publications Ltd. All Rights Reserved
