A Novel Method for Text Similarity Calculation

Abstract:

Because the traditional vector space model ignores word order, text similarity scores computed with it can be biased. This paper proposes a text similarity calculation method that combines the longest common subsequence with the traditional vector space model, so that both word order and word frequency are taken into account. Word-order information is drawn from the longest common subsequence of the two texts and from all of their common substrings, while word-frequency information is captured through the term weights of the traditional vector space model. The method was tested on part of a dataset collected with a web crawler, and the results demonstrate its effectiveness.
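The idea of pairing an order-sensitive score with a frequency-based score can be illustrated with a minimal Python sketch. The abstract does not state how the two scores are combined, so the linear mix, the `alpha` parameter, the whitespace tokenization, the raw term-frequency weights, and all function names below are assumptions made for illustration only, not the authors' formulation.

```python
import math
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def cosine_tf(a, b):
    """Cosine similarity of raw term-frequency vectors (ignores word order)."""
    ta, tb = Counter(a), Counter(b)
    dot = sum(ta[w] * tb[w] for w in ta.keys() & tb.keys())
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_similarity(text_a, text_b, alpha=0.5):
    """Hypothetical mix: alpha * frequency score + (1 - alpha) * order score."""
    a, b = text_a.lower().split(), text_b.lower().split()
    if not a or not b:
        return 0.0
    # Order score: LCS length normalized by the longer text (an assumption).
    order_score = lcs_length(a, b) / max(len(a), len(b))
    return alpha * cosine_tf(a, b) + (1 - alpha) * order_score


if __name__ == "__main__":
    s1 = "the dog bit the man"
    s2 = "the man bit the dog"
    print(cosine_tf(s1.split(), s2.split()))  # same bag of words
    print(combined_similarity(s1, s2))        # lowered by the order term
```

The two example sentences contain exactly the same words, so a pure term-frequency comparison cannot tell them apart; the longest-common-subsequence term is what lowers the combined score when the word order differs.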

Info:

Pages:

202-206

Online since:

February 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved
