Research on the Web Chinese Keywords Extraction Algorithm Based on the Improved TFIDF

Article Preview

Abstract:

This paper proposes an improved algorithm for extracting Chinese keywords from Web documents, based on the traditional feature-word weighting algorithm, TFIDF. A series of controlled experiments shows that the improved algorithm achieves higher accuracy and recall than the traditional one, and that it can extract keywords from the target documents automatically and precisely.

Info:

Pages:

915-919

Online since:

January 2015

Copyright:

© 2015 Trans Tech Publications Ltd. All Rights Reserved

[2] Design Thoughts

The idea behind the traditional TFIDF algorithm can be stated as follows: if a word or word string occurs frequently in one document but seldom appears in other documents, it contributes strongly to identifying that document's category, so it is useful for text categorization. TFIDF is the product TF×IDF, which gives the weight of a feature word [formula not shown]. The traditional TFIDF nevertheless has a drawback. A feature word that occurs frequently within a certain category should receive a high weight, but, as the definitions of TF and IDF show, the traditional algorithm does not reflect this. Hence the traditional TFIDF weighting formula is modified [formula not shown] using two additional quantities: the total number of documents in a certain category, and the frequency of occurrence of the feature word in that category.
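As an illustration of the quantities discussed above, the following is a minimal sketch of the textbook TF×IDF weight together with a per-category statistic. It is not the paper's formula, which this excerpt does not preserve. The names `tfidf_weights` and `category_share` are hypothetical, TF is assumed to be the relative frequency of a term in a document, IDF is assumed to be log(N / df), and the "frequency of occurrence in a category" is assumed to mean the number of documents in the category that contain the term.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook TF-IDF (an assumption, not the paper's exact formula).

    tf  = occurrences of the term in the document / document length
    idf = log(number of documents / documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def category_share(term: str, category_docs: List[List[str]]) -> float:
    """Share of a category's documents that contain the term.

    This combines the two quantities named in the text (documents in the
    category, occurrences of the term in the category) in one assumed way.
    """
    if not category_docs:
        return 0.0
    hits = sum(1 for doc in category_docs if term in doc)
    return hits / len(category_docs)
```

A term that appears in every document gets an IDF of zero under this definition, whatever its `category_share`. That is the kind of behavior a category-aware correction is meant to address, although how the paper combines the two is not recoverable from this excerpt.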

[3] Detailed Description of the Improved Algorithm

The new algorithm was developed on an Acer notebook running Windows XP SP3, with VC 2008 as the development tool and VC++ as the programming language. Text documents are processed with the standard C file-handling functions and CStdioFile. The simulated experiment uses 8 documents in total, all related to warfare.

First, the text to be mined is segmented into words manually; next, feature terms are defined for the resulting word segments through the improved TFIDF; finally, a program is written to implement the algorithm. Running the program calculates the weights (Wij) of all words that appear in the text dataset. If a phrase database has been entered, the feature terms of the document are obtained by comparing the weights against the defined data range; this process is called document feature selection. If no phrase database has been entered, the program stops. The phrase database is saved in a TXT file belonging to the program. Word segmentation of the text to be mined can be done at any time. Once the segmentation is saved, the key information of the document can be sorted out by calculating the feature terms and comparing their weights. With the help of TFIDF, these feature terms transform the unstructured data into structured data, which is then saved. This is the main content of text categorization as well as the key point of textual data mining.
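The control flow described above (stop when no phrase database is available, otherwise select feature terms by comparing weights against a range) might look roughly like the sketch below. It is written in Python for brevity rather than in the VC++ the authors used. The function names, the in-memory dictionaries standing in for the TXT phrase database, and the sample values are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def select_features(weights: Dict[str, float], lower_bound: float) -> List[str]:
    """Keep terms whose weight is at or above lower_bound, highest first."""
    kept = [(term, w) for term, w in weights.items() if w >= lower_bound]
    # Sort by descending weight; break ties alphabetically for stable output.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in kept]


def key_information(
    phrase_db: Dict[str, List[str]],
    weights_by_doc: Dict[str, Dict[str, float]],
    doc_name: str,
    lower_bound: float,
) -> Optional[List[str]]:
    """Return the feature terms of doc_name, or None if it has no segments.

    Returning None models the program stopping when the phrase database
    holds no segmented phrases for the document.
    """
    if not phrase_db.get(doc_name):
        return None
    return select_features(weights_by_doc.get(doc_name, {}), lower_bound)


# Hypothetical data: document names, segments, and weights are made up.
phrase_db = {"a.txt": ["x", "y", "z"], "b.txt": []}
weights_by_doc = {"a.txt": {"x": 0.05, "y": 0.0005, "z": 0.02}}
print(key_information(phrase_db, weights_by_doc, "a.txt", 0.001))
print(key_information(phrase_db, weights_by_doc, "b.txt", 0.001))
```

Keeping the selection step separate from the weight computation means the same threshold logic works with either the traditional or the improved weighting.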

[4] Simulated Testing Process

Segment the documents to be mined and save the results in the "corpus" of the program. Once the phrase database has been segmented, run the executable and click "weight calculation". The program then computes the weights automatically according to the algorithm, and a tooltip pops up when the calculation is finished. Since TFIDF is aimed mainly at collections of documents, a class of related documents is chosen for this test. After segmentation, the text to be mined can be selected from the pull-down menu of the weight. For instance, once "Great Campaign with One Hundred Regiments" has been segmented, choose "Great Campaign with One Hundred Regiments" in the pull-down menu. The test result is shown in Table 1.

Tab. 1 Classification results

Key Words                                    Weights (TFIDF)
Great Campaign with One Hundred Regiments    0.049131
the Eighth Route Army                        0.029684
Anti-Japanese                                0.021862
North China                                  0.020040
Japanese Army                                0.019339
Peng Dehuai                                  0.018424
Japanese Puppet Army                         0.016377
Jiang Jieshi                                 0.013537
the Kuomintang                               0.012570

To keep the computation precise and the weights comparable, all weights are stored to six decimal places. The improved TFIDF converts the unstructured text, which a program cannot interpret directly, into structured data that it can, and so provides a representation suitable for text categorization.

To decide how to select features from the calculated weights, the relationship between a feature term (t) of the vector space model and its weight (Wid) should be considered. In the vector space model, each feature term and its weight form a coordinate, so a document can be represented as a vector in an n-dimensional space through its feature terms and weights. Feature selection comes next. If the number of feature terms is too large, the dimension of the vector space grows and brings about much meaningless calculation, so the number of dimensions should be reduced. Dimension reduction is not the emphasis of this paper and is not discussed further. In this article, the weight range for feature selection is set to ≥ 0.001000, based on a large body of existing experimental data together with the improved TFIDF. Select the document "Great Campaign with One Hundred Regiments" and click "Display Results"; the results are shown in Table 2.

Tab. 2 Key information

Key information of Great Campaign with One Hundred Regiments.txt:

Great Campaign with One Hundred Regiments the Eighth Route Army Anti-Japanese North China Japanese Army Peng Dehuai Japanese Puppet Army Jiang Jieshi the Kuomintang traffic line base Shijiazhuang-Taiyuan Line Puppet Army enemy's rear area North China Liu Bocheng Shanxi-Chahar-Hebei the Communist Party mop up expose stronghold Anti-Japanese War positive anti-war the Communist Party of China military region railway Mao Zedong results of battle confidence sabotage operations road Frontline battlefield senior officers headquarter campaign traffic Zhu De Tai-hang die in battle Japan difficulty cannot be defeated armed force China strive for formal number strengthen inactive destroy the army and the people KMT-CPC conflicts Japanese self-criticism consistent Chinese nation statistics carry out encourage attack strategic region monument Tongpu Railway

Table 2 lists the key information obtained by mining "Great Campaign with One Hundred Regiments" with the modified TFIDF. In this way, the required key information can be sorted out directly from the target document, which greatly reduces the amount of text that has to be read and achieves the purpose of feature representation and selection. When the "Display Results" button is clicked, the TXT document is updated automatically and shows the corresponding key information. If the document to be mined has no segmented phrases saved in the phrase database, no weight information is displayed and the program stops.

Comparisons of the traditional TFIDF's performance with that of the improved TFIDF are shown in the following figures.

[Figures not reproduced; vertical axis in percent (%).]
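The vector space representation mentioned in this section, in which each feature term is an axis and its weight is the coordinate, can be sketched as follows. The helper names are hypothetical, and the sample weights assume that the values in Table 1 carry a leading "0." (for example 0.020040). Rounding and printing to six decimal places mirrors how the weights appear in the table.

```python
from typing import Dict, List


def build_vocabulary(docs_weights: List[Dict[str, float]]) -> List[str]:
    """Collect every feature term across documents in a fixed (sorted) order."""
    return sorted({term for weights in docs_weights for term in weights})


def to_vector(weights: Dict[str, float], vocabulary: List[str]) -> List[float]:
    """Place a document's weights on the vocabulary axes; absent terms get 0.0."""
    return [round(weights.get(term, 0.0), 6) for term in vocabulary]


def format_weight(weight: float) -> str:
    """Render a weight with six decimal places."""
    return f"{weight:.6f}"


# Two illustrative documents; the second one is made up for the example.
doc_a = {"North China": 0.020040, "Peng Dehuai": 0.018424}
doc_b = {"North China": 0.010000}

vocabulary = build_vocabulary([doc_a, doc_b])
vector_b = to_vector(doc_b, vocabulary)
print(vocabulary)
print([format_weight(w) for w in vector_b])
```

The length of each vector equals the size of the vocabulary, which is why discarding low-weight terms before building the vocabulary reduces the dimension of the space.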

Google Scholar