Paper Titles

Research Progress of Energy Management System for New Energy Vehicles
p.896

Plant Electrical Signal De-Noising Based on the Lifting Wavelet Transform
p.900

The Design of System to Texture Feature Analysis Based on Gray Level Co-Occurrence Matrix
p.904

Mainly Talks about the Electronic Communication System Key Technology Analysis
p.911

Research on the Web Chinese Keywords Extraction Algorithm Based on the Improved TFIDF
p.915

Research on Wireless Resource Management
p.920

Data Collection and Processing of Cloud Security Botnet Protection System
p.923

The Design of Communication Interface for Profibus-DP Slave Based on SPC3
p.926

The Development and Design of Online Examination System Based on ASP.NET
p.930

Research on the Web Chinese Keywords Extraction Algorithm Based on the Improved TFIDF

Abstract:

An improved extraction algorithm of Web Chinese keywords is proposed in this paper based on the traditional feature words weighted algorithm—TFIDF. A series of controlled experiments have proved that the improved algorithm is superior to the traditional one for higher accuracy and recall rate, and it can precisely and automatically extract the key words from the target documents.

Info:

Periodical:

Applied Mechanics and Materials (Volumes 727-728)

Pages:

915-919

DOI:

https://doi.org/10.4028/www.scientific.net/AMM.727-728.915

Online since:

January 2015

Authors:

Cui Yuan Yu, Jie Shan

Keywords:

Extraction, Feature Word, TFIDF, Word Segmentation

Copyright:

© 2015 Trans Tech Publications Ltd. All Rights Reserved

[2] Design Thoughts

The idea behind the traditional TFIDF algorithm can be stated as follows: if a word or word string occurs frequently in one document but seldom appears in other documents, it contributes strongly to distinguishing that document's category, so text categorization can be carried out effectively. TFIDF is TF×IDF, so the weight of a feature word can be defined as $W_{ij} = TF_{ij} \times IDF_{j}$.

However, the traditional TFIDF has a clear drawback. If a feature word occurs frequently within a particular category, its weight should be high, yet the TF and IDF formulas above do not reflect this. Hence, the traditional TFIDF weighting formula is modified to take two category-level quantities into account: the total number of documents in a given category, and the frequency of occurrence of the feature word in that category.
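To make the traditional weighting concrete, here is a minimal Python sketch. It assumes the common textbook definitions (TF as the term count divided by the document length, IDF as log(N / n_j)); the preview does not show the paper's exact formulas, so these definitions, the function name `tfidf_weights` and the toy documents are all illustrative assumptions. The improved, category-aware formula is not reproduced here because it is not visible in the source.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists (already word-segmented).
    Assumption: tf = count / document length, idf = log(N / n_j).
    """
    n_docs = len(docs)
    # n_j: number of documents that contain term j at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy, hypothetical documents (not the paper's data)
docs = [
    ["army", "campaign", "army", "railway"],
    ["army", "treaty"],
    ["harvest", "railway"],
]
w = tfidf_weights(docs)
```

In this toy set, "campaign" appears only in the first document, so it outweighs "army" there even though "army" occurs more often. A term that appears in every document gets weight zero under this definition.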

[3] Detailed Description of the Improved Algorithm

The new algorithm was developed on an Acer notebook running Windows XP SP3. VC 2008 was selected as the development tool and VC++ as the programming language. Text documents are processed with the file-handling functions of standard C and with CStdioFile. The simulated experiment uses 8 documents in total, all related to warfare.

First, manual word segmentation is applied to the text to be mined; then feature terms are defined for the resulting word segments using the improved TFIDF; finally, a program is written to implement the algorithm. Running the program calculates the weights (Wij) of all words that appear in the text dataset. If a phrase database has been entered, the feature terms of a document are obtained by comparing the weights against a defined value range. This process is called document feature selection. If no phrase database has been entered, the program stops. The phrase database is saved in a TXT file belonging to the program, and word segmentation can be applied to the text to be mined at any time.

Once the segmentation is saved, the key information of the document can be sorted out by calculating the feature terms and comparing their weights. With the help of TFIDF, these feature terms transform unstructured data into structured data, which is then saved. This is the main content of text categorization and the key point of textual data mining.
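The first stage of the flow described above (read the manually segmented phrases, stop when the phrase database is empty) might look like the following Python sketch. The original program was written in VC++, and the layout of its phrase database file is not given, so the whitespace-separated format and the names `load_phrases` and `term_counts` are hypothetical.

```python
from collections import Counter

def load_phrases(segmented_text):
    """Split manually segmented text into a list of terms.

    Assumption: terms are separated by whitespace in the phrase database.
    """
    return segmented_text.split()

def term_counts(segmented_text):
    """Return term counts, or None when the phrase database is empty."""
    phrases = load_phrases(segmented_text)
    if not phrases:
        # Mirrors the described behavior: with no phrase database, stop.
        return None
    return Counter(phrases)

# Hypothetical segmented text standing in for one document
counts = term_counts("army campaign army railway")
```

Returning `None` rather than an empty counter keeps the "no phrase database" case distinct from a document whose terms were all filtered out later.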

[4] Simulated Testing Process

Conduct word segmentation on the documents to be mined and save the results in the "corpus" of the program. As soon as the word segmentation of the phrase database is done, run the executable program and click "weight calculation". The program then runs the calculation automatically according to the algorithm, and a tooltip pops up when the calculation is over. Since TFIDF is mainly aimed at multiple documents, a class of related documents is chosen for this test. When the word segmentation is finished, the text to be mined can be selected from the pull-down menu of the weight. For instance, when the word segmentation of "Great Campaign with One Hundred Regiments" is completed, choose "Great Campaign with One Hundred Regiments" in the pull-down menu. The test result is shown in Table 1.

Tab. 1 Classification results

Key Words                                     Weights (TFIDF)
Great Campaign with One Hundred Regiments     0.049131
the Eighth Route Army                         0.029684
Anti-Japanese                                 0.021862
North China                                   0.020040
Japanese Army                                 0.019339
Peng Dehuai                                   0.018424
Japanese Puppet Army                          0.016377
Jiang Jieshi                                  0.013537
the Kuomintang                                0.012570

To improve the precision of the program and the comparability of the weights, all weights are saved to six decimal places. With the aid of the improved TFIDF, the unrecognizable unstructured data of the text categorization items is converted into recognizable structured data, which achieves the purpose of text categorization presentation.

When deciding how to conduct feature selection on the calculated weights, the relationship between the feature term (t) of the vector space model and the feature term weight (Wid) should be considered. In the vector space model, each feature term and its weight form a coordinate. For this reason, a document can be represented as an n-dimensional vector built from its feature terms and weights.

Next comes feature selection. If the number of feature terms is too large, the number of dimensions of the vector space increases, which brings about much meaningless calculation. Therefore, the number of dimensions should be reduced. This issue is not discussed further here, since it is not the emphasis of this paper. In this paper, a large amount of existing experimental data is used together with the improved TFIDF to restrict the feature-selection weight to the range ≥ 0.001000. Select the document "Great Campaign with One Hundred Regiments" and click "Display Results"; the results are shown in Table 2.

Tab. 2 Key information

Key information of Great Campaign with One Hundred Regiments.txt:

Great Campaign with One Hundred Regiments the Eighth Route Army Anti-Japanese North China Japanese Army Peng Dehuai Japanese Puppet Army Jiang Jieshi the Kuomintang traffic line base Shijiazhuang-Taiyuan Line Puppet Army enemy's rear area North China Liu Bocheng Shanxi-Chahar-Hebei the Communist Party mop up expose stronghold Anti-Japanese War positive anti-war the Communist Party of China military region railway Mao Zedong results of battle confidence sabotage operations road Frontline battlefield senior officers headquarter campaign traffic Zhu De Tai-hang die in battle Japan difficulty cannot be defeated armed force China strive for formal number strengthen inactive destroy the army and the people KMT-CPC conflicts Japanese self-criticism consistent Chinese nation statistics carry out encourage attack strategic region monument Tongpu Railway

The content of Table 2 is the key information obtained by mining "Great Campaign with One Hundred Regiments" with the modified TFIDF. In this way, the requisite key information can be sorted out directly from the target document, which greatly reduces the amount of text that must be read and achieves the purpose of feature representation and selection. When the "Display Results" button is clicked, the TXT document updates automatically and shows the corresponding key information. If the document to be mined has no segmented phrases saved in the phrase database, no weight information is displayed and the program stops.

Comparisons of the traditional TFIDF's performance with that of the improved TFIDF are shown in the following figures.

[Figures: performance comparison of the traditional and improved TFIDF; vertical axis in percent (%)]
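The feature selection step in this section (keep terms whose weight falls in the range ≥ 0.001000, then treat the surviving terms and weights as coordinates of a document vector) can be sketched in Python as follows. The function names `select_features` and `to_vector` are illustrative and not taken from the original VC++ program. The first three weights come from Tab. 1, while "filler term" is a hypothetical low-weight entry added only to show the cut-off.

```python
def select_features(weights, lower_bound=0.001000):
    """Keep terms whose weight is at least lower_bound, highest weight first."""
    kept = {t: w for t, w in weights.items() if w >= lower_bound}
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))

def to_vector(features, vocabulary):
    """Map selected features onto a fixed term order; absent terms get 0.0."""
    return [features.get(term, 0.0) for term in vocabulary]

# First three weights are from Tab. 1; "filler term" is hypothetical.
weights = {
    "Anti-Japanese": 0.021862,
    "filler term": 0.000412,
    "Great Campaign with One Hundred Regiments": 0.049131,
    "the Eighth Route Army": 0.029684,
}
features = select_features(weights)

# Print weights with six decimal places, as they appear in Tab. 1
for term, weight in features.items():
    print(f"{term}\t{weight:.6f}")
```

Dropping terms below the bound is the simplest form of the dimension reduction mentioned in the text: every removed term is one less coordinate in the document vector.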