A Web Text De-Noising Algorithm Based on Machine Learning

Abstract:

The Web has become a vast distributed information space containing a huge number of Web documents of many types. Search engines therefore struggle to meet different users' requirements for refined retrieval results. This paper proposes a text de-noising method that first identifies the noise information in a Web page, then generates de-noising rules through human interaction, and finally uses those rules to find and remove the noise. Experiments show that the method is effective in improving classification accuracy.
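The three steps named in the abstract (find candidate noise, confirm rules with a human, remove the noise) can be illustrated with a minimal sketch. The abstract does not say how noise is detected or what form the rules take, so everything below is an assumption made for illustration: noise is taken to be any line repeated on more than half of the pages, a rule is an exact line match, and the function names `find_candidate_noise`, `build_rules`, and `remove_noise` are hypothetical, not the authors' algorithm.

```python
from collections import Counter


def find_candidate_noise(pages):
    """Step 1 (assumed heuristic): lines that appear on more than half the pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = len(pages) / 2
    return sorted(line for line, n in counts.items() if n > threshold)


def build_rules(candidates, confirm):
    """Step 2: keep only the candidates that a human reviewer confirms as noise."""
    return {c for c in candidates if confirm(c)}


def remove_noise(page, rules):
    """Step 3: drop every line that matches a confirmed rule."""
    kept = [line for line in page.splitlines() if line.strip() not in rules]
    return "\n".join(kept)


# Toy pages (invented for this example): shared navigation and footer lines.
pages = [
    "Home | Login\nStock prices rose today.\nCopyright 2014",
    "Home | Login\nThe match ended in a draw.\nCopyright 2014",
    "Home | Login\nCopyright 2014\nNew phone released.",
]

candidates = find_candidate_noise(pages)
# Stand-in for the human interaction step: accept every candidate.
rules = build_rules(candidates, confirm=lambda line: True)
cleaned = [remove_noise(p, rules) for p in pages]
```

In practice the `confirm` callback would prompt a person to accept or reject each candidate, and the cleaned text would then be passed to the classifier.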

Info:

Pages: 516-519

Online since: April 2014

© 2014 Trans Tech Publications Ltd. All Rights Reserved
