Automatically Extracting University Scholar Names Information and Classification

Abstract:

High-tech talent is an important social resource, comparable to energy and materials, and recruiting such talent is an important strategy for developing national science and technology. This paper aims to extract information about high-tech talent in a variety of research fields from a large number of websites. First, we study the principles of Web crawlers and Web data extraction. Then, taking U.S. universities as an example, we propose an intelligent method and procedure for extracting scholars' names from websites. Finally, we apply a classification algorithm to identify Chinese scholars working overseas and verify the validity of the method in an experimental system. The accuracy of the classification algorithm is higher than 90%, and the average accuracy of the extracted information is higher than 77%.

Pages:

2065-2068

Online since:

January 2014

Copyright:

© 2014 Trans Tech Publications Ltd. All Rights Reserved
