Extracting Images from Chinese PDF Documents

Abstract:

To tap the potential value of Chinese PDF documents efficiently and make better use of them, this paper proposes a novel idea for extracting images from Chinese PDF documents. The idea combines the PDF document structure with the page tree to extract images. Experiments based on this idea were carried out on one hundred Chinese PDF documents and achieved an extraction rate of 83.56 percent. Analysis of the experimental results indicates that the proposed idea is applicable to most Chinese PDF documents and can meet most practical application needs.
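
The abstract does not describe the extraction procedure in detail, so the following is only a minimal sketch of what walking a page tree to locate images could look like, not the authors' method. It assumes a toy in-memory model in which PDF dictionaries are plain Python dicts; the names `iter_pages`, `find_images`, and `page_tree` are hypothetical, and a real document would first have to be parsed with a PDF library. The sketch also ignores images nested inside Form XObjects.

```python
# Toy model of a PDF page tree (hypothetical data, not parsed from a file).
# Keys mirror PDF dictionary names: /Type, /Kids, /Resources, /XObject, /Subtype.

def iter_pages(node, inherited=None):
    """Yield (page, resources) for each leaf /Page under node, depth first."""
    # /Resources is inheritable: a node without its own entry uses its ancestor's.
    resources = node.get("/Resources", inherited)
    if node.get("/Type") == "/Pages":
        for kid in node.get("/Kids", []):
            yield from iter_pages(kid, resources)
    elif node.get("/Type") == "/Page":
        yield node, resources or {}


def find_images(root):
    """Return (page_number, xobject_name) for every image XObject in the tree."""
    found = []
    for number, (_page, resources) in enumerate(iter_pages(root), start=1):
        for name, xobj in resources.get("/XObject", {}).items():
            if xobj.get("/Subtype") == "/Image":
                found.append((number, name))
    return found


page_tree = {
    "/Type": "/Pages",
    "/Resources": {"/XObject": {"/Im0": {"/Subtype": "/Image"}}},
    "/Kids": [
        {"/Type": "/Page"},  # no /Resources of its own, so it inherits the root's
        {"/Type": "/Pages", "/Kids": [
            {"/Type": "/Page", "/Resources": {"/XObject": {
                "/Im1": {"/Subtype": "/Image"},
                "/Fm0": {"/Subtype": "/Form"},  # not an image, skipped
            }}},
        ]},
    ],
}

images = find_images(page_tree)
print(images)
```

Each hit identifies the page and the resource name under which an image stream is registered; decoding the stream itself (filters, color spaces) is a separate step that this sketch leaves out.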

Pages: 887-890

Online since: February 2014

Copyright: © 2014 Trans Tech Publications Ltd. All Rights Reserved
