Extracting Images from Chinese PDF Documents

Yong Hua Yin; Ying Jin; Quan Yin Zhu; Yun Yang Yan

doi:10.4028/www.scientific.net/AMM.530-531.887

Abstract:

To tap the potential value of Chinese PDF documents efficiently and make better use of them, this paper proposes extracting images from Chinese PDF documents. The approach combines the PDF document structure with the page tree to locate and extract images. Experiments on one hundred Chinese PDF documents achieved an extraction rate of 83.56 percent. Analysis of the experimental results indicates that the approach is applicable to most Chinese PDF documents and can meet most practical needs.
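The abstract does not give the algorithm itself, so the following is only a minimal sketch of how a page-tree traversal could locate images. It assumes the PDF has already been parsed into nested Python dictionaries keyed by PDF names such as /Pages, /Kids, /Resources, and /XObject. That in-memory model, the function names (`iter_pages`, `iter_images`, `extract_images`), and the sample document are all hypothetical and are not taken from the paper. The sketch relies on two properties of the PDF format: /Resources can be inherited from ancestor nodes of the page tree, and form XObjects can carry their own nested image XObjects.

```python
from typing import Iterator, Optional, Tuple


def iter_pages(node: dict, inherited: Optional[dict] = None) -> Iterator[dict]:
    """Walk the page tree depth-first and yield leaf pages.

    /Resources is an inheritable attribute, so a page without its own
    entry receives the nearest ancestor's resources.
    """
    resources = node.get("/Resources", inherited)
    if node.get("/Type") == "/Pages":
        for kid in node.get("/Kids", []):
            yield from iter_pages(kid, resources)
    elif node.get("/Type") == "/Page":
        page = dict(node)
        page["/Resources"] = resources or {}
        yield page


def iter_images(resources: dict, seen: Optional[set] = None) -> Iterator[Tuple[str, dict]]:
    """Yield (name, xobject) for image XObjects, descending into form XObjects."""
    if seen is None:
        seen = set()
    for name, xobj in resources.get("/XObject", {}).items():
        if id(xobj) in seen:  # guard against self-referencing forms
            continue
        seen.add(id(xobj))
        subtype = xobj.get("/Subtype")
        if subtype == "/Image":
            yield name, xobj
        elif subtype == "/Form":
            yield from iter_images(xobj.get("/Resources", {}), seen)


def extract_images(catalog: dict) -> list:
    """Return (page number, resource name, filter) for every image found."""
    found = []
    for page_number, page in enumerate(iter_pages(catalog["/Pages"]), start=1):
        for name, xobj in iter_images(page["/Resources"]):
            found.append((page_number, name, xobj.get("/Filter")))
    return found


# Hypothetical two-page document: page 1 inherits its resources from the
# root node, page 2 holds an image nested inside a form XObject.
logo = {"/Subtype": "/Image", "/Filter": "/DCTDecode"}
chart = {"/Subtype": "/Image", "/Filter": "/FlateDecode"}
form = {"/Subtype": "/Form", "/Resources": {"/XObject": {"/Im2": chart}}}
catalog = {
    "/Type": "/Catalog",
    "/Pages": {
        "/Type": "/Pages",
        "/Resources": {"/XObject": {"/Im1": logo}},
        "/Kids": [
            {"/Type": "/Page"},
            {
                "/Type": "/Pages",
                "/Kids": [
                    {"/Type": "/Page", "/Resources": {"/XObject": {"/Fm1": form}}},
                ],
            },
        ],
    },
}

print(extract_images(catalog))
```

The sketch only covers images stored as XObjects. Inline images embedded directly in a page's content stream would require parsing that stream and are left out here.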

Info:

Periodical:

Applied Mechanics and Materials (Volumes 530-531)

Pages:

887-890

DOI:

https://doi.org/10.4028/www.scientific.net/AMM.530-531.887

Online since:

February 2014

Authors:

Yong Hua Yin, Ying Jin, Quan Yin Zhu, Yun Yang Yan*

Keywords:

Chinese PDF Documents, Document Structure, Extracting Images, Page Tree

* - Corresponding Author
