
Two small scripts that extract the images contained in a PDF document as PNG files. Script 1 extracts all images; script 2 extracts only the images that are referenced by a page.

Python
Script 1: Extract ALL images
----------------------------
#! python
'''
This demo extracts all images of a PDF as PNG files, whether they are
referenced by pages or not.
It scans through all objects and selects /Type/XObject with /Subtype/Image.
So runtime is determined by number of objects and image volume.
Usage:
extract_img2.py input.pdf
'''
from __future__ import print_function
import fitz
import sys, time, re

checkXO = r"/Type(?= */XObject)"       # finds "/Type/XObject"   
checkIM = r"/Subtype(?= */Image)"      # finds "/Subtype/Image"

if len(sys.argv) != 2:
    print('Usage: %s <input file>' % sys.argv[0])
    sys.exit(1)                        # wrong usage: exit with an error status

# time.clock() was removed in Python 3.8; use perf_counter() where it exists
timer = getattr(time, "perf_counter", None) or time.clock
t0 = timer()
doc = fitz.open(sys.argv[1])
imgcount = 0
lenXREF = doc._getXrefLength()         # number of objects - do not use entry 0!

# display some file info
print("file: %s, pages: %s, objects: %s" % (sys.argv[1], len(doc), lenXREF-1))

for i in range(1, lenXREF):            # scan through all objects
    text = doc._getObjectString(i)     # string defining the object
    isXObject = re.search(checkXO, text)    # tests for XObject
    isImage   = re.search(checkIM, text)    # tests for Image
    if not isXObject or not isImage:   # skip unless both patterns match
        continue
    imgcount += 1
    pix = fitz.Pixmap(doc, i)          # make pixmap from image
    if pix.n < 5:                      # can be saved as PNG
        pix.writePNG("img-%s.png" % (i,))
    else:                              # must convert the CMYK first
        pix0 = fitz.Pixmap(fitz.csRGB, pix)
        pix0.writePNG("img-%s.png" % (i,))
        pix0 = None                    # free Pixmap resources
    pix = None                         # free Pixmap resources

t1 = timer()
print("run time", round(t1-t0, 2))
print("extracted images", imgcount)
--------------------------------------------------------------------------------------------------
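The object test used by script 1 can be tried on its own, without a PDF or PyMuPDF. The following is a minimal sketch that applies the same two patterns to a few object strings. The sample strings and the helper name is_image_object are made up for illustration; real object definitions returned by the library may be laid out differently.

```python
import re

checkXO = r"/Type(?= */XObject)"       # same pattern as in script 1
checkIM = r"/Subtype(?= */Image)"      # same pattern as in script 1


def is_image_object(text):
    """Return True if both patterns match the object definition string."""
    return bool(re.search(checkXO, text)) and bool(re.search(checkIM, text))


# hypothetical object strings, invented for this example
samples = {
    "image": "<</Type/XObject/Subtype/Image/Width 100/Height 50>>",
    "image_spaced": "<< /Type /XObject /Subtype /Image /Width 100 >>",
    "form": "<</Type/XObject/Subtype/Form/BBox[0 0 10 10]>>",
    "font": "<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>",
}

results = dict((name, is_image_object(text)) for name, text in samples.items())
for name in sorted(results):
    print(name, results[name])
```

The lookahead allows any number of spaces between the key and its value, so both the compact and the spaced notation are recognized, while form XObjects and fonts are skipped.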

Script 2: Only extract page-referenced images
---------------------------------------------
#! python
'''
This demo extracts all images of a PDF as PNG files that are referenced
by pages.
Runtime is determined by number of pages and volume of stored images.
Usage:
extract_img1.py input.pdf
'''
from __future__ import print_function
import fitz
import sys, time

if len(sys.argv) != 2:
    print('Usage: %s <input file>' % sys.argv[0])
    sys.exit(1)                        # wrong usage: exit with an error status

# time.clock() was removed in Python 3.8; use perf_counter() where it exists
timer = getattr(time, "perf_counter", None) or time.clock
t0 = timer()
doc = fitz.open(sys.argv[1])
imgcount = 0
lenXREF = doc._getXrefLength()         # number of objects, used for the info line only

# display some file info
print("file: %s, pages: %s, objects: %s" % (sys.argv[1], len(doc), lenXREF-1))

for i in range(len(doc)):              # scan through all pages
    imglist = doc.getPageImageList(i)  # images referenced by page i
    for img in imglist:
        xref = img[0]                  # xref number
        pix = fitz.Pixmap(doc, xref)   # make pixmap from image
        imgcount += 1
        if pix.n < 5:                  # can be saved as PNG
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:                          # must convert CMYK first
            pix0 = fitz.Pixmap(fitz.csRGB, pix)
            pix0.writePNG("p%s-%s.png" % (i, xref))
            pix0 = None                # free Pixmap resources
        pix = None                     # free Pixmap resources

t1 = timer()
print("run time", round(t1-t0, 2))
print("extracted images", imgcount)
--------------------------------------------------------------------------------------------------
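Script 2 writes one file per page reference, so an image used on several pages is saved (and counted) once for each of those pages. If one file per distinct image is preferred, the xref numbers can be tracked in a set. The sketch below shows only that bookkeeping on made-up data: the helper name first_occurrences and the shortened tuples are assumptions for illustration, and the entries of a real page image list carry more items than shown here.

```python
def first_occurrences(page_image_lists):
    """Yield (page_number, xref) for the first page on which each xref appears."""
    seen = set()
    for pno, imglist in enumerate(page_image_lists):
        for img in imglist:
            xref = img[0]              # xref number comes first, as in script 2
            if xref in seen:           # already handled on an earlier page
                continue
            seen.add(xref)
            yield pno, xref


# hypothetical image lists for three pages; only the first item (xref) matters
pages = [
    [(12, "Im0"), (15, "Im1")],
    [(12, "Im0")],
    [(15, "Im1"), (20, "Im2")],
]

unique = list(first_occurrences(pages))
print(unique)
```

In script 2, the same check would go right after xref = img[0], before the Pixmap is created, so that repeated images cost no extra decoding.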
Comments:

(1) All Python versions from 2.7 to 3.6 are supported. As is common with MuPDF-based software, these scripts run very fast, much faster than most other products in this field. I do not know of a faster alternative for this task.

(2) The runtime of extracting all images (script 1) depends on the number of objects in the PDF and on the total image size. Extracting the roughly 180 images of the Adobe PDF manual (330,000 objects) took 7 seconds on my machine.

(3) The runtime of extracting only the images referenced by a page (script 2) depends on the number of pages and on the total image size. Extracting the 179 page images of the Adobe PDF manual (1310 pages) took 3 seconds on my machine.
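The two timings above can be turned into rough throughput figures. This is only back-of-the-envelope arithmetic on the rounded numbers quoted in comments (2) and (3); they are single measurements from one machine, so the results indicate an order of magnitude, not a benchmark. The helper name rate is made up for this example.

```python
def rate(count, seconds):
    """Return the number of items processed per second."""
    return count / float(seconds)


# script 1: about 330,000 objects scanned, about 180 images written, 7 seconds
objects_per_s = rate(330000, 7)
images_per_s_all = rate(180, 7)

# script 2: 1310 pages visited, 179 images written, 3 seconds
pages_per_s = rate(1310, 3)
images_per_s_pages = rate(179, 3)

print("script 1: %.0f objects/s, %.1f images/s" % (objects_per_s, images_per_s_all))
print("script 2: %.0f pages/s, %.1f images/s" % (pages_per_s, images_per_s_pages))
```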