Two small scripts that extract the images contained in a PDF document as PNG files:
(1) Script 1 extracts all images.
(2) Script 2 extracts only images that are referenced by a page.
Script 1: Extract ALL images
----------------------------
#! python
'''
This demo extracts all images of a PDF as PNG files, whether they are
referenced by pages or not.
It scans through all objects and selects /Type/XObject with /Subtype/Image.
So runtime is determined by number of objects and image volume.
Usage:
extract_img2.py input.pdf
'''
from __future__ import print_function
import fitz
import sys, time, re

checkXO = r"/Type(?= */XObject)"       # finds "/Type/XObject"
checkIM = r"/Subtype(?= */Image)"      # finds "/Subtype/Image"

if len(sys.argv) != 2:
    print('Usage: %s <input file>' % sys.argv[0])
    sys.exit(0)

t0 = time.time()                       # time.clock() no longer exists in Python 3.8+
doc = fitz.open(sys.argv[1])
imgcount = 0
lenXREF = doc._getXrefLength()         # number of objects - do not use entry 0!

# display some file info
print("file: %s, pages: %s, objects: %s" % (sys.argv[1], len(doc), lenXREF-1))

for i in range(1, lenXREF):            # scan through all objects
    text = doc._getObjectString(i)     # string defining the object
    isXObject = re.search(checkXO, text)    # tests for XObject
    isImage = re.search(checkIM, text)      # tests for Image
    if not isXObject or not isImage:   # not an image object if not both True
        continue
    imgcount += 1
    pix = fitz.Pixmap(doc, i)          # make pixmap from image
    if pix.n < 5:                      # can be saved as PNG
        pix.writePNG("img-%s.png" % (i,))
    else:                              # must convert the CMYK first
        pix0 = fitz.Pixmap(fitz.csRGB, pix)
        pix0.writePNG("img-%s.png" % (i,))
        pix0 = None                    # free Pixmap resources
    pix = None                         # free Pixmap resources

t1 = time.time()
print("run time", round(t1-t0, 2))
print("extracted images", imgcount)
--------------------------------------------------------------------------------------------------
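The object test in script 1 can be tried without a PDF at hand. The sketch below reuses the two patterns from the script and applies them to a few object definition strings. The sample strings and the helper name `is_image_object` are made up for illustration; real object strings returned by the library will be longer and may be laid out differently.

```python
import re

# Same patterns as in script 1: the lookahead allows optional spaces
# between the key and its value ("/Type/XObject" or "/Type /XObject").
checkXO = r"/Type(?= */XObject)"
checkIM = r"/Subtype(?= */Image)"

def is_image_object(text):
    """Hypothetical helper: True if the object string describes an image XObject."""
    return bool(re.search(checkXO, text)) and bool(re.search(checkIM, text))

# Made-up object definitions, for illustration only.
samples = {
    "image": "<</Type/XObject/Subtype/Image/Width 100/Height 50>>",
    "image_spaced": "<< /Type /XObject /Subtype /Image /Width 100 >>",
    "form": "<</Type/XObject/Subtype/Form/BBox[0 0 10 10]>>",
    "page": "<</Type/Page/Parent 2 0 R>>",
}

results = {name: is_image_object(text) for name, text in samples.items()}
for name in sorted(results):
    print(name, results[name])
```

Only objects that pass both tests count as images, which is why the form XObject and the page object are skipped.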
Script 2: Only extract page-referenced images
---------------------------------------------
#! python
'''
This demo extracts all images of a PDF as PNG files that are referenced
by pages.
Runtime is determined by number of pages and volume of stored images.
Usage:
extract_img1.py input.pdf
'''
from __future__ import print_function
import fitz
import sys, time

if len(sys.argv) != 2:
    print('Usage: %s <input file>' % sys.argv[0])
    sys.exit(0)

t0 = time.time()                       # time.clock() no longer exists in Python 3.8+
doc = fitz.open(sys.argv[1])
imgcount = 0
lenXREF = doc._getXrefLength()

# display some file info
print("file: %s, pages: %s, objects: %s" % (sys.argv[1], len(doc), lenXREF-1))

for i in range(len(doc)):              # loop over page numbers
    imglist = doc.getPageImageList(i)  # images referenced by this page
    for img in imglist:
        xref = img[0]                  # xref number
        pix = fitz.Pixmap(doc, xref)   # make pixmap from image
        imgcount += 1
        if pix.n < 5:                  # can be saved as PNG
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:                          # must convert CMYK first
            pix0 = fitz.Pixmap(fitz.csRGB, pix)
            pix0.writePNG("p%s-%s.png" % (i, xref))
            pix0 = None                # free Pixmap resources
        pix = None                     # free Pixmap resources

t1 = time.time()
print("run time", round(t1-t0, 2))
print("extracted images", imgcount)
--------------------------------------------------------------------------------------------------
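Script 2 writes one file per (page, xref) pair, so an image referenced on several pages is written once for each of those pages. If one file per image is preferred, the xref numbers can be de-duplicated before extraction. The following is a minimal sketch of that idea, not part of the original scripts. `unique_xrefs` is a hypothetical helper, and the sample tuples are made up. The only thing it takes from script 2 is that item 0 of each entry is the xref number.

```python
def unique_xrefs(page_image_lists):
    """Hypothetical helper: xrefs in first-seen order, repeats across pages dropped."""
    seen = set()
    ordered = []
    for imglist in page_image_lists:
        for img in imglist:
            xref = img[0]              # item 0 is the xref number, as in script 2
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered

# Made-up per-page image lists; fields after the xref are placeholders.
pages = [
    [(12, 0, 640, 480), (15, 0, 100, 100)],   # page 0
    [(12, 0, 640, 480)],                      # page 1 reuses xref 12
    [],                                       # page 2 has no images
    [(20, 0, 32, 32), (15, 0, 100, 100)],     # page 3 reuses xref 15
]

xrefs = unique_xrefs(pages)
print(xrefs)   # [12, 15, 20]
```

In the real script the per-page lists would come from `doc.getPageImageList(i)`, and each remaining xref would be passed to `fitz.Pixmap(doc, xref)` once.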
Comments:
(1) All Python versions from 2.7 to 3.6 are supported. As is common with MuPDF-based software, these scripts run very fast, much faster than most other products in this field (I do not know a faster alternative for this task).
(2) The runtime of extracting all images (script 1) depends on the number of objects in the PDF and on the total image size. Extracting the roughly 180 images of the Adobe PDF manual (330,000 objects) took 7 seconds on my machine.
(3) The runtime of extracting only page-referenced images (script 2) depends on the number of pages and on the total image size. Extracting the 179 page images of the Adobe PDF manual (1310 pages) took 3 seconds on my machine.
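The run times in comments (2) and (3) are plain elapsed-time measurements around the extraction loop. A minimal sketch of such a measurement that runs on both Python 2.7 and Python 3 could look like this. `timed` and `count_even` are hypothetical names used only for illustration, and the workload stands in for the image extraction.

```python
import time

# perf_counter exists on Python 3.3+; fall back to time.time on Python 2.7.
_now = getattr(time, "perf_counter", time.time)

def timed(func, *args, **kwargs):
    """Hypothetical helper: call func and return (result, elapsed seconds)."""
    start = _now()
    result = func(*args, **kwargs)
    return result, _now() - start

def count_even(n):
    """Stand-in workload: count the even numbers below n."""
    return sum(1 for k in range(n) if k % 2 == 0)

result, seconds = timed(count_even, 100000)
print("result", result)
print("run time", round(seconds, 2))
```

Measured times depend on the machine and on the PDF, so figures like the ones above are best read as rough indications.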