Welcome, guest | Sign In | My Account | Store | Cart

A new method select() in PyMuPDF 1.9.0 allows selecting pages of a PDF document to create a new one. Any Python list of integers (0 <= n < page count) can be taken.

The resulting PDF contains all links, annotations and bookmarks (provided they still point to valid targets).

Python, 37 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import fitz                             # this is PyMuPDF 1.9.0
doc = fitz.open("some.pdf")

# An easy start: create new PDFs of the first and last 10 pages ...
l = list(range(10))                     # first 10 pages
doc.select(l)                           # delete all others
doc.save("some-first-10.pdf", garbage=3)# save and clean new PDF
doc.close()
doc = fitz.open("some.pdf")             # recycle PDF
l = list(range(doc.pageCount-10, doc.pageCount))    # last 10 pages
doc.select(l)                           # delete all others
doc.save("some-last-10.pdf", garbage=3) # save and clean new PDF
doc.close()

# page numbers may occur multiple times and in any order ...
doc = fitz.open("some.pdf")             # recycle PDF
doc.select([1,1,1,3,3,3,5,5,5,0,0,0])   # create crazily tripled pages
doc.save("some-crazy-triples.pdf", garbage=3)   # save that & clean new PDF
doc.close()

# new PDF containing the original 2 times
doc = fitz.open("some.pdf")             # recycle PDF
l = list(range(doc.pageCount))          # list of all pages
l += l                                  # two times that [0,...,n,0,...,n]
doc.select(l)                           # PDF will now contain itself twice ...
doc.save("some-times-2.pdf")            # will hardly be bigger than original!
doc.close()

# delete pages without text (or whatever ...)
doc = fitz.open("some.pdf")             # recycle PDF
l = list(range(doc.pageCount))          # list of all pages
for i in l:
    if not doc.getPageText(i)           # if no text on page number i ...
        l.remove(i)                     # delete that page from list
doc.select(l)                           # select remaining pages from the PDF
doc.save("some-non-empty.pdf", garbage=3)         # save PDF, every page has some text now ...
doc.close()
  • PyMuPDF actually supports Python versions 2.7 to 3.5 (x86 and x64).

  • other possibilities of this technique include selection of only the odd (even) pages or reverting the page sequence