This example shows how to extract text from a PDF file without relying on system-dependent tools or code. It uses the pyPdf library from http://pybrary.net/pyPdf/

Python, 15 lines
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace("\xa0", " ").strip().split())
    return content

print getPDFContent("test.pdf")
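
The recipe above is written for Python 2 and the original pyPdf package. For readers on Python 3, here is a minimal sketch of the same idea. It assumes the successor package `pypdf` is installed, which is not part of the original recipe, and the function names `normalize_whitespace` and `get_pdf_content` are only illustrative.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace into a single space."""
    # In Python 3, str.split() with no arguments also splits on the
    # non-breaking space U+00A0, so no separate replace() is needed.
    return " ".join(text.split())


def get_pdf_content(path):
    """Return the text of all pages of the PDF at `path` as one string."""
    # Assumption: the third-party `pypdf` package is installed.
    # It is imported here so normalize_whitespace() works without it.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Guard against pages that yield no text at all.
    pages = [page.extract_text() or "" for page in reader.pages]
    return normalize_whitespace("\n".join(pages))
```

Under those assumptions, `print(get_pdf_content("test.pdf"))` would play the role of the last line of the original recipe.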

pyPdf supports many other PDF manipulations as well. Another way to extract the text from a PDF file is to call the Linux command "pdftotext" and capture its output.
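
The pdftotext approach can be sketched roughly as follows in Python 3. This assumes the `pdftotext` executable is installed and on the PATH; the helper names `build_pdftotext_command` and `pdf_to_text` are hypothetical, and the choice of UTF-8 output is an assumption rather than something the recipe specifies.

```python
import subprocess


def build_pdftotext_command(path, keep_layout=False):
    """Build the argument list for a pdftotext call that writes to stdout."""
    # -enc UTF-8 requests UTF-8 output so it can be decoded predictably.
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        command.append("-layout")
    # A text-file argument of "-" sends the extracted text to stdout.
    command += [path, "-"]
    return command


def pdf_to_text(path, keep_layout=False):
    """Run pdftotext on `path` and return its output as a string."""
    result = subprocess.run(
        build_pdftotext_command(path, keep_layout),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Undecodable bytes are replaced rather than raising an error.
    return result.stdout.decode("utf-8", errors="replace")
```

Separating command construction from execution keeps the option handling easy to inspect without having the tool installed.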

6 comments

Josiah Carlson 17 years ago

The pdftotext tool in Xpdf (http://www.foolabs.com/xpdf/download.html) can do a similar thing, though not in Python.

Paul Rougieux 16 years, 5 months ago

The backslash should be escaped. This code doesn't work as posted. The backslash should be escaped on this line: content = " ".join(content.replace("\xa0", " ").strip().split())

Narendran Subra 16 years, 2 months ago

Error found. The given code doesn't work. This error appears when I run it on my system:

Traceback (most recent call last):
  File "pdfext.py", line 15, in <module>
    print getPDFContent("testds.pdf")
  File "C:\Python25\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xde' in position 1018: character maps to <undefined>

Can anyone solve the problem, and may I know the reason for the error?
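
As a hedged illustration of the likely cause: the traceback shows `print` encoding the extracted text with the console's cp437 codec, and that code page has no mapping for the character u'\xde'. The small Python 3 sketch below reproduces just that encoding step; the helper name `can_encode` is hypothetical.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# U+00DE is the character named in the traceback above.
thorn = "\xde"

# cp437 has no slot for it, so a strict encode fails.
fails_on_console_codec = not can_encode(thorn, "cp437")

# A codec that covers all of Unicode handles the same character.
works_with_utf8 = can_encode(thorn, "utf-8")
```

So the failure would come from printing to a console with a limited code page rather than from the PDF extraction itself.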

Jesse Aldridge 15 years, 5 months ago

This version takes care of Unicode errors:

import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("test.pdf").encode("ascii", "ignore")

ccpizza 14 years, 6 months ago

'ignore' is too rough; I'd rather use: print getPDFContent("test.pdf").encode("ascii", "xmlcharrefreplace")
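
To make the difference between the error handlers concrete, here is a short Python 3 sketch that encodes the same sample string with each of them. The sample text and the helper name `encode_ascii` are made up for illustration.

```python
def encode_ascii(text, errors):
    """Encode `text` to ASCII using the given error handler."""
    return text.encode("ascii", errors)


sample = "caf\xe9"  # "café"; U+00E9 is not an ASCII character

# "ignore" silently drops the character.
ignored = encode_ascii(sample, "ignore")               # b"caf"

# "replace" substitutes a question mark.
replaced = encode_ascii(sample, "replace")             # b"caf?"

# "xmlcharrefreplace" keeps the information as a numeric reference.
as_xml_ref = encode_ascii(sample, "xmlcharrefreplace")  # b"caf&#233;"
```

Only the last handler leaves enough information in the output to recover the original character later.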

Anupam Patel 9 years ago

How can I get the text with spaces between the words? The output of this program seems to merge all the text into a single string.
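
Part of this comes from the recipe itself: its last step joins everything with single spaces, so all line breaks are lost. The Python 3 sketch below is one possible variant that collapses whitespace only within each line. The function name `collapse_whitespace_keep_lines` is hypothetical, and this would not help where the extractor itself returns words run together.

```python
def collapse_whitespace_keep_lines(text):
    """Collapse whitespace inside each line but keep the line breaks."""
    # Normalize each line on its own; split() also handles U+00A0.
    lines = (" ".join(line.split()) for line in text.splitlines())
    # Drop lines that are empty after collapsing.
    return "\n".join(line for line in lines if line)
```

It could stand in for the "Collapse whitespace" step when the line structure of the extracted text matters.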

Created by Dirk Holtwick on Thu, 12 Apr 2007 (PSF)