ActiveState Code

Recipe 511465: Pure Python PDF to text converter


This example shows how to extract text from a PDF file without system-dependent tools or code. It needs only the pyPdf library from http://pybrary.net/pyPdf/

Python
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace("\xa0", " ").strip().split())
    return content

print getPDFContent("test.pdf")
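
The whitespace-collapsing step at the end of the function can be looked at in isolation. The following is a minimal sketch, assuming Python 3 (the recipe itself is written for Python 2); the helper name collapse_whitespace is illustrative and is not part of pyPdf.

```python
def collapse_whitespace(text):
    """Replace non-breaking spaces and squeeze every whitespace run to one space."""
    # U+00A0 (non-breaking space) is replaced explicitly, as in the recipe.
    text = text.replace("\xa0", " ")
    # split() with no arguments splits on any run of whitespace and drops
    # leading and trailing whitespace, so joining with " " normalizes spacing.
    return " ".join(text.split())


if __name__ == "__main__":
    print(collapse_whitespace("Page one\n\n  continues\xa0here \n"))
```

Because newlines are whitespace too, this also removes the "\n" page separators that the loop adds, so the result is a single line of text.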

Discussion

pyPdf supports other useful PDF manipulations as well. Another way to extract the text from PDF files is to call the Linux command "pdftotext" and capture its output.
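
The "pdftotext" alternative could be sketched as follows. This assumes Python 3, that a pdftotext binary is installed and on the PATH, and that it accepts "-enc UTF-8" and "-" (write to standard output) as documented in its man page. The function names and the injectable run parameter are illustrative choices, not part of any library.

```python
import subprocess


def build_pdftotext_command(pdf_path):
    """Return the argument list for pdftotext; "-" sends the text to stdout."""
    # -enc UTF-8 requests UTF-8 output so the bytes can be decoded predictably.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def pdf_to_text(pdf_path, run=subprocess.run):
    """Run pdftotext on pdf_path and return the captured text as a str."""
    result = run(
        build_pdftotext_command(pdf_path),
        stdout=subprocess.PIPE,  # capture the tool's output
        check=True,              # raise CalledProcessError on a non-zero exit
    )
    # Undecodable bytes are replaced rather than raising an exception.
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    import sys
    # Usage: python this_script.py some.pdf
    if len(sys.argv) > 1:
        print(pdf_to_text(sys.argv[1]))
```

Unlike the pyPdf version, this approach depends on an external program, so it trades portability for whatever extraction quality that tool provides.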

Comments

  1. At 10:20 p.m. on 12 Apr 2007, Josiah Carlson said:

    The pdftotext tool in Xpdf (http://www.foolabs.com/xpdf/download.html) can do a similar thing, though not in Python.

  2. At 6:36 a.m. on 6 Dec 2007, Paul Rougieux said:

    The backslash should be escaped. This code doesn't work as posted; the backslash needs to be escaped on this line: content = " ".join(content.replace("\xa0", " ").strip().split())

  3. At 7:54 p.m. on 20 Feb 2008, Narendran Subra said:

    Error found. The given code doesn't work. This error appears when I run it on my system:

    Traceback (most recent call last):
      File "pdfext.py", line 15, in <module>
        print getPDFContent("testds.pdf")
      File "C:\Python25\lib\encodings\cp437.py", line 12, in encode
        return codecs.charmap_encode(input,errors,encoding_map)
    UnicodeEncodeError: 'charmap' codec can't encode character u'\xde' in position 1018: character maps to <undefined>
    

    Could anyone solve this problem, and may I know the reason for the error?

  4. At 9:26 p.m. on 9 Nov 2008, Jesse Aldridge said:

    This version takes care of Unicode errors:

    import pyPdf
    
    def getPDFContent(path):
        content = ""
        # Load PDF into pyPDF
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # Iterate pages
        for i in range(0, pdf.getNumPages()):
            # Extract text from page and add to content
            content += pdf.getPage(i).extractText() + "\n"
        # Collapse whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content
    
    print getPDFContent("test.pdf").encode("ascii", "ignore")
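
The traceback in comment 3 arises because print encodes the extracted Unicode text for the console code page (cp437 there), which has no mapping for U+00DE. The version in comment 4 avoids this by dropping unencodable characters with the "ignore" error handler. A minimal sketch of the difference between "ignore" and "replace", assuming Python 3; the helper name to_console_safe is hypothetical.

```python
def to_console_safe(text, encoding="ascii", errors="replace"):
    """Return text limited to characters that the target encoding can represent."""
    # Encode with an explicit error handler, then decode back to str.
    # "replace" substitutes "?" for unencodable characters; "ignore" drops them.
    return text.encode(encoding, errors).decode(encoding)


if __name__ == "__main__":
    sample = "A\xdeB"  # U+00DE is the character named in the traceback
    print(to_console_safe(sample))                    # replace, ASCII
    print(to_console_safe(sample, errors="ignore"))   # drop, ASCII
    print(to_console_safe(sample, encoding="cp437"))  # replace, cp437
```

Using "replace" keeps a visible marker where a character was lost, which may make silent data loss easier to spot than "ignore".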
    
