
Convert PDF to plain text (Python recipe) by ccpizza
ActiveState Code (http://code.activestate.com/recipes/577095/)

This is a very raw PDF converter that has no notion of page layout or text positioning: it concatenates the text extracted from each page and collapses all whitespace into single spaces.

To install the required module, run easy_install pypdf in a console.

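import io
import sys

import pyPdf


def getPDFContent(path):
    content = u""
    # Open the PDF in binary mode; it must stay open while pages are read
    with open(path, "rb") as pdf_file:
        pdf = pyPdf.PdfFileReader(pdf_file)
        # Iterate over the pages
        for i in range(pdf.getNumPages()):
            # Extract the text of each page and append it to the content
            content += pdf.getPage(i).extractText() + u" \n"
    # Replace non-breaking spaces and collapse all runs of whitespace
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content


if __name__ == "__main__":
    # Write the text next to the input file, e.g. doc.pdf -> doc.pdf.txt
    with io.open(sys.argv[1] + ".txt", "w", encoding="utf-8") as f:
        f.write(getPDFContent(sys.argv[1]))
    # print(getPDFContent(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))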
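
The recipe above targets the legacy pyPdf module. As a rough sketch only, and not part of the original recipe, the same idea might look like this on Python 3 with the maintained pypdf package, assuming it is installed and exposes PdfReader, reader.pages and page.extract_text(). The names collapse_whitespace and pdf_to_text are hypothetical helpers introduced here for illustration.

```python
def collapse_whitespace(text):
    """Replace non-breaking spaces and squeeze whitespace runs to one space."""
    return " ".join(text.replace("\xa0", " ").split())


def pdf_to_text(path):
    """Return the text of every page as one whitespace-collapsed string."""
    # Assumption: the maintained `pypdf` package is installed (not legacy pyPdf).
    # It is imported lazily so collapse_whitespace() works without it.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Guard against pages that yield no text (e.g. scanned images)
    pages = [page.extract_text() or "" for page in reader.pages]
    return collapse_whitespace(" \n".join(pages))
```

Calling pdf_to_text("some.pdf") would return a single line of text. As in the original, all layout information (columns, line breaks, tables) is lost, because every run of whitespace is reduced to one space.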

      

Tags: converter, pdf

Created by ccpizza on Tue, 9 Mar 2010 (MIT)


Required Modules

pypdf

