This example shows how to extract text from a PDF file without relying on system-dependent tools or code. Just use the pyPdf library from http://pybrary.net/pyPdf/
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPdf
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace (the u prefix keeps the non-breaking space
    # a unicode literal, so replace() does not try an ASCII decode)
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("test.pdf")
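The whitespace step in the recipe can be tried on its own, without a PDF. The sketch below is only an illustration: the helper names `collapse_whitespace` and `collapse_per_line` are made up here and are not part of pyPdf. The first helper mirrors the recipe and flattens everything into one line; the second is a possible variant that applies the same cleanup line by line so that line breaks survive.

```python
def collapse_whitespace(text):
    """Replace non-breaking spaces and squeeze each whitespace run to one space."""
    return u" ".join(text.replace(u"\xa0", u" ").split())


def collapse_per_line(text):
    """Same cleanup, applied line by line so that line breaks are kept."""
    lines = (collapse_whitespace(line) for line in text.splitlines())
    # Drop lines that are empty after cleanup.
    return u"\n".join(line for line in lines if line)


sample = u"  Hello\xa0 world \n\n  second   line  "
flat = collapse_whitespace(sample)   # everything on a single line
kept = collapse_per_line(sample)     # one cleaned-up line per input line
```

Whether the per-line variant is useful depends on whether the extracted text contains line breaks at all, which varies from one PDF to another.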
There are more nice PDF manipulations possible with pyPdf. Another way to extract the text from PDF files is to call the Linux command "pdftotext" and capture its output.
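A minimal sketch of the pdftotext route, assuming the pdftotext binary is installed and on the PATH. The function names `build_pdftotext_command` and `get_pdf_content_via_pdftotext` are invented for this illustration. Passing "-" as the output file name makes pdftotext write to stdout, and "-enc UTF-8" fixes the output encoding so it can be decoded reliably.

```python
import subprocess


def build_pdftotext_command(path):
    # "-" as the output file name tells pdftotext to write to stdout.
    return ["pdftotext", "-enc", "UTF-8", path, "-"]


def get_pdf_content_via_pdftotext(path):
    # Raises OSError if pdftotext is not installed, and
    # subprocess.CalledProcessError if it exits with a non-zero status.
    raw = subprocess.check_output(build_pdftotext_command(path))
    text = raw.decode("utf-8")
    # Collapse whitespace, as in the pyPdf version.
    return u" ".join(text.split())
```

Calling get_pdf_content_via_pdftotext("test.pdf") should then return the text of the file as a single string, comparable to the pyPdf version.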
Tags: text
The pdftotext tool in Xpdf (http://www.foolabs.com/xpdf/download.html) can do a similar thing, though not in Python.
Backslash should be escaped. This code doesn't work as posted here. The backslash should be escaped on this line: content = " ".join(content.replace("\xa0", " ").strip().split())
Error found. The given code doesn't work. This error shows up when I run it on my system:
Can anyone solve the problem, and may I know the reason for the error?
This version takes care of Unicode errors:
'ignore' is too rough; I'd rather use print getPDFContent("test.pdf").encode("ascii", "xmlcharrefreplace")
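To see the difference between the two error handlers, here is a small illustration on a made-up sample string (not real PDF output). With "ignore", characters that ASCII cannot represent are silently dropped; with "xmlcharrefreplace", they are kept as numeric character references.

```python
text = u"caf\xe9 \u2013 ok"

# "ignore" drops every character that has no ASCII representation.
dropped = text.encode("ascii", "ignore")

# "xmlcharrefreplace" substitutes a numeric reference such as &#233;
# so the original character can still be recovered later.
escaped = text.encode("ascii", "xmlcharrefreplace")
```

Here dropped comes out as "caf  ok" and escaped as "caf&#233; &#8211; ok", so the second form loses no information.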
How can I get the text with spaces between the words? The output of this program seems to merge all the text into a single string.