This example shows how to extract text from a PDF file without relying on system-dependent tools or code. Just use the pyPdf library from http://pybrary.net/pyPdf/
Python, 15 lines
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace("\xa0", " ").strip().split())
    return content

print getPDFContent("test.pdf")
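The recipe is written for Python 2 and the original pyPdf. As a rough sketch of the same page loop on Python 3, something like the following may work, assuming the successor package pypdf is installed (it is not part of the original recipe). The names collapse_whitespace and get_pdf_content are illustrative only.

```python
def collapse_whitespace(text):
    """Replace non-breaking spaces and squeeze each whitespace run to one space."""
    return " ".join(text.replace("\xa0", " ").split())


def get_pdf_content(path):
    """Return the text of all pages in the PDF at `path` as one string."""
    # Imported here so the helper above stays usable without pypdf installed.
    from pypdf import PdfReader  # assumption: pypdf, not the original pyPdf

    reader = PdfReader(path)
    # extract_text() may yield nothing for pages without a text layer.
    pages = [page.extract_text() or "" for page in reader.pages]
    return collapse_whitespace("\n".join(pages))


# Usage (needs a real file): print(get_pdf_content("test.pdf"))
```

The whitespace step is kept separate so it can be tried on plain strings without a PDF at hand.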
pyPdf can do many other useful PDF manipulations as well. Another way to extract the text from PDF files is to call the Linux command "pdftotext" and capture its output.
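A minimal sketch of the "pdftotext" route, written for Python 3 and assuming the pdftotext binary is on the PATH. Passing "-" as the output file asks pdftotext to write to standard output, and "-enc UTF-8" requests UTF-8 output. The function names are made up for this example.

```python
import subprocess


def build_pdftotext_command(path):
    """Build the argument list; "-" sends the extracted text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", path, "-"]


def pdf_to_text(path):
    """Run pdftotext on `path` and return its output as a string."""
    result = subprocess.run(
        build_pdftotext_command(path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext exits non-zero
    )
    # Undecodable bytes are replaced rather than raising an error.
    return result.stdout.decode("utf-8", errors="replace")


# Usage (needs pdftotext and a real file): print(pdf_to_text("test.pdf"))
```

Unlike the pyPdf recipe, this depends on an external tool being installed, which is exactly the trade-off the recipe tries to avoid.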
The pdftotext tool in Xpdf (http://www.foolabs.com/xpdf/download.html) can do a similar thing, though not in Python.
Backslash should be escaped. This code doesn't work as posted here. The backslash should be escaped on this line: content = " ".join(content.replace("\xa0", " ").strip().split())
Error found. The given code doesn't work. This error shows when I run it on my system:
Can anyone solve the problem, and may I know the reason for the error?
This version takes care of Unicode errors:
'ignore' is too rough; I'd rather use print getPDFContent("test.pdf").encode("ascii", "xmlcharrefreplace")
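To make the difference concrete, here is a small comparison of the two error handlers on a sample string. It is written for Python 3, where encode() returns bytes, whereas the recipe itself is Python 2. The sample text is invented for illustration.

```python
sample = "na\u00efve caf\u00e9"  # contains two non-ASCII characters

# "ignore" silently drops every character that ASCII cannot represent.
dropped = sample.encode("ascii", "ignore")

# "xmlcharrefreplace" keeps the information as numeric character references.
referenced = sample.encode("ascii", "xmlcharrefreplace")

print(dropped)     # b'nave caf'
print(referenced)  # b'na&#239;ve caf&#233;'
```

With "xmlcharrefreplace" the original characters can still be recovered later, which is not possible after "ignore".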
How can I get the text with spaces in between? The output of this program seems to merge all the text into a single string.
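Part of the merging comes from the "Collapse whitespace" step in the recipe, which turns every newline into a single space. One possible variation, sketched here in Python 3 with a hypothetical helper named join_pages, cleans each page separately and keeps one line per page. If the spaces are already missing between words within a page, that comes from the extraction itself and this sketch will not restore them.

```python
def join_pages(pages):
    """Collapse whitespace inside each page, then join pages with newlines."""
    cleaned = [" ".join(page.replace("\xa0", " ").split()) for page in pages]
    # Skip pages that contain no text after cleaning.
    return "\n".join(text for text in cleaned if text)


# Example with plain strings standing in for extracted page text.
print(join_pages(["Hello\xa0 world", "  ", "second\npage"]))
```

In the recipe, the list of pages would be built by collecting each extractText() result instead of appending to one string.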