Products.PDFtoOCR | Python Package Manager Index (PyPM)

pypm install products.pdftoocr

How to install Products.PDFtoOCR

Download and install ActivePython
Open Command Prompt
Type pypm install products.pdftoocr

Python 2.7

Python 3.2

Python 3.3

Windows (32-bit): 1.1, Available
Windows (64-bit): 1.1, Available
Mac OS X (10.5+): 1.1, Available
Linux (32-bit): 1.1, Available
Linux (64-bit): 1.1, Available

web zope plone theme

Author

Plone Collective

License

GPL

Latest release

version 1.1 on Jan 5th, 2011

Introduction

PDFtoOCR processes text in PDF documents using OCR. This is needed when text cannot be extracted directly from a (scanned) PDF. PDFtoOCR uses content rules to schedule the OCR processing. The processing cannot be done on the fly, for example with a custom TextIndexNG plugin, because running OCR on large PDF documents takes a lot of time and processor capacity.

Configuration

On the operating system

PDFtoOCR uses three tools that are available under Linux. Cooperation with these tools has only been tested on Debian, but it will probably work in other *NIX environments.

Install the requirements. PDFtoOCR uses the following programs:

pdftotext, checks if OCR processing is necessary
ghostscript, converts the pdf documents to tiff images
tesseract, does the OCR processing (make sure you've got all language packs!*)

Set the environment variables:

The environment variable $GS must be set and point to the ghostscript binary.
The environment variable $TESSERACT must be set and point to the tesseract binary.
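
A quick way to verify this setup before enabling the content rule is a small check script. The following is a minimal sketch, not part of PDFtoOCR: the function name check_ocr_environment is hypothetical, and looking up pdftotext on the PATH is an assumption, since only $GS and $TESSERACT are described above.

```python
import os
import shutil


def check_ocr_environment(environ=None):
    """Return a list of human-readable problems; an empty list means OK."""
    environ = os.environ if environ is None else environ
    problems = []

    # $GS and $TESSERACT must be set and point to the respective binaries.
    for var in ("GS", "TESSERACT"):
        path = environ.get(var)
        if not path:
            problems.append("$%s is not set" % var)
        elif not (os.path.isfile(path) and os.access(path, os.X_OK)):
            problems.append("$%s does not point to an executable: %s" % (var, path))

    # Assumption: pdftotext is found via the PATH.
    if shutil.which("pdftotext", path=environ.get("PATH", "")) is None:
        problems.append("pdftotext was not found on the PATH")

    return problems


if __name__ == "__main__":
    for problem in check_ocr_environment():
        print(problem)
```

Run it as the same user, and with the same environment, as the Zope process; otherwise the result says little about what Plone will see.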

On the Plone site

Add a content rule

Event trigger: Object modified and object added
Condition: Content type is file
Actions: Store OCR output from a PDF in searchable text

Assign content rule to a Plone site or a folder

Install cron4plone and add the following cronjob: portal/@@do_pdf_ocr_index

Usage

Just add a file with a PDF document. Optionally you can select the language so the OCR engine can use dictionaries when indexing. Tesseract supports only a limited number of languages.

An overview of indexed documents is found in the control panel, under 'PDF to OCR status'. On this status page, documents can be indexed or reindexed.

PDF Processing

Each time a file is added or modified, the unique id (uid) of the file is added to a queue. This queue is persistent and serves two functions: indexing and reindexing. The indexing function uses the queue to process the documents. When reindexing is used, all files in the queue history are processed.
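
The queue behaviour described above can be illustrated with a small in-memory model. This is only a sketch under assumptions: the class and method names are hypothetical, the real queue is persistent (stored in the ZODB) rather than a plain list, and skipping a uid that is already pending is a guess rather than documented behaviour.

```python
class OcrQueue:
    """In-memory stand-in for the persistent queue (hypothetical names)."""

    def __init__(self):
        self.pending = []  # uids waiting to be indexed
        self.history = []  # uids that have been processed at least once

    def add(self, uid):
        # Called when a file is added or modified.
        # Assumption: a uid that is already pending is not queued twice.
        if uid not in self.pending:
            self.pending.append(uid)

    def index(self, process):
        # Process every pending uid once, then remember it in the history.
        while self.pending:
            uid = self.pending.pop(0)
            process(uid)
            if uid not in self.history:
                self.history.append(uid)

    def reindex(self, process):
        # Process every uid recorded in the history again.
        for uid in list(self.history):
            process(uid)
```

Here process stands for whatever callable does the actual text extraction for one uid.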

If text can be extracted from a PDF document using pdftotext, no OCR is done. Otherwise OCR extracts the text and stores it in the File content type. The ATFile is patched with an extra field to accommodate the extracted text and the language of the PDF.
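
The decision between plain extraction and OCR could look roughly like the sketch below. It is not the package's actual code: the function names, the tiffg4 device, the 300 dpi resolution and the "whitespace only means scanned" test are all assumptions chosen for illustration. The command-line options themselves (pdftotext with "-" for stdout, Ghostscript's -sDEVICE and -sOutputFile, tesseract's -l) are standard options of those tools.

```python
import glob
import os
import subprocess
import tempfile


def needs_ocr(extracted_text):
    """Assumption: OCR is needed when pdftotext returned only whitespace."""
    return not extracted_text.strip()


def gs_command(gs_binary, pdf_path, output_pattern, dpi=300):
    # Render each PDF page to a TIFF image (device and dpi are assumptions).
    return [gs_binary, "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4",
            "-r%d" % dpi, "-sOutputFile=%s" % output_pattern, pdf_path]


def tesseract_command(tesseract_binary, image_path, output_base, language="eng"):
    # tesseract writes its result to <output_base>.txt
    return [tesseract_binary, image_path, output_base, "-l", language]


def extract_text(pdf_path, language="eng"):
    # Step 1: try plain extraction; "-" sends pdftotext output to stdout.
    raw = subprocess.check_output(["pdftotext", pdf_path, "-"])
    text = raw.decode("utf-8", "replace")
    if not needs_ocr(text):
        return text

    # Step 2: fall back to OCR with the binaries named by $GS and $TESSERACT.
    gs_binary = os.environ["GS"]
    tesseract_binary = os.environ["TESSERACT"]
    pages = []
    with tempfile.TemporaryDirectory() as workdir:
        pattern = os.path.join(workdir, "page-%04d.tif")
        subprocess.check_call(gs_command(gs_binary, pdf_path, pattern))
        for image in sorted(glob.glob(os.path.join(workdir, "page-*.tif"))):
            base = os.path.splitext(image)[0]
            subprocess.check_call(
                tesseract_command(tesseract_binary, image, base, language))
            with open(base + ".txt", encoding="utf-8") as handle:
                pages.append(handle.read())
    return "\n".join(pages)
```

Keeping the command construction in separate functions makes it possible to inspect the exact Ghostscript and tesseract invocations without having the tools installed.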

Page views:

@@do_pdf_ocr_index - indexes documents in the queue
@@do_pdf_ocr_reindex - reindexes all pdf documents in the Plone site
@@pdf_ocr_status - shows the queue and a history of 10 documents
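
These views can also be called from a script, for example to trigger indexing by hand while testing. The sketch below only builds the URL and issues a plain HTTP request; the helper names and the example portal address are hypothetical, and it assumes the view is reachable without authentication, which may not hold on a real site.

```python
from urllib.request import urlopen

# View names taken from the list above.
VIEWS = ("do_pdf_ocr_index", "do_pdf_ocr_reindex", "pdf_ocr_status")


def view_url(portal_url, view):
    """Build the URL of one of the PDFtoOCR page views."""
    if view not in VIEWS:
        raise ValueError("unknown view: %s" % view)
    return portal_url.rstrip("/") + "/@@" + view


def call_view(portal_url, view, timeout=60):
    # Assumption: no login is required; returns the HTTP status code.
    with urlopen(view_url(portal_url, view), timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical portal address; adjust to the actual site.
    print(view_url("http://localhost:8080/Plone", "do_pdf_ocr_index"))
```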

Further reading:

http://plone.org/documentation/how-to/ocr-in-plone-using-tesseract-ocr/
http://code.google.com/p/tesseract-ocr/

* Make sure you don't have empty language files in /usr/local/share/tessdata/

OCRopus may be a good alternative in the future. It uses tesseract, but it is hard to set up and still too much of a beta: http://sites.google.com/site/ocropus/

Changelog

1.1

Compatible with Plone 4
Added a control panel page
Field 'text from ocr' is added using archetypes.schemaextender instead of a monkey patch
No more old style external method for doing things on the filesystem.
Added doc tests

1.0 - First release

Initial release

Products.PDFtoOCR 1.1

PDFtoOCR does OCR processing on PDF documents. The text from OCR is used in the search results.
