PDF Text Extraction using fitz / MuPDF (PyMuPDF) (Python)

2016-03-17T12:00:06-07:00

Python recipe 580626 by Jorj X. McKie (cbz, epub, mupdf, openxps, pdf, pymupdf, text_extraction, xps).

Extract all the text of a PDF (or another supported container type) at very high speed. The text pieces of a PDF page are generally not stored in natural reading order, but in the order in which they were entered during PDF creation. This script rearranges the text blocks according to their pixel coordinates to produce more readable output, i.e. top to bottom, left to right.
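
The recipe's own code is not reproduced on this page, so the following is only a minimal sketch of the reordering idea, not the author's implementation. It assumes each text block is a tuple beginning with (x0, y0, x1, y1, text), which is the shape returned by page.get_text("blocks") in recent PyMuPDF releases; the API available when the recipe was written may have differed. The names sort_blocks, extract_sorted_text and the tolerance parameter are hypothetical.

```python
def sort_blocks(blocks, tolerance=3.0):
    """Order text blocks top-down, then left-right.

    Each block is assumed to start with (x0, y0, x1, y1, text).
    Blocks whose top edges differ by at most `tolerance` from the first
    block of a row are treated as lying on that row (an assumption of
    this sketch, not something the recipe states).
    """
    # First pass: order by top edge, then by left edge.
    ordered = sorted(blocks, key=lambda b: (b[1], b[0]))

    # Second pass: group blocks into rows of nearly equal top edge.
    rows = []
    for block in ordered:
        if rows and abs(block[1] - rows[-1][0][1]) <= tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])

    # Within each row, order strictly left to right.
    result = []
    for row in rows:
        result.extend(sorted(row, key=lambda b: b[0]))
    return result


def extract_sorted_text(path):
    """Return the text of every page, with blocks in reading order.

    Requires PyMuPDF; it is imported lazily so that sort_blocks can be
    used and tested without it.
    """
    import fitz  # PyMuPDF

    doc = fitz.open(path)
    try:
        pages = []
        for page in doc:
            # Each entry: (x0, y0, x1, y1, text, block_no, block_type);
            # block_type 0 marks a text block.
            blocks = [b for b in page.get_text("blocks") if b[6] == 0]
            pages.append("\n".join(b[4].strip() for b in sort_blocks(blocks)))
        return "\n\n".join(pages)
    finally:
        doc.close()
```

Calling extract_sorted_text("input.pdf") (a placeholder file name) would return one string for the whole document. The row tolerance is a design choice of this sketch: a plain sort on (y0, x0) would put a right-hand block first whenever its top edge sits a fraction of a unit higher than its left-hand neighbour.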
