Popular recipes tagged "text_extraction"
http://code.activestate.com/recipes/tags/text_extraction/
2016-03-17T12:00:06-07:00
ActiveState Code Recipes
PDF Text Extraction using fitz / MuPDF (PyMuPDF) (Python)
2016-03-17T12:00:06-07:00
Jorj X. McKie
http://code.activestate.com/recipes/users/4193772/
http://code.activestate.com/recipes/580626-pdf-text-extraction-using-fitz-mupdf-pymupdf/
<p style="color: grey">
Python
recipe 580626
by <a href="/recipes/users/4193772/">Jorj X. McKie</a>
(<a href="/recipes/tags/cbz/">cbz</a>, <a href="/recipes/tags/epub/">epub</a>, <a href="/recipes/tags/mupdf/">mupdf</a>, <a href="/recipes/tags/openxps/">openxps</a>, <a href="/recipes/tags/pdf/">pdf</a>, <a href="/recipes/tags/pymupdf/">pymupdf</a>, <a href="/recipes/tags/text_extraction/">text_extraction</a>, <a href="/recipes/tags/xps/">xps</a>).
</p>
<p>Extract all the text of a PDF (or of another supported container type) at very high speed.
The text pieces of a PDF page are generally not stored in natural reading order, but in the order in which they were entered when the PDF was created.
This script rearranges the text blocks according to their pixel coordinates to produce more readable output, i.e. top-down, left-right.</p>
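<p>The re-ordering idea described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's own code: the <code>Block</code> tuple layout, the function names <code>sort_blocks</code> and <code>page_text</code>, and the <code>tolerance</code> parameter are all assumptions made for this example. It groups blocks whose top edges are nearly equal into one row, then reads rows top-down and each row left-right.</p>

```python
from typing import List, Tuple

# Hypothetical block layout for this sketch: bounding box corners
# (x0, y0, x1, y1) followed by the text contained in the box.
# y grows downwards, as is usual for page pixel coordinates.
Block = Tuple[float, float, float, float, str]


def sort_blocks(blocks: List[Block], tolerance: float = 3.0) -> List[Block]:
    """Return blocks ordered top-down, then left-right.

    A block joins the current row when its top edge is within
    `tolerance` of the top edge of the first block in that row.
    The tolerance value is an assumption, not taken from the recipe.
    """
    # First pass: order by top edge (ties broken by left edge).
    by_top = sorted(blocks, key=lambda b: (b[1], b[0]))

    # Second pass: collect blocks with nearly equal top edges into rows.
    rows: List[List[Block]] = []
    for block in by_top:
        if rows and abs(block[1] - rows[-1][0][1]) <= tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])

    # Third pass: within each row, read from left to right.
    ordered: List[Block] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b[0]))
    return ordered


def page_text(blocks: List[Block], tolerance: float = 3.0) -> str:
    """Join the text of the re-ordered blocks, one block per line."""
    return "\n".join(b[4].strip() for b in sort_blocks(blocks, tolerance))
```

<p>With a recent PyMuPDF release, tuples of this shape could presumably be built from the first five items of each entry returned by <code>page.get_text("blocks")</code>; that call is mentioned here as an assumption about the current library, not as what the 2016 recipe used.</p>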