Popular Python recipes tagged "parsing" http://code.activestate.com/recipes/langs/python/tags/parsing/ 2016-04-10T22:43:57-07:00 ActiveState Code Recipes
How to parse a table in a PDF document (Python) 2016-04-10T22:43:57-07:00 Jorj X. McKie http://code.activestate.com/recipes/users/4193772/ http://code.activestate.com/recipes/580635-how-to-parse-a-table-in-a-pdf-document/ <p style="color: grey"> Python recipe 580635 by <a href="/recipes/users/4193772/">Jorj X. McKie</a> (<a href="/recipes/tags/cbz/">cbz</a>, <a href="/recipes/tags/epub/">epub</a>, <a href="/recipes/tags/fitz/">fitz</a>, <a href="/recipes/tags/mupdf/">mupdf</a>, <a href="/recipes/tags/openxps/">openxps</a>, <a href="/recipes/tags/parsing/">parsing</a>, <a href="/recipes/tags/pdf/">pdf</a>, <a href="/recipes/tags/pymupdf/">pymupdf</a>, <a href="/recipes/tags/table/">table</a>, <a href="/recipes/tags/xps/">xps</a>). Revision 4. </p> <p>A Python function that converts a table contained in a page of a PDF (or OpenXPS, EPUB, CBZ, XPS) document to a matrix-like Python object (list of lists of strings).</p>
MicroXml: Stand-alone library for basic XML features (Python) 2015-12-04T22:36:56-08:00 Jack Trainor http://code.activestate.com/recipes/users/4076953/ http://code.activestate.com/recipes/579133-microxml-stand-alone-library-for-basic-xml-feature/ <p style="color: grey"> Python recipe 579133 by <a href="/recipes/users/4076953/">Jack Trainor</a> (<a href="/recipes/tags/parsing/">parsing</a>, <a href="/recipes/tags/xml/">xml</a>). </p> <p>MicroXml provides stand-alone support for the basic, most-used features of XML -- tags, attributes, and element values. It produces a DOM tree of XML nodes. It's compatible with Python 2.7 and Python 3. MicroXml does not support DTDs, CDATA sections and other advanced XML features.</p> <p>MicroXml is easy to use, and its nodes are easy to view and navigate in a debugger.
It also includes a minimal XPath-like implementation.</p>
A Basic USe flag EDitor for Gentoo Linux supporting on-the-fly editing (Python) 2015-02-28T07:04:31-08:00 Mike 'Fuzzy' Partin http://code.activestate.com/recipes/users/4179778/ http://code.activestate.com/recipes/579028-a-basic-use-flag-editor-for-gentoo-linux-supportin/ <p style="color: grey"> Python recipe 579028 by <a href="/recipes/users/4179778/">Mike 'Fuzzy' Partin</a> (<a href="/recipes/tags/parsing/">parsing</a>, <a href="/recipes/tags/popen/">popen</a>, <a href="/recipes/tags/subprocess/">subprocess</a>, <a href="/recipes/tags/user_input/">user_input</a>). </p> <p>This allows for on-the-fly editing. Simply drop abused.py into your path, and ensure that -a is not set in EMERGE_DEFAULT_OPTS in /etc/portage/make.conf. Then, whenever you install new packages, use abused in place of emerge (e.g. abused multitail). You will be presented with a list of the USE flags involved in this action and a prompt for editing any of them; simply hit Enter with no changes to fire off the build.</p>
Flexible datetime parsing (Python) 2012-08-21T07:35:34-07:00 Glenn Hutchings http://code.activestate.com/recipes/users/4175415/ http://code.activestate.com/recipes/578245-flexible-datetime-parsing/ <p style="color: grey"> Python recipe 578245 by <a href="/recipes/users/4175415/">Glenn Hutchings</a> (<a href="/recipes/tags/datetime/">datetime</a>, <a href="/recipes/tags/parsing/">parsing</a>). </p> <p>The strptime() method of datetime accepts a format string that you have to specify in advance. What if you want to be more flexible in the kinds of dates your program accepts?
Here's a recipe for a function that tries many different formats until it finds one that works.</p>
Extracting structured text or code (Python) 2011-05-18T13:04:01-07:00 Mike Sweeney http://code.activestate.com/recipes/users/4177990/ http://code.activestate.com/recipes/577700-extracting-structured-text-or-code/ <p style="color: grey"> Python recipe 577700 by <a href="/recipes/users/4177990/">Mike Sweeney</a> (<a href="/recipes/tags/parsing/">parsing</a>, <a href="/recipes/tags/structured/">structured</a>, <a href="/recipes/tags/text_processing/">text_processing</a>, <a href="/recipes/tags/token/">token</a>). Revision 2. </p> <p>This function uses regular expressions to extract parts of a structured text string. It can build a token list from many types of code and data formats. It finds string types (with quotes) and nested structures that use parentheses, brackets, and braces. If you need to extract a different syntax, you can provide a custom token pattern in the function arguments.</p>
python xml parsing (Python) 2011-05-28T19:36:00-07:00 abhijeet vaidya http://code.activestate.com/recipes/users/4178141/ http://code.activestate.com/recipes/577727-python-xml-parsing/ <p style="color: grey"> Python recipe 577727 by <a href="/recipes/users/4178141/">abhijeet vaidya</a> (<a href="/recipes/tags/parsing/">parsing</a>, <a href="/recipes/tags/python/">python</a>, <a href="/recipes/tags/xml/">xml</a>). </p> <p>XML parsing: how to extract data.</p>
Get columns of data from text files (Python) 2010-10-28T16:18:19-07:00 alinium http://code.activestate.com/recipes/users/4175605/ http://code.activestate.com/recipes/577444-get-columns-of-data-from-text-files/ <p style="color: grey"> Python recipe 577444 by <a href="/recipes/users/4175605/">alinium</a> (<a href="/recipes/tags/columns/">columns</a>, <a href="/recipes/tags/file/">file</a>, <a href="/recipes/tags/parsing/">parsing</a>).
</p> <p>Read in a tab-delimited (or otherwise separator-delimited, such as CSV) file and store each column in a list that can be referenced from a dictionary. The keys for the dictionary are the headings for the columns (if any). All data is read in as strings.</p>
Simple tabulator (Python) 2010-11-09T12:50:06-08:00 Noufal Ibrahim http://code.activestate.com/recipes/users/4173873/ http://code.activestate.com/recipes/577458-simple-tabulator/ <p style="color: grey"> Python recipe 577458 by <a href="/recipes/users/4173873/">Noufal Ibrahim</a> (<a href="/recipes/tags/parsing/">parsing</a>, <a href="/recipes/tags/text_processing/">text_processing</a>). </p> <p>This is a simple script to convert a top-to-bottom list of items into a left-to-right list.</p> <pre class="prettyprint"><code> a b c d e f g h i j k l m </code></pre> <p>into</p> <pre class="prettyprint"><code> a b c d e f g h i j k l m </code></pre> <p>A few command line options allow some amount of customisation. </p>
Dragon Lexical Analyzer (Python) 2010-09-01T14:49:37-07:00 Jack Trainor http://code.activestate.com/recipes/users/4076953/ http://code.activestate.com/recipes/577380-dragon-lexical-analyzer/ <p style="color: grey"> Python recipe 577380 by <a href="/recipes/users/4076953/">Jack Trainor</a> (<a href="/recipes/tags/educational/">educational</a>, <a href="/recipes/tags/lexical_analyzer/">lexical_analyzer</a>, <a href="/recipes/tags/parsing/">parsing</a>). Revision 2.
</p> <p>The lexical analyzer from "Compilers: Principles, Techniques and Tools," Chapter 2, by Aho, Sethi, Ullman (1986), implemented in Python.</p>
Simple regex engine, elementary Python (Python) 2010-07-10T10:43:30-07:00 Joost Behrends http://code.activestate.com/recipes/users/4174081/ http://code.activestate.com/recipes/577251-simple-regex-engine-elementary-python/ <p style="color: grey"> Python recipe 577251 by <a href="/recipes/users/4174081/">Joost Behrends</a> (<a href="/recipes/tags/cached/">cached</a>, <a href="/recipes/tags/parse/">parse</a>, <a href="/recipes/tags/parsing/">parsing</a>, <a href="/recipes/tags/recursion/">recursion</a>, <a href="/recipes/tags/regular_expressions/">regular_expressions</a>). Revision 40. </p> <p>A short engine for testing against a regex. It understands the 3 common quantifiers ?, +, * (non-greedy) working on characters, ., [...], [^...], \s, \S, bracketed patterns and group designators \N. It accepts unicode objects and fixed-width encoded strings (but there are problems with possible comparisons of trailing bytes in multi-byte UTF characters). It captures up to 10 groups ((?:...) is implemented), which can be used for back referencing and in xreplace(). Captured groups are accessible after the search in the global list xGroups. | is supported, but only inside groups and only with nested=True. With nested=False, '(' and ')' are treated as ordinary characters.</p> <p>This is not about Python or for Python; there it has little use beside re. But considering that re needs about 6,000 lines, you might agree with the author that these 176 lines are powerful. That was the reason to publish it as a recipe: as a kind of (fairly complete) minimal example of a regex tester and as an example of corresponding recursive structures in data (TokenListCache) and code.</p> <p>Working on this improved the author's understanding of regular expressions, especially of their possible "greed".
"Greedy" quantifiers are a concept that has to be explained separately and comes unexpectedly: whoever scans a text for <code>'&lt;.*&gt;'</code> is searching for SGML tags, not the whole text. Even with the star's "greediness", the code has to take care that <code>'.*'</code> doesn't eat the whole text and leave no match for <code>'&lt;.*&gt;'</code> at all. Thus the standard syntax with greedy quantifiers cannot be simpler to implement than this one, with its mere 3 lines (101, 111 and 121) preventing any greed. Perhaps it is faster; otherwise it is difficult to understand why the concept of "greed" exists at all.</p> <p>This engine might be useful here and there, in circumstances where nothing else is available. Its brevity eases translation to other languages and it can work with arbitrary characters for STAR or PERHAPS (for example).</p>
Parse HTTP date-time string (Python) 2010-01-20T13:47:50-08:00 Sridhar Ratnakumar http://code.activestate.com/recipes/users/4169511/ http://code.activestate.com/recipes/577015-parse-http-date-time-string/ <p style="color: grey"> Python recipe 577015 by <a href="/recipes/users/4169511/">Sridhar Ratnakumar</a> (<a href="/recipes/tags/datetime/">datetime</a>, <a href="/recipes/tags/http/">http</a>, <a href="/recipes/tags/parsing/">parsing</a>). </p> <p>This recipe will help you parse datetime strings returned by HTTP servers following the RFC 2616 standard (which <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.3">supports three datetime formats</a>).
Credit for this recipe goes to <a href="http://stackoverflow.com/questions/1471987/how-do-i-parse-an-http-date-string-in-python/1472336#1472336">ΤΖΩΤΖΙΟΥ</a>.</p>
Expression Evaluator (Python) 2009-06-02T23:57:44-07:00 Stephen Chappell http://code.activestate.com/recipes/users/2608421/ http://code.activestate.com/recipes/576790-expression-evaluator/ <p style="color: grey"> Python recipe 576790 by <a href="/recipes/users/2608421/">Stephen Chappell</a> (<a href="/recipes/tags/evaluation/">evaluation</a>, <a href="/recipes/tags/expressions/">expressions</a>, <a href="/recipes/tags/parsing/">parsing</a>). Revision 4. </p> <p>After reading a portion of the book "The C# Programming Language: Third Edition," I found a section in the introduction on abstract classes and methods, with an example built around the concept of expression trees. The code was easy to implement since it just had to be copied out of the book. After playing around with the program a little and extending it, I thought that it would be fun to write a program in C# that could (interactively) evaluate expressions and display the results. Not knowing C# quite as well as Python led to the following program, written and tested in Python 3.0 (not sure about previous versions).</p> <p>The first section of the code includes a port of the program from the aforementioned book along with extra code that allows for further features not originally included in the C# version. Those sections are clearly marked as being new code written by yours truly. The second area of the program has six functions that are profusely documented so as to explain how they go about parsing and processing expressions entered for evaluation. For those wishing to use the code, the "run" function should be all that you need.
The final part of the module contains a test program that can be used to check how well the program works.</p> <p>The parser is not very complicated and will accept both expressions that are normal in Python and ones that are completely illegal in Python. The main features are its ability to (1) identify simple assignment and mathematical operations, (2) identify constant floating point numbers, and (3) identify variables that would otherwise have no other meaning to the program. A limited number of error messages are given when appropriate but may leave one guessing what the problem really is. Mathematical operations are evaluated from left to right without regard to precedence, and assignment statements are evaluated from right to left.</p>
Simple Web Crawler (Python) 2011-01-31T21:57:58-08:00 James Mills http://code.activestate.com/recipes/users/4167757/ http://code.activestate.com/recipes/576551-simple-web-crawler/ <p style="color: grey"> Python recipe 576551 by <a href="/recipes/users/4167757/">James Mills</a> (<a href="/recipes/tags/crawler/">crawler</a>, <a href="/recipes/tags/network/">network</a>, <a href="/recipes/tags/parsing/">parsing</a>, <a href="/recipes/tags/web/">web</a>). Revision 2. </p> <p>NOTE: This recipe has been updated with suggested improvements since the last revision.</p> <p>This is a simple web crawler I wrote to test websites and links. It will traverse all links found to any given depth.</p> <p>See --help for usage.</p> <p>I'm posting this recipe as this kind of question has been asked on the Python Mailing List a number of times...
I thought I'd share my simple little implementation based on the standard library and BeautifulSoup.</p> <p>--JamesMills</p>
Nicer struct syntax thanks to Py3 metaclasses (Python) 2009-02-25T07:39:52-08:00 Daniel Brodie http://code.activestate.com/recipes/users/1892511/ http://code.activestate.com/recipes/576666-nicer-struct-syntax-thanks-to-py3-metaclasses/ <p style="color: grey"> Python recipe 576666 by <a href="/recipes/users/1892511/">Daniel Brodie</a> (<a href="/recipes/tags/binary/">binary</a>, <a href="/recipes/tags/parsing/">parsing</a>, <a href="/recipes/tags/py3/">py3</a>, <a href="/recipes/tags/struct/">struct</a>). </p> <p>This is a quick-hack module I wrote up in a couple of hours that allows for a nicer syntax to build up struct-like binary packing and unpacking. The point was to get it to be concise and as C-like as possible. This script requires Python 3 for its improved metaclass support.</p>
Parse call function for Py2.6 and Py2.7 (Python) 2009-02-28T20:13:15-08:00 Jervis Whitley http://code.activestate.com/recipes/users/4169341/ http://code.activestate.com/recipes/576671-parse-call-function-for-py26-and-py27/ <p style="color: grey"> Python recipe 576671 by <a href="/recipes/users/4169341/">Jervis Whitley</a> (<a href="/recipes/tags/ast/">ast</a>, <a href="/recipes/tags/call/">call</a>, <a href="/recipes/tags/function/">function</a>, <a href="/recipes/tags/namedtuple/">namedtuple</a>, <a href="/recipes/tags/nodevisitor/">nodevisitor</a>, <a href="/recipes/tags/parsing/">parsing</a>). Revision 14. </p> <p>In some cases it may be desirable to parse the string expression "f1(*args)" and return some of the key features of the represented function-like call. </p> <p>This recipe returns the key features in the form of a namedtuple. </p> <p>e.g.
(for the above)</p> <pre class="prettyprint"><code>&gt;&gt;&gt; explain("f1(*args)")
[ Call(func='f1', starargs='args') ]
</code></pre> <p>The recipe will return a list of such namedtuples for <code>"f1(*args)\nf2(*args)"</code>. Note that while the passed string expression must be valid Python syntax, names needn't be declared in the current scope.</p>
Remove the .pyc files from current directory tree and from svn (Python) 2009-02-03T23:38:43-08:00 Senthil Kumaran http://code.activestate.com/recipes/users/4165833/ http://code.activestate.com/recipes/576641-remove-the-pyc-files-from-current-directory-tree-a/ <p style="color: grey"> Python recipe 576641 by <a href="/recipes/users/4165833/">Senthil Kumaran</a> (<a href="/recipes/tags/parsing/">parsing</a>). </p> <p>I had mistakenly checked .pyc files into svn, so I took the approach of deleting all the .pyc files in the current working copy directory tree and then using svn remove to remove them from the repository. The following is the snippet I wrote for the purpose.</p>
Copy directory tree recursively while ignoring cvs, git and svn directories (Python) 2008-12-18T21:36:12-08:00 Senthil Kumaran http://code.activestate.com/recipes/users/4165833/ http://code.activestate.com/recipes/576588-copy-directory-tree-recursively-while-ignoring-cvs/ <p style="color: grey"> Python recipe 576588 by <a href="/recipes/users/4165833/">Senthil Kumaran</a> (<a href="/recipes/tags/parsing/">parsing</a>). </p> <p>I wanted to do a conditional copy of a directory tree. I noticed the ignore parameter introduced in Python 2.6. That's very handy. This snippet gives an example of its usage.</p>
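<p>The recipe body is not reproduced in this listing, so the following is only a minimal sketch of the idea, assuming the standard library's <code>shutil.copytree</code> with its <code>ignore</code> parameter and the <code>shutil.ignore_patterns</code> helper. The function names <code>copy_tree_without_vcs</code> and <code>demo_copy</code> and the directory layout are hypothetical, chosen for illustration, and the sketch targets Python 3.</p>

```python
import os
import shutil
import tempfile


def copy_tree_without_vcs(src, dst):
    """Copy src to dst, skipping CVS, .git and .svn directories.

    Hypothetical helper, not the recipe author's code.
    """
    # ignore_patterns() builds the callable that copytree expects for
    # `ignore`; it is applied to the entry names of every directory visited.
    ignore = shutil.ignore_patterns("CVS", ".git", ".svn")
    return shutil.copytree(src, dst, ignore=ignore)


def demo_copy():
    """Build a small throwaway tree, copy it, and report what was copied."""
    base = tempfile.mkdtemp()
    try:
        src = os.path.join(base, "src")
        os.makedirs(os.path.join(src, ".git"))
        os.makedirs(os.path.join(src, "pkg", ".svn"))
        with open(os.path.join(src, "pkg", "module.py"), "w") as fh:
            fh.write("print('hello')\n")

        dst = os.path.join(base, "dst")
        copy_tree_without_vcs(src, dst)

        top_level = sorted(os.listdir(dst))
        nested = sorted(os.listdir(os.path.join(dst, "pkg")))
        return top_level, nested
    finally:
        # Remove the temporary tree whether or not the copy succeeded.
        shutil.rmtree(base)


if __name__ == "__main__":
    print(demo_copy())
```

<p>Because the patterns are matched against bare entry names at each level, a nested <code>pkg/.svn</code> is skipped as well as a top-level <code>.git</code>. A custom callable taking <code>(directory, names)</code> and returning the names to skip can be passed as <code>ignore</code> instead when glob patterns are not enough.</p>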