Most viewed recipes tagged "parser" (ActiveState Code Recipes)
http://code.activestate.com/recipes/tags/parser/views/

A Simple Webcrawler (Python)
2012-03-03 by John (http://code.activestate.com/recipes/users/4181142/)
http://code.activestate.com/recipes/578060-a-simple-webcrawler/
Python recipe 578060 by John. Tags: crawler, html, page, parser, scraping, urllib, urlopen, web.
This is my simple web crawler. It takes as input a list of seed pages (web URLs) and 'scrapes' each page for all of its absolute-path links (i.e. links of the form http://...) and adds those to a dictionary. The web crawler can then take all the links found in the seed pages and scrape those as well. You can continue scraping as deep as you like, and you control how deep you go with the depth parameter passed to the WebCrawler method start_crawling(seed_pages, depth). Think of the depth as the recursion depth (the number of web pages deep you go before returning back up the tree).
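To illustrate the depth idea, here is a minimal sketch (not the recipe's code; the crawl function and the naive href regex are my own, and error handling is deliberately thin):

    import re
    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib import urlopen           # Python 2.7, which the recipe targets

    # Naive pattern for absolute http:// links inside href attributes.
    LINK_RE = re.compile(r'href="(http://[^"]+)"')

    def crawl(urls, depth, pages=None):
        """Map each visited URL to the absolute links found on its page."""
        pages = {} if pages is None else pages
        if depth < 0:
            return pages
        for url in urls:
            if url in pages:
                continue                      # already visited
            try:
                html = urlopen(url).read().decode('utf-8', 'replace')
            except Exception:
                continue                      # unreachable page: skip it
            pages[url] = LINK_RE.findall(html)
            crawl(pages[url], depth - 1, pages)   # recurse one level deeper
        return pages

With depth=1, the seed pages are scraped plus one level of the links found on them.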
To make this web crawler a little more interesting I added some bells and whistles. I added the ability to pass into the WebCrawler class constructor a regular expression object. The regular expression object is used to "filter" the links found during scraping. For example, in the code below you will see:

    cnn_url_regex = re.compile('(?<=[.]cnn)[.]com')  # cnn_url_regex is a regular expression object
    w = WebCrawler(cnn_url_regex)
This particular regular expression says:

1) Find an occurrence of the string '.com'.

2) Looking backwards from where '.com' was found, require that it is immediately preceded by '.cnn' (a lookbehind assertion).
Why do this? You can control where the crawler crawls. In this case I am constraining the crawler to operate on web pages within cnn.com.
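A quick check of that behavior (an illustrative snippet, not part of the recipe):

    import re

    cnn_url_regex = re.compile('(?<=[.]cnn)[.]com')

    # The lookbehind makes '.com' match only when immediately preceded by '.cnn':
    print(bool(cnn_url_regex.search('http://www.cnn.com/world')))  # True
    print(bool(cnn_url_regex.search('http://www.bbc.com/news')))   # False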
Another feature I added is the ability to parse a given page looking for specific HTML tags. As an example I chose the <h1> tag: once an <h1> tag is found, I store all the words found inside the tag in a dictionary that gets associated with the page URL.
Why do this? My thought was that if I scraped the page for text, I could eventually use this data to answer search requests. Say I searched for 'LeBron James', and suppose one of the pages my crawler scraped contained an article that mentions LeBron James many times. In response to the search request I could return the link to that article.
The web crawler is implemented in the WebCrawler class. It has two functions the user should call:

1) start_crawling(seed_pages, depth)

2) print_all_page_text()  # only used for debugging

The rest of WebCrawler's functions are internal and should not be called by the user (think private in C++).
Upon construction, a WebCrawler object creates a MyHTMLParser object. The MyHTMLParser class inherits from the built-in Python class HTMLParser. I use the MyHTMLParser object when searching for the <h1> tag. The MyHTMLParser class creates instances of a helper class named Tag; the Tag class is used to build a "linked list" of tags.
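As a hedged sketch of the <h1>-scraping idea only (the recipe's MyHTMLParser also builds the Tag linked list, which is omitted here, and the class name H1WordCollector is mine):

    try:
        from html.parser import HTMLParser   # Python 3
    except ImportError:
        from HTMLParser import HTMLParser    # Python 2.7, which the recipe targets

    class H1WordCollector(HTMLParser):
        """Collect the words found inside <h1> tags, with occurrence counts."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.in_h1 = False
            self.words = {}

        def handle_starttag(self, tag, attrs):
            if tag == 'h1':
                self.in_h1 = True

        def handle_endtag(self, tag):
            if tag == 'h1':
                self.in_h1 = False

        def handle_data(self, data):
            if self.in_h1:
                for word in data.split():
                    self.words[word] = self.words.get(word, 0) + 1

    parser = H1WordCollector()
    parser.feed('<html><h1>LeBron James scores again</h1></html>')
    print(parser.words)   # {'LeBron': 1, 'James': 1, 'scores': 1, 'again': 1}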
So to get started with WebCrawler, make sure to use Python 2.7.2. Enter the code a piece at a time into IDLE in the order displayed below; this ensures that you import libraries before you start using them.

Once you have entered all the code into IDLE, you can start crawling the 'interwebs' by entering the following:

    import re
    cnn_url_regex = re.compile('(?<=[.]cnn)[.]com')
    w = WebCrawler(cnn_url_regex)
    w.start_crawling(['http://www.cnn.com/2012/02/24/world/americas/haiti-pm-resigns/index.html?hpt=hp_t3'], 1)
Of course you can enter any page you want, but the regular expression object is already set up to filter on cnn.com. Remember, the second parameter passed to start_crawling is the recursion depth.

Happy Crawling!
easyjson.py - parsing JSON from buffer and from file (Python)
2013-05-27 by Thomas Lehmann (http://code.activestate.com/recipes/users/4174477/)
http://code.activestate.com/recipes/578529-easyjsonpy-parsing-json-from-buffer-and-from-file/
Python recipe 578529 by Thomas Lehmann. Tags: json, parser. Revision 4.
JSON Parser:

- Referring to http://www.json.org/
- Just one simple Python file you can integrate wherever you want
- No imports (very important) -> no dependencies!!!
- Should work with really old versions of Python!!!

Todo's:

- Doesn't cover the full number format
- ...

Done:

- Allows string in string (revision 2)
- Covers objects in an array (revision 2)
- Provides a mechanism to allow other dictionaries, like collections.OrderedDict (revision 3); see the sketch after these lists
- Conversion of numbers to integer or float types (revision 4)
Parser Keylogger based on a Finite State Machine (Python)
2011-11-21 by Filippo Squillace (http://code.activestate.com/recipes/users/4174931/)
http://code.activestate.com/recipes/577952-parser-keylogger-based-on-a-finite-state-machine/
Python recipe 577952 by Filippo Squillace. Tags: keylogger, parser, state_machine. Revision 3.
This program parses the log file produced by the keylogger command 'script -c "xinput test ID_CODE" | cat LOG_FILE'. It is based on a Finite State Machine (FSM) that manages all possible combinations of the modifier keys; the combination of currently held modifiers is the state of the FSM. The parser obtains the mapping from each (keycode, modifier) pair to the corresponding character via the xmodmap command. It can also handle modifier keys such as Control or Super that don't produce a printable character on their own. To introduce new states representing new combinations of modifiers, you only need to update the list of states (mod_keys) and add the appropriate rules to the transition function. For example, to introduce a Caps Lock state, add it to mod_keys and make the transition data structure handle the release event of the corresponding key. Because it depends on xmodmap, the parser works only on X11-based systems.
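A hedged sketch of the FSM idea (the keycodes, keymap, and function names below are illustrative, not the recipe's values):

    # Example X keycodes for modifier keys (illustrative values).
    MOD_KEYS = {50: 'shift', 37: 'control', 133: 'super'}

    # The kind of mapping xmodmap supplies: (keycode, held modifiers) -> character.
    KEYMAP = {
        (38, frozenset()): 'a',
        (38, frozenset(['shift'])): 'A',
    }

    def parse_events(events):
        """events: iterable of (action, keycode), with action 'press' or 'release'."""
        held = set()   # the FSM state: which modifiers are currently held
        out = []
        for action, keycode in events:
            if keycode in MOD_KEYS:
                # Transition: modifier events change the state, emit nothing.
                mod = MOD_KEYS[keycode]
                if action == 'press':
                    held.add(mod)
                else:
                    held.discard(mod)
            elif action == 'press':
                # Ordinary key: look up the char for (keycode, current state).
                char = KEYMAP.get((keycode, frozenset(held)))
                if char is not None:
                    out.append(char)
        return ''.join(out)

    # Press 'a', then Shift+'a': yields 'aA'.
    print(parse_events([('press', 38), ('release', 38),
                        ('press', 50), ('press', 38),
                        ('release', 38), ('release', 50)]))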
Evaluator 2.0 (Python)
2010-11-25 by Stephen Chappell (http://code.activestate.com/recipes/users/2608421/)
http://code.activestate.com/recipes/577469-evaluator-20/
Python recipe 577469 by Stephen Chappell. Tags: evaluator, expressions, math, parser.
This is a complete rewrite of recipe 576790 (http://code.activestate.com/recipes/576790/). While aiming to maintain similar functionality, and continuing its implementation for self-educational purposes, a much cleaner parser/tokenizer and operator execution engine were developed. A slightly different math syntax is supported in this version, but it is arguably better and more capable than before. Base prefixes are now supported; the single downgrade is that calculations use integers instead of floats.
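For a sense of what base prefixes with integer math mean (shown here with Python's own int(), not the recipe's code; the evaluator's exact syntax may differ):

    # int(token, 0) honors the 0b/0o/0x base prefixes, yielding plain integers:
    for token in ('0b1010', '0o17', '0xFF', '42'):
        print(token, '->', int(token, 0))
    # 0b1010 -> 10, 0o17 -> 15, 0xFF -> 255, 42 -> 42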
slurp.py (Regex based simple parsing engine) (Python)
2013-05-26 by Mike 'Fuzzy' Partin (http://code.activestate.com/recipes/users/4179778/)
http://code.activestate.com/recipes/578532-slurppy-regex-based-simple-parsing-engine/
Python recipe 578532 by Mike 'Fuzzy' Partin. Tags: parser, regex, text_processing.
A parsing engine that lets you define sets of patterns and callbacks and process any I/O object in Python that has a readline() method.
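A hedged sketch of that pattern/callback idea (the class and method names here are hypothetical, not slurp.py's actual API):

    import re
    from io import StringIO

    class Engine:
        """Dispatch lines from any readline()-capable object to regex callbacks."""
        def __init__(self):
            self.rules = []   # list of (compiled pattern, callback) pairs

        def add(self, pattern, callback):
            self.rules.append((re.compile(pattern), callback))

        def process(self, stream):
            while True:
                line = stream.readline()
                if not line:          # readline() returns '' at EOF
                    break
                for regex, callback in self.rules:
                    match = regex.search(line)
                    if match:
                        callback(match)

    # Usage with an in-memory stream; any object with readline() works the same way.
    engine = Engine()
    engine.add(r'ERROR: (?P<msg>.+)', lambda m: print('got: ' + m.group('msg')))
    engine.process(StringIO('ok\nERROR: disk full\n'))   # prints: got: disk full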