Popular recipes tagged "meta:requires=htmlentitydefs" (ActiveState Code Recipes)
http://code.activestate.com/recipes/tags/meta:requires=htmlentitydefs/

A Simple Webcrawler (Python)
2012-03-03
http://code.activestate.com/recipes/578060-a-simple-webcrawler/
Python recipe 578060 by John (tags: crawler, html, page, parser, scraping, urllib, urlopen, web).
This is my simple web crawler. It takes as input a list of seed pages (web URLs) and 'scrapes' each page for all of its absolute links (i.e. links in the format http://...), adding those to a dictionary. The web crawler can then take all the links found in the seed pages and scrape those as well, and so on, as deep as you like. You control how deep you go with the depth argument passed to the WebCrawler method start_crawling(seed_pages, depth). Think of the depth as the recursion depth (the number of web pages deep you go before returning back up the tree).
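For illustration, here is a minimal sketch of that depth-limited recursion. This is a hypothetical simplification in Python 3, not the recipe's code (which targets Python 2 and keeps more state); get_links is an invented helper:

    import re
    import urllib.request

    def get_links(url):
        """Hypothetical helper: fetch a page and return its absolute links."""
        html = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
        return re.findall(r'href="(http://[^"]+)"', html)

    def crawl(urls, depth, visited=None):
        """Depth-limited recursive crawl; visited maps url -> links found there."""
        if visited is None:
            visited = {}
        if depth < 0:
            return visited
        for url in urls:
            if url not in visited:              # don't fetch the same page twice
                visited[url] = get_links(url)
                crawl(visited[url], depth - 1, visited)
        return visited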
To make this web crawler a little more interesting, I added some bells and whistles. I added the ability to pass a regular expression object into the WebCrawler class constructor. The regular expression object is used to "filter" the links found during scraping. For example, in the code below you will see:
    cnn_url_regex = re.compile('(?<=[.]cnn)[.]com')   # cnn_url_regex is a regular expression object
    w = WebCrawler(cnn_url_regex)
This particular regular expression says:

1) Find an occurrence of the string '.com'.

2) Then, looking backwards from where '.com' was found, require that it is immediately preceded by '.cnn' (a lookbehind assertion).
Why do this?

You can control where the crawler crawls. In this case I am constraining the crawler to operate on web pages within cnn.com.
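A quick demonstration of the filter (the regex is the recipe's own; the test strings are mine):

    import re

    cnn_url_regex = re.compile('(?<=[.]cnn)[.]com')   # '.com' preceded by '.cnn'

    print(bool(cnn_url_regex.search('http://www.cnn.com/world')))   # True
    print(bool(cnn_url_regex.search('http://www.bbc.com/news')))    # False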
Another feature I added was the ability to parse a given page looking for specific HTML tags. As an example I chose the <h1> tag. Once an <h1> tag is found, I store all the words found inside the tag in a dictionary that gets associated with the page URL.

Why do this?

My thought was that if I scraped pages for their text, I could eventually use this data to answer search requests. Say I searched for 'LeBron James', and suppose one of the pages my crawler scraped contained an article that mentions LeBron James many times. In response to the search request I could return the link to the article about LeBron James.
The web crawler is described in the WebCrawler class. It has two methods the user should call:

1) start_crawling(seed_pages, depth)

2) print_all_page_text()  # only used for debug purposes

The rest of WebCrawler's methods are internal and should not be called by the user (think private in C++).
Upon construction, a WebCrawler object creates a MyHTMLParser object. The MyHTMLParser class inherits from the built-in Python class HTMLParser. I use the MyHTMLParser object when searching for the <h1> tag. MyHTMLParser creates instances of a helper class named Tag; the Tag class is used to build a "linked list" of tags.
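The recipe's parser tracks more state than this, but a minimal sketch of the <h1>-scraping idea might look like the following (html.parser is the Python 3 name of the HTMLParser module; the class name and dictionary layout here are my simplification, not the recipe's):

    from html.parser import HTMLParser   # Python 2: from HTMLParser import HTMLParser

    class H1WordParser(HTMLParser):
        """Hypothetical simplification of MyHTMLParser: count words inside <h1> tags."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.in_h1 = False
            self.words = {}                     # word -> times seen

        def handle_starttag(self, tag, attrs):
            if tag == 'h1':
                self.in_h1 = True

        def handle_endtag(self, tag):
            if tag == 'h1':
                self.in_h1 = False

        def handle_data(self, data):
            if self.in_h1:
                for word in data.split():
                    self.words[word] = self.words.get(word, 0) + 1

    parser = H1WordParser()
    parser.feed('<h1>Haiti PM resigns</h1><p>body text is ignored</p>')
    print(parser.words)                         # {'Haiti': 1, 'PM': 1, 'resigns': 1}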
To get started with WebCrawler, make sure to use Python 2.7.2. Enter the code a piece at a time into IDLE in the order displayed below; this ensures that you import libraries before you start using them. Once you have entered all the code into IDLE, you can start crawling the 'interwebs' by entering the following:
    import re
    cnn_url_regex = re.compile('(?<=[.]cnn)[.]com')
    w = WebCrawler(cnn_url_regex)
    w.start_crawling(['http://www.cnn.com/2012/02/24/world/americas/haiti-pm-resigns/index.html?hpt=hp_t3'], 1)
Of course you can enter any page you want, but the regular expression object is already set up to filter on cnn.com. Remember, the second parameter passed into the start_crawling function is the recursion depth.

Happy Crawling!
Cross-site scripting (XSS) defense (Python)
2006-08-05
http://code.activestate.com/recipes/496942-cross-site-scripting-xss-defense/
Python recipe 496942 by Josh Goldfoot (tags: web).
This cleanses user input of potentially dangerous HTML or scripting code that can be used to launch "cross-site scripting" ("XSS") attacks, or to run other harmful or annoying code. You want to run this on any user-entered text that will be saved and retransmitted to other users of your web site. It uses only standard Python libraries.
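The recipe's own code parses the input rather than rejecting it wholesale; as a minimal illustration of just the escaping half of this defense (Python 3's html.escape shown; the Python 2 counterpart of the era was cgi.escape):

    from html import escape   # Python 2 equivalent: cgi.escape

    def cleanse(user_input):
        """Simplest form of the defense: make all markup inert by escaping it."""
        return escape(user_input, quote=True)

    print(cleanse('<script>alert("xss")</script>'))
    # &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;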
Yet another reinvention of a Python HTML generation mechanism (Python)
2005-11-22
http://code.activestate.com/recipes/440563-yet-another-reinvention-of-a-python-html-generatio/
Python recipe 440563 by Josiah Carlson (tags: web). Revision 3.
The other day I was complaining about writing HTML, forms, etc., for Python CGI and/or web programming. I had pointed out a selection of three examples, the first of which ended up being very much like Nevow.stan. Thinking a bit about it, I realized that stan had issues: you couldn't really re-use pre-defined tags with attributes via map, and keyword arguments were just too darn convenient, so I swapped the calling and getitem syntax.
Instead, I hacked together a mechanism that supports:

    T.tagname("content", T.tagname(...), ..., attr1='value', ...)
    T.tagname(attr1='value', ...)("content", T.tagname(...), ...)
    x = T.tagname(attr1='value', ...)
    y = T.tagname(*map(x, ['content', ...]))

...and many other options.
Essentially, you can mix and match calls as much as you want, with three memory- and sanity-saving semantics:

1. Creating a new tag object via T.tagname, or any call of such, creates a shallow copy of the object you are accessing.

2. smallred = T.font(size='-1', color='red'); bigred = smallred(size='+1') works exactly the way you expect it to. If it doesn't work the way you expect it to, then your expectations are confused.

3. If you add content that sits within the tag, the content is replaced, not updated the way attributes are.
This simple version handles auto-indentation of content as necessary (or desirable), auto-escaping of text elements, and includes an (I believe) nearly complete listing of the tags which don't require closing.
I don't know where this is going, whether it can or will expand into something more, or what, but I believe what I have managed to hack together is better than other similar packages available elsewhere (including recipe 366000 at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/366000, which I discovered after writing my own). Funny how these things work out. Astute observers will note that I borrow nevow.stan's meme of using T.tagname for generating tag objects.
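This is not the recipe's implementation, but a toy sketch of the interface just described, showing the copy-on-call, attribute-update, and content-replace semantics:

    class Tag:
        """Toy tag builder: T.tagname(content..., attr=value) renders to HTML."""
        def __init__(self, name, children=(), attrs=None):
            self.name = name
            self.children = list(children)
            self.attrs = dict(attrs or {})

        def __call__(self, *children, **attrs):
            # Calling returns a shallow copy: new attributes update the old
            # ones, while new content replaces the old content.
            merged = dict(self.attrs)
            merged.update(attrs)
            return Tag(self.name, children or self.children, merged)

        def __str__(self):
            attrs = ''.join(' %s="%s"' % kv for kv in sorted(self.attrs.items()))
            inner = ''.join(str(c) for c in self.children)
            return '<%s%s>%s</%s>' % (self.name, attrs, inner, self.name)

    class _T:
        def __getattr__(self, name):
            return Tag(name)

    T = _T()

    smallred = T.font(size='-1', color='red')
    bigred = smallred(size='+1')       # copy of smallred with size updated
    print(bigred('warning'))           # <font color="red" size="+1">warning</font>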
xmlgettext.py (Python)
2003-07-28
http://code.activestate.com/recipes/212728-xmlgettextpy/
Python recipe 212728 by Fritz Cizmarov (tags: xml).
Extracts the texts from an XML file and writes them into a *.pot (gettext template) file.
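The recipe's source isn't included in this listing; a minimal sketch of the same idea with xml.etree.ElementTree might look like this (the .pot header and the escaping are simplified assumptions):

    import sys
    import xml.etree.ElementTree as ET

    def xml_to_pot(xml_path, pot_path):
        """Collect the text of every element and write gettext msgid entries."""
        seen = set()
        with open(pot_path, 'w', encoding='utf-8') as pot:
            pot.write('msgid ""\nmsgstr ""\n"Content-Type: text/plain; charset=UTF-8\\n"\n\n')
            for elem in ET.parse(xml_path).iter():
                text = (elem.text or '').strip()
                if text and text not in seen:   # one entry per unique string
                    seen.add(text)
                    escaped = text.replace('\\', '\\\\').replace('"', '\\"')
                    pot.write('msgid "%s"\nmsgstr ""\n\n' % escaped)

    if __name__ == '__main__':
        xml_to_pot(sys.argv[1], sys.argv[2])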
Strip tags and Javascript from HTML page, leaving only safe tags (Python)
2001-03-19
http://code.activestate.com/recipes/52281-strip-tags-and-javascript-from-html-page-leaving-o/
Python recipe 52281 by Itamar Shtull-Trauring (tags: web).
Sometimes we get HTML input from the user. We want to allow only valid, safe tags, we want all tags to be balanced (an unclosed <b>, for example, would leave all following text on your page bold), and we want to strip out all Javascript.
This recipe demonstrates how to do this using the sgmllib parser to parse HTML.
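sgmllib exists only in Python 2 (it was removed in Python 3); the sketch below uses html.parser as a stand-in to show the same whitelist-and-balance idea, with an assumed set of safe tags:

    from html import escape
    from html.parser import HTMLParser

    ALLOWED = {'b', 'i', 'em', 'strong', 'p', 'a', 'br'}   # assumed whitelist

    class TagStripper(HTMLParser):
        """Keep only allowed tags (attribute-free), escape text, balance tags."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.out = []
            self.open_tags = []

        def handle_starttag(self, tag, attrs):
            if tag in ALLOWED:                  # drop attrs: no onclick=, etc.
                self.out.append('<%s>' % tag)
                if tag != 'br':
                    self.open_tags.append(tag)

        def handle_endtag(self, tag):
            if tag in self.open_tags:
                self.open_tags.remove(tag)
                self.out.append('</%s>' % tag)

        def handle_data(self, data):
            self.out.append(escape(data))

    def strip_tags(text):
        parser = TagStripper()
        parser.feed(text)
        # Balance: close anything the input left open.
        closers = ''.join('</%s>' % t for t in reversed(parser.open_tags))
        return ''.join(parser.out) + closers

    print(strip_tags('<b>bold <script>evil()</script> and unclosed'))
    # <b>bold evil() and unclosed</b>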