Latest recipes tagged "meta:requires=httplib"
http://code.activestate.com/recipes/tags/meta:requires=httplib/new/
2013-05-26T10:54:25-07:00
ActiveState Code Recipes

A Simple Webcrawler (Python)
2012-03-03T02:37:30-08:00
John
http://code.activestate.com/recipes/users/4181142/
http://code.activestate.com/recipes/578060-a-simple-webcrawler/
<p style="color: grey">
Python
recipe 578060
by <a href="/recipes/users/4181142/">John</a>
(<a href="/recipes/tags/crawler/">crawler</a>, <a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/page/">page</a>, <a href="/recipes/tags/parser/">parser</a>, <a href="/recipes/tags/scraping/">scraping</a>, <a href="/recipes/tags/urllib/">urllib</a>, <a href="/recipes/tags/urlopen/">urlopen</a>, <a href="/recipes/tags/web/">web</a>).
</p>
<p>This is my simple web crawler. It takes as input a list of seed pages (web URLs) and 'scrapes' each page for all of its absolute-path links (i.e. links in the format <a href="http://" rel="nofollow">http://</a>) and adds those to a dictionary. The web crawler can then take all the links found in the seed pages and scrape those as well. You can continue scraping as deep as you like, and you control how deep you go with the depth argument passed to the WebCrawler method start_crawling(seed_pages, depth). Think of depth as the recursion depth (the number of web pages deep you go before returning back up the tree).</p>
<p>To make this web crawler a little more interesting I added some bells and whistles. I added the ability to pass into the WebCrawler class constructor a regular expression object. The regular expression object is used to "filter" the links found during scraping. For example, in the code below you will see:</p>
<p>cnn_url_regex = re.compile('(?<=[.]cnn)[.]com') # cnn_url_regex is a regular expression object</p>
<p>w = WebCrawler(cnn_url_regex)</p>
<p>This particular regular expression says:</p>
<p>1) Find the first occurrence of the string '.com'</p>
<p>2) Then, looking backwards from where '.com' was found, check that it is immediately preceded by '.cnn'</p>
<p>Why do this?</p>
<p>You can control where the crawler crawls. In this case I am constraining the crawler to operate on webpages within cnn.com.</p>
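<p>A quick way to see what that filter does (a standalone check, not part of the recipe's code):</p>

```python
import re

# The lookbehind pattern from the recipe: match '.com' only when
# it is immediately preceded by '.cnn'.
cnn_url_regex = re.compile('(?<=[.]cnn)[.]com')

# URLs inside cnn.com pass the filter...
print(bool(cnn_url_regex.search('http://www.cnn.com/world/index.html')))  # True
# ...while other .com domains do not.
print(bool(cnn_url_regex.search('http://www.example.com/index.html')))    # False
```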
<p>Another feature I added was the ability to parse a given page looking for specific HTML tags. I chose the <h1> tag as an example. Once an <h1> tag is found, I store all the words found inside it in a dictionary that gets associated with the page URL.</p>
<p>Why do this?</p>
<p>My thought was that if I scraped the page for text I could eventually use this data to answer search requests. Say I searched for 'LeBron James', and suppose one of the pages my crawler scraped contained an article that mentions LeBron James many times. In response to the search request I could return the link to that article.</p>
<p>The web crawler is described in the WebCrawler class. It has 2 functions the user should call:</p>
<p>1) start_crawling(seed_pages,depth)</p>
<p>2) print_all_page_text() # this is only used for debug purposes</p>
<p>The rest of WebCrawler's functions are internal functions that should not be called by the user (think private in C++).</p>
<p>Upon construction, a WebCrawler object creates a MyHTMLParser object. The MyHTMLParser class inherits from the built-in Python class HTMLParser. I use the MyHTMLParser object when searching for the <h1> tag. The MyHTMLParser class creates instances of a helper class named Tag; the Tag class is used to build a "linked list" of tags.</p>
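<p>A minimal sketch of that idea, assuming Python 3's html.parser spelling (the recipe itself targets Python 2's HTMLParser module, and its MyHTMLParser/Tag classes do more bookkeeping than this):</p>

```python
from html.parser import HTMLParser  # Python 2 spells this: from HTMLParser import HTMLParser

class H1WordParser(HTMLParser):
    """Collect the words found inside <h1> tags -- a sketch of what the
    recipe's MyHTMLParser does; the class name here is illustrative."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_h1 = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.words.extend(data.split())

parser = H1WordParser()
parser.feed('<html><body><h1>Haiti PM resigns</h1><p>ignored</p></body></html>')
print(parser.words)  # ['Haiti', 'PM', 'resigns']
```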
<p>So to get started with WebCrawler, make sure to use Python 2.7.2. Enter the code a piece at a time into IDLE in the order displayed below; this ensures that you import libraries before you start using them.</p>
<p>Once you have entered all the code into IDLE, you can start crawling the 'interwebs' by entering the following:</p>
<p>import re</p>
<p>cnn_url_regex = re.compile('(?<=[.]cnn)[.]com') </p>
<p>w = WebCrawler(cnn_url_regex)</p>
<p>w.start_crawling(['http://www.cnn.com/2012/02/24/world/americas/haiti-pm-resigns/index.html?hpt=hp_t3'],1)</p>
<p>Of course you can enter any page you want. But the regular expression object is already set up to filter on <a href="http://cnn.com" rel="nofollow">cnn.com</a>. Remember, the second parameter passed into the start_crawling function is the recursion depth.</p>
<p>Happy Crawling!</p>
HTTPS httplib Client Connection with Certificate Validation (Python)
2011-01-18T18:30:45-08:00
Marcelo Fernández
http://code.activestate.com/recipes/users/4173551/
http://code.activestate.com/recipes/577548-https-httplib-client-connection-with-certificate-v/
<p style="color: grey">
Python
recipe 577548
by <a href="/recipes/users/4173551/">Marcelo Fernández</a>
(<a href="/recipes/tags/certificate/">certificate</a>, <a href="/recipes/tags/client/">client</a>, <a href="/recipes/tags/client_server/">client_server</a>, <a href="/recipes/tags/httplib/">httplib</a>, <a href="/recipes/tags/https/">https</a>, <a href="/recipes/tags/networking/">networking</a>, <a href="/recipes/tags/ssl/">ssl</a>, <a href="/recipes/tags/validation/">validation</a>).
</p>
<p>Although httplib.HTTPSConnection lets the programmer specify the client's certificate pair, it does not make the underlying SSL library validate the server's certificate against a set of trusted CA certificates (from the client's point of view).</p>
<p>This class forces that check, ensuring the Python client is connecting to the right server.</p>
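<p>A sketch of the same guarantee in Python 3 terms (the recipe itself subclasses Python 2's httplib.HTTPSConnection and wraps the socket with certificate checking turned on; the class and parameter names here are illustrative):</p>

```python
import ssl
import http.client  # Python 3's spelling of httplib

class VerifiedHTTPSConnection(http.client.HTTPSConnection):
    """HTTPSConnection that refuses to talk to a server whose
    certificate does not validate."""
    def __init__(self, host, ca_file=None, **kwargs):
        # Trust only ca_file (or the system CA store when it is None),
        # require a valid server certificate, and check the hostname.
        self.ssl_context = ssl.create_default_context(cafile=ca_file)
        self.ssl_context.check_hostname = True
        self.ssl_context.verify_mode = ssl.CERT_REQUIRED
        http.client.HTTPSConnection.__init__(
            self, host, context=self.ssl_context, **kwargs)

# Constructing the object does not open the connection yet.
conn = VerifiedHTTPSConnection('www.example.com')
```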
Pastebin Upload (Python)
2013-05-26T10:54:25-07:00
Joe Smith
http://code.activestate.com/recipes/users/4168055/
http://code.activestate.com/recipes/576805-pastebin-upload/
<p style="color: grey">
Python
recipe 576805
by <a href="/recipes/users/4168055/">Joe Smith</a>
(<a href="/recipes/tags/code/">code</a>, <a href="/recipes/tags/post/">post</a>, <a href="/recipes/tags/source/">source</a>, <a href="/recipes/tags/urllib2/">urllib2</a>).
Revision 2.
</p>
<p>A little script I made for some buddies and me. We are constantly collaborating on code. This script takes a source code file as its parameter and uploads it to <a href="http://pastebin.com" rel="nofollow">pastebin.com</a> or any subdomain of pastebin.
I integrated it with the right-click context menu in Windows; without that integration the script wouldn't be as cool!
Hope others find it useful.</p>
Python HTTP Pipelining (Python)
2009-02-27T16:21:23-08:00
Markus J
http://code.activestate.com/recipes/users/4169350/
http://code.activestate.com/recipes/576673-python-http-pipelining/
<p style="color: grey">
Python
recipe 576673
by <a href="/recipes/users/4169350/">Markus J</a>
(<a href="/recipes/tags/http/">http</a>, <a href="/recipes/tags/pipelining/">pipelining</a>).
Revision 5.
</p>
<p>Gets several pages in parallel, without threads. It exploits HTTP pipelining by resetting the state of HTTPConnection to trick it into sending the next request ahead of time.</p>
<p>More information about HTTP pipelining can be found on Wikipedia: <a href="http://en.wikipedia.org/wiki/HTTP_pipelining">http://en.wikipedia.org/wiki/HTTP_pipelining</a></p>
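<p>The wire format the trick produces looks like this (a sketch of pipelined requests only; the recipe's actual code manipulates HTTPConnection's internal state machine rather than building raw bytes):</p>

```python
# Several GET requests written back-to-back on one connection -- the
# essence of HTTP pipelining: the client does not wait for a response
# before sending the next request.
def pipelined_requests(host, paths):
    requests = []
    for i, path in enumerate(paths):
        last = (i == len(paths) - 1)
        requests.append(
            'GET %s HTTP/1.1\r\n'
            'Host: %s\r\n'
            'Connection: %s\r\n'
            '\r\n' % (path, host, 'close' if last else 'keep-alive')
        )
    return ''.join(requests).encode('ascii')

payload = pipelined_requests('en.wikipedia.org',
                             ['/wiki/HTTP_pipelining', '/wiki/HTTP'])
```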
Search Google scholar (Python)
2007-07-13T15:20:12-07:00
Yusdi Santoso
http://code.activestate.com/recipes/users/4068334/
http://code.activestate.com/recipes/523047-search-google-scholar/
<p style="color: grey">
Python
recipe 523047
by <a href="/recipes/users/4068334/">Yusdi Santoso</a>
.
</p>
<p>This code allows you to search Google Scholar from Python code. The result is returned in a nice dictionary format with each field addressed by its key.</p>
comixGetter (Python)
2007-06-24T13:24:47-07:00
sami jan
http://code.activestate.com/recipes/users/4064232/
http://code.activestate.com/recipes/522983-comixgetter/
<p style="color: grey">
Python
recipe 522983
by <a href="/recipes/users/4064232/">sami jan</a>
(<a href="/recipes/tags/web/">web</a>).
Revision 2.
</p>
<p>Download daily comics from <a href="http://comics.com" rel="nofollow">comics.com</a> and <a href="http://ucomics/gocomics.com" rel="nofollow">ucomics/gocomics.com</a>, e.g. Peanuts, Dilbert, Calvin & Hobbes, etc.</p>
Caching and throttling for urllib2 (Python)
2006-04-14T15:59:41-07:00
Staffan Malmgren
http://code.activestate.com/recipes/users/1149346/
http://code.activestate.com/recipes/491261-caching-and-throttling-for-urllib2/
<p style="color: grey">
Python
recipe 491261
by <a href="/recipes/users/1149346/">Staffan Malmgren</a>
(<a href="/recipes/tags/web/">web</a>).
</p>
<p>This code implements a cache (CacheHandler) and a throttling mechanism (ThrottlingProcessor) for urllib2. By using them, you can ensure that subsequent GET requests for the same URL return a cached copy instead of causing a roundtrip to the remote server, and/or that subsequent requests to a server are paused for a couple of seconds to avoid overloading it. The test code at the end shows all there is to it.</p>
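<p>The two mechanisms can be sketched independently of urllib2 like this (an illustrative class, not the recipe's actual CacheHandler/ThrottlingProcessor):</p>

```python
import time

class CachedThrottledFetcher:
    """Cache responses per URL and pause between requests to the same
    server -- a distillation of the recipe's two ideas."""
    def __init__(self, fetch, delay=2.0):
        self.fetch = fetch          # callable: url -> body
        self.delay = delay          # seconds to wait between hits on one server
        self.cache = {}
        self.last_request = {}      # server -> time of last real request

    def get(self, url):
        if url in self.cache:
            return self.cache[url]              # cache hit: no roundtrip
        server = url.split('/')[2]              # naive host extraction
        elapsed = time.time() - self.last_request.get(server, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)    # throttle this server
        body = self.fetch(url)
        self.last_request[server] = time.time()
        self.cache[url] = body
        return body

# Demo with a stand-in fetch function instead of a real HTTP request.
calls = []
def fake_fetch(url):
    calls.append(url)
    return 'body of ' + url

fetcher = CachedThrottledFetcher(fake_fetch, delay=0.0)
fetcher.get('http://example.com/a')
fetcher.get('http://example.com/a')   # served from cache; fake_fetch not called again
```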
Write a file to a WebDAV Server (Python)
2005-12-14T20:14:24-08:00
Nick Matsakis
http://code.activestate.com/recipes/users/2476988/
http://code.activestate.com/recipes/464731-write-a-file-to-a-webdav-server/
<p style="color: grey">
Python
recipe 464731
by <a href="/recipes/users/2476988/">Nick Matsakis</a>
(<a href="/recipes/tags/network/">network</a>).
</p>
<p>An extremely simple example of using httplib to write a file to a WebDAV server. This version does not use any authentication mechanism.</p>
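<p>On the wire, writing a file over WebDAV is just an HTTP PUT; a wire-format sketch (the recipe itself simply calls httplib's HTTPConnection.request('PUT', ...); host and path below are illustrative):</p>

```python
def dav_put_request(host, path, body):
    """Build the raw HTTP request a WebDAV PUT sends, with no
    authentication, as in the recipe."""
    head = ('PUT %s HTTP/1.1\r\n'
            'Host: %s\r\n'
            'Content-Length: %d\r\n'
            '\r\n' % (path, host, len(body)))
    return head.encode('ascii') + body

req = dav_put_request('dav.example.com', '/folder/notes.txt', b'hello webdav\n')
```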
urllib2 opener for SSL proxy (CONNECT method) (Python)
2005-11-16T15:04:54-08:00
Alessandro Budai
http://code.activestate.com/recipes/users/2668504/
http://code.activestate.com/recipes/456195-urrlib2-opener-for-ssl-proxy-connect-method/
<p style="color: grey">
Python
recipe 456195
by <a href="/recipes/users/2668504/">Alessandro Budai</a>
(<a href="/recipes/tags/network/">network</a>).
Revision 2.
</p>
<p>This small module builds a urllib2 opener that can be used to make a connection through a proxy using the HTTP CONNECT method (which can be used to proxy SSL connections).
The current urllib2 does not seem to support this method.</p>
Stoppable HTTP server (Python)
2004-11-17T15:08:24-08:00
wurst2
http://code.activestate.com/recipes/users/1981772/
http://code.activestate.com/recipes/336012-stoppable-http-server/
<p style="color: grey">
Python
recipe 336012
by <a href="/recipes/users/1981772/">wurst2</a>
(<a href="/recipes/tags/web/">web</a>).
</p>
<p>Starting a SimpleHTTPServer instance in a separate thread makes it run forever. To solve this problem the server is augmented with a QUIT command; when sent, it makes the server stop serving requests.</p>
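<p>A sketch of the idea in Python 3 terms (the recipe targets Python 2's SimpleHTTPServer; the names here are illustrative). BaseHTTPRequestHandler dispatches any HTTP verb X to a do_X method, so a do_QUIT handler is enough:</p>

```python
import http.server
import threading

class StoppableHandler(http.server.BaseHTTPRequestHandler):
    """Handler with an extra QUIT verb that tells the server to stop."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'ok')

    def do_QUIT(self):
        self.server.running = False   # checked between requests below
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

def serve_until_quit(server):
    server.running = True
    while server.running:
        server.handle_request()  # returns after each request, so QUIT can stop the loop
    server.server_close()
```

<p>Run serve_until_quit in a thread, and stop the server by sending a bare QUIT request to its port.</p>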
simplest useful HTTPS with basic proxy authentication (Python)
2005-12-28T17:27:47-08:00
John Nielsen
http://code.activestate.com/recipes/users/135654/
http://code.activestate.com/recipes/301740-simplest-useful-https-with-basic-proxy-authenticat/
<p style="color: grey">
Python
recipe 301740
by <a href="/recipes/users/135654/">John Nielsen</a>
(<a href="/recipes/tags/network/">network</a>).
Revision 4.
</p>
<p>This is just about the simplest snippet of how to do proxy authentication with SSL using Python. The current httplib only supports SSL through a proxy _without_ authentication. This example does the basic proxy auth that a lot of proxy servers support. It should at least give someone an idea of how to do it, so they can improve it and incorporate it however they want.</p>
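<p>The heart of it is a CONNECT request carrying a Proxy-Authorization header; a wire-format sketch (the user/password values are placeholders):</p>

```python
import base64

def proxy_connect_request(host, port, user, password):
    """Build the raw CONNECT request that opens an SSL tunnel through
    a proxy requiring basic authentication."""
    credentials = base64.b64encode(
        ('%s:%s' % (user, password)).encode('ascii')).decode('ascii')
    return ('CONNECT %s:%d HTTP/1.0\r\n'
            'Proxy-Authorization: Basic %s\r\n'
            '\r\n' % (host, port, credentials)).encode('ascii')

req = proxy_connect_request('www.example.com', 443, 'user', 'pass')
```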
httpExists - find out whether an http reference is valid (Python)
2004-07-12T08:51:49-07:00
James Thiele
http://code.activestate.com/recipes/users/1779700/
http://code.activestate.com/recipes/286225-httpexists-find-out-whether-an-http-reference-is-v/
<p style="color: grey">
Python
recipe 286225
by <a href="/recipes/users/1779700/">James Thiele</a>
(<a href="/recipes/tags/network/">network</a>).
</p>
<p>Quickly find out whether a web file exists.</p>
FedEX Tracking Information (Python)
2003-12-19T14:26:41-08:00
Chris Moffitt
http://code.activestate.com/recipes/users/137137/
http://code.activestate.com/recipes/259097-fedex-tracking-information/
<p style="color: grey">
Python
recipe 259097
by <a href="/recipes/users/137137/">Chris Moffitt</a>
(<a href="/recipes/tags/network/">network</a>).
</p>
<p>This short script allows a user to track the current status of a package sent through FedEx. It is meant to be run from the command line and takes one optional argument (-v) to determine whether it shows all tracking information or just the most recent entry. The user can enter multiple tracking numbers at run time.</p>
Http client to POST using multipart/form-data (Python)
2002-08-23T07:56:39-07:00
Wade Leftwich
http://code.activestate.com/recipes/users/98656/
http://code.activestate.com/recipes/146306-http-client-to-post-using-multipartform-data/
<p style="color: grey">
Python
recipe 146306
by <a href="/recipes/users/98656/">Wade Leftwich</a>
(<a href="/recipes/tags/web/">web</a>).
</p>
<p>A scripted web client that will post data to a site as if from a form using ENCTYPE="multipart/form-data". This is typically used to upload files, but also gets around a server's (e.g. ASP's) limitation on the amount of data that can be accepted via a standard POST (application/x-www-form-urlencoded).</p>
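<p>A condensed sketch of building such a body (the boundary string is illustrative, and the recipe's helper also guesses per-file content types):</p>

```python
def encode_multipart_formdata(fields, files):
    """Build a multipart/form-data body.
    fields: list of (name, value); files: list of (name, filename, value).
    Returns (content_type, body) ready for an HTTP POST."""
    boundary = '----------boundary_$'  # real code should pick an unlikely string
    lines = []
    for name, value in fields:
        lines += ['--' + boundary,
                  'Content-Disposition: form-data; name="%s"' % name,
                  '', value]
    for name, filename, value in files:
        lines += ['--' + boundary,
                  'Content-Disposition: form-data; name="%s"; filename="%s"'
                  % (name, filename),
                  'Content-Type: application/octet-stream',
                  '', value]
    lines += ['--' + boundary + '--', '']
    body = '\r\n'.join(lines)
    content_type = 'multipart/form-data; boundary=%s' % boundary
    return content_type, body
```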
Reverse Lookup Yellow Page (Python)
2002-08-15T16:07:17-07:00
Victor Yang
http://code.activestate.com/recipes/users/627255/
http://code.activestate.com/recipes/145126-reverse-lookup-yellow-page/
<p style="color: grey">
Python
recipe 145126
by <a href="/recipes/users/627255/">Victor Yang</a>
(<a href="/recipes/tags/web/">web</a>).
</p>
<p>Look up a personal/home address from a phone number (US & Canada).
Usage: ryp.py -s <a href="http://sendmail.your.com" rel="nofollow">sendmail.your.com</a> -e <a href="mailto:dude@your.com">dude@your.com</a> -p 416-345-3432
It has built-in email and phone-number validation; the phone number can be given in most common formats.</p>
SSL Client Authentication over HTTPS (Python)
2002-02-28T13:43:27-08:00
Rob Riggs
http://code.activestate.com/recipes/users/217820/
http://code.activestate.com/recipes/117004-ssl-client-authentication-over-https/
<p style="color: grey">
Python
recipe 117004
by <a href="/recipes/users/217820/">Rob Riggs</a>
(<a href="/recipes/tags/web/">web</a>).
</p>
<p>A 16-line Python application that demonstrates SSL client authentication over HTTPS. We also explain the basics of how to set up Apache to require SSL client authentication. This assumes at least Python 2.2 compiled with SSL support, and Apache with mod_ssl.</p>
Check web page exists (Python)
2001-12-06T18:47:01-08:00
andy mckay
http://code.activestate.com/recipes/users/92886/
http://code.activestate.com/recipes/101276-check-web-page-exists/
<p style="color: grey">
Python
recipe 101276
by <a href="/recipes/users/92886/">andy mckay</a>
(<a href="/recipes/tags/cgi/">cgi</a>).
Revision 4.
</p>
<p>For when you need to check a web page is still working.</p>
Grab a part of a web page (Python)
2001-05-26T05:21:46-07:00
Oliver Dissars
http://code.activestate.com/recipes/users/110615/
http://code.activestate.com/recipes/59862-grab-a-part-of-a-web-page/
<p style="color: grey">
Python
recipe 59862
by <a href="/recipes/users/110615/">Oliver Dissars</a>
(<a href="/recipes/tags/web/">web</a>).
</p>
<p>Grab a part of a web page and generate a new page with a base href pointing to the source server, so that relative links still work.</p>
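<p>The core transformation can be sketched in a few lines (assuming the page has a lowercase <head> tag; a robust version would parse the HTML properly):</p>

```python
def add_base_href(html, source_url):
    """Insert a <base href> right after <head> so relative links in the
    grabbed page still resolve against the source server."""
    base_tag = '<base href="%s">' % source_url
    return html.replace('<head>', '<head>' + base_tag, 1)

page = add_base_href('<html><head><title>t</title></head><body></body></html>',
                     'http://www.example.com/dir/')
```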