Latest recipes tagged "urllib" (ActiveState Code Recipes)
http://code.activestate.com/recipes/tags/urllib/new/
Last updated: 2014-11-06T18:31:44-08:00

Urllib handler for Amazon S3 buckets (Python)
2014-11-06T18:31:44-08:00 | Andrea Corbellini | http://code.activestate.com/recipes/578957-urllib-handler-for-amazon-s3-buckets/
<p style="color: grey">
Python
recipe 578957
by <a href="/recipes/users/4186880/">Andrea Corbellini</a>
(<a href="/recipes/tags/aws/">aws</a>, <a href="/recipes/tags/s3/">s3</a>, <a href="/recipes/tags/urllib/">urllib</a>).
</p>
<p>This is a handler for the standard <a href="https://docs.python.org/dev/library/urllib.request.html">urllib.request</a> module capable of opening buckets stored on <a href="http://aws.amazon.com/s3/">Amazon S3</a>.</p>
<p>Here is a usage example:</p>
<pre class="prettyprint"><code>>>> from urllib.request import build_opener
>>> opener = build_opener(S3Handler)
>>> response = opener.open('s3://bucket-name/key-name')
>>> response.read()
b'contents'
</code></pre>
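<p>The recipe itself signs requests with AWS credentials. As a rough illustration of how such a handler plugs into urllib.request, here is a minimal sketch that only handles publicly readable buckets by rewriting s3:// URLs to the bucket's HTTPS endpoint; the rewrite scheme and the lack of authentication are assumptions of this sketch, not the recipe's code:</p>
<pre class="prettyprint"><code>import urllib.request

class S3Handler(urllib.request.BaseHandler):
    # Sketch only: serves s3://bucket/key for public buckets by
    # delegating to the plain HTTPS endpoint; no request signing.
    def s3_open(self, req):
        bucket, _, key = req.full_url[len('s3://'):].partition('/')
        return self.parent.open('https://%s.s3.amazonaws.com/%s' % (bucket, key))
</code></pre>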
Music Downloader with Wx GUI! (Python)
2013-11-05T02:52:29-08:00 | Christian Careaga | http://code.activestate.com/recipes/578681-music-downloader-with-wx-gui/
<p style="color: grey">
Python
recipe 578681
by <a href="/recipes/users/4186639/">Christian Careaga</a>
(<a href="/recipes/tags/beautifulsoup/">beautifulsoup</a>, <a href="/recipes/tags/downloader/">downloader</a>, <a href="/recipes/tags/gui/">gui</a>, <a href="/recipes/tags/music/">music</a>, <a href="/recipes/tags/python/">python</a>, <a href="/recipes/tags/urllib/">urllib</a>, <a href="/recipes/tags/urllib2/">urllib2</a>, <a href="/recipes/tags/wxpyton/">wxpyton</a>).
</p>
<p>Just type in a song and the artist, and the program will get the YouTube video, convert it to an mp3, then download it!
It has a high-quality and a medium-quality option, and the user can choose the directory and the name the file is saved under!</p>
<p>This is the first time I've used threads and my second time with wxPython! I used BeautifulSoup for the scraping, which I'm pretty familiar with. I just thought I'd share it with you and see if you have any feedback or suggestions!</p>
<p>Also, you may get an error saying self.convhtml doesn't exist; just wait, then retry.</p>
<p>Here is a link to a screenshot:</p>
<p><a href="http://adf.ly/XJaoU" rel="nofollow">http://adf.ly/XJaoU</a></p>
<p>If you want, you can check out the GitHub page:</p>
<p><a href="http://adf.ly/XGL6P" rel="nofollow">http://adf.ly/XGL6P</a></p>
<p>Also, you will need to make a folder called Files and add a file called dir.txt; in that file, write /Files. This is where the music will be downloaded to!</p>
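<p>For reference, a tiny setup snippet that creates that layout, written from the description above (the paths are the recipe's convention, not code from the recipe):</p>
<pre class="prettyprint"><code>import os

# Create the "Files" folder and the dir.txt the program reads its
# download directory from, per the recipe's notes above.
if not os.path.isdir('Files'):
    os.mkdir('Files')
with open('dir.txt', 'w') as f:
    f.write('/Files')
</code></pre>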
<p>I also made an .exe, so you can just use that; it's easier!
Here:
<a href="http://adf.ly/XRjRH" rel="nofollow">http://adf.ly/XRjRH</a></p>
Music Downloader (Python)
2013-05-25T06:52:51-07:00 | Christian Careaga | http://code.activestate.com/recipes/578530-music-downloader/
<p style="color: grey">
Python
recipe 578530
by <a href="/recipes/users/4186639/">Christian Careaga</a>
(<a href="/recipes/tags/download/">download</a>, <a href="/recipes/tags/downloader/">downloader</a>, <a href="/recipes/tags/music/">music</a>, <a href="/recipes/tags/program/">program</a>, <a href="/recipes/tags/python/">python</a>, <a href="/recipes/tags/python_scripts/">python_scripts</a>, <a href="/recipes/tags/selenium/">selenium</a>, <a href="/recipes/tags/urllib/">urllib</a>, <a href="/recipes/tags/urllib2/">urllib2</a>).
</p>
<p>A Python program I wrote that downloads music from the web.</p>
Multithreading Downloader Class (Python)
2012-07-22T07:44:20-07:00 | Itay Brandes | http://code.activestate.com/recipes/578220-multithreading-downloader-class/
<p style="color: grey">
Python
recipe 578220
by <a href="/recipes/users/4182927/">Itay Brandes</a>
(<a href="/recipes/tags/downloader/">downloader</a>, <a href="/recipes/tags/multithread/">multithread</a>, <a href="/recipes/tags/multithreading/">multithreading</a>, <a href="/recipes/tags/urllib/">urllib</a>, <a href="/recipes/tags/urllib2/">urllib2</a>, <a href="/recipes/tags/urlopen/">urlopen</a>).
</p>
<p>Grabs files from the web using multithreading in an attempt to enhance the transfer rate.</p>
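<p>The recipe's class isn't reproduced here; the following is a minimal sketch of the underlying idea, assuming the server honours HTTP Range requests and reports a Content-Length (function and parameter names are illustrative, and this uses urllib.request where the recipe targets urllib2):</p>
<pre class="prettyprint"><code>import threading
import urllib.request

def download(url, path, threads=4):
    # Split the file into byte ranges and fetch each range in its own thread.
    size = int(urllib.request.urlopen(url).headers['Content-Length'])
    chunk = size // threads
    parts = [None] * threads

    def fetch(i):
        start = i * chunk
        end = size - 1 if i == threads - 1 else start + chunk - 1
        req = urllib.request.Request(url, headers={'Range': 'bytes=%d-%d' % (start, end)})
        parts[i] = urllib.request.urlopen(req).read()

    workers = [threading.Thread(target=fetch, args=(i,)) for i in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    with open(path, 'wb') as f:
        f.write(b''.join(parts))
</code></pre>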
Cosign Handler (Python)
2012-07-18T13:30:10-07:00 | Colin Higgs | http://code.activestate.com/recipes/578217-cosign-handler/
<p style="color: grey">
Python
recipe 578217
by <a href="/recipes/users/4182866/">Colin Higgs</a>
(<a href="/recipes/tags/authentication/">authentication</a>, <a href="/recipes/tags/cosign/">cosign</a>, <a href="/recipes/tags/handler/">handler</a>, <a href="/recipes/tags/urllib/">urllib</a>).
</p>
<p>Handler (Python 3.x urllib.request style) for web pages where cosign authentication is required.</p>
<p>See <a href="http://weblogin.org/">http://weblogin.org/</a> for details of the cosign authentication system.</p>
A Simple Webcrawler (Python)
2012-03-03T02:37:30-08:00 | John | http://code.activestate.com/recipes/578060-a-simple-webcrawler/
<p style="color: grey">
Python
recipe 578060
by <a href="/recipes/users/4181142/">John</a>
(<a href="/recipes/tags/crawler/">crawler</a>, <a href="/recipes/tags/html/">html</a>, <a href="/recipes/tags/page/">page</a>, <a href="/recipes/tags/parser/">parser</a>, <a href="/recipes/tags/scraping/">scraping</a>, <a href="/recipes/tags/urllib/">urllib</a>, <a href="/recipes/tags/urlopen/">urlopen</a>, <a href="/recipes/tags/web/">web</a>).
</p>
<p>This is my simple web crawler. It takes as input a list of seed pages (web URLs) and 'scrapes' each page for all of its absolute-path links (i.e. links beginning with http://) and adds those to a dictionary. The web crawler can then take all the links found in the seed pages and scrape those as well. You can continue scraping as deep as you like; you control how deep you go with the depth variable passed into the WebCrawler class function start_crawling(seed_pages, depth). Think of the depth as the recursion depth (or the number of web pages deep you go before returning back up the tree).</p>
<p>To make this web crawler a little more interesting I added some bells and whistles. I added the ability to pass into the WebCrawler class constructor a regular expression object. The regular expression object is used to "filter" the links found during scraping. For example, in the code below you will see:</p>
<pre class="prettyprint"><code>cnn_url_regex = re.compile('(?<=[.]cnn)[.]com')  # cnn_url_regex is a regular expression object
w = WebCrawler(cnn_url_regex)
</code></pre>
<p>This particular regular expression says:</p>
<p>1) Find the first occurrence of the string '.com'.</p>
<p>2) Then, looking backwards from where '.com' was found, attempt to find '.cnn'.</p>
<p>Why do this?</p>
<p>You can control where the crawler crawls. In this case I am constraining the crawler to operate on webpages within cnn.com.</p>
<p>Another feature I added was the ability to parse a given page looking for specific HTML tags. I chose the <h1> tag as an example. Once an <h1> tag is found, I store all the words found in the tag in a dictionary that gets associated with the page URL.</p>
<p>Why do this?</p>
<p>My thought was that if I scraped the page for text I could eventually use this data for a search engine request. Say I searched for 'Lebron James'. And suppose that one of the pages my crawler scraped found an article that mentions Lebron James many times. In response to a search request I could return the link with the Lebron James article in it.</p>
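<p>As an illustration of that tag-parsing idea, here is a simplified stand-in, not the recipe's MyHTMLParser (the class and names below are this sketch's own), written for the Python 2.7 HTMLParser module the recipe uses:</p>
<pre class="prettyprint"><code>from HTMLParser import HTMLParser  # Python 2, as the recipe requires

class H1WordParser(HTMLParser):
    # Collects the words found inside <h1> tags.
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_h1 = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.words.extend(data.split())

parser = H1WordParser()
parser.feed('<h1>Haiti PM resigns</h1>')
print parser.words  # ['Haiti', 'PM', 'resigns']
</code></pre>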
<p>The web crawler is described in the WebCrawler class. It has 2 functions the user should call:</p>
<p>1) start_crawling(seed_pages,depth)</p>
<p>2) print_all_page_text() # this is only used for debug purposes</p>
<p>The rest of WebCrawler's functions are internal functions that should not be called by the user (think private in C++).</p>
<p>Upon construction, a WebCrawler object creates a MyHTMLParser object. The MyHTMLParser class inherits from the built-in Python class HTMLParser. I use the MyHTMLParser object when searching for the <h1> tag. The MyHTMLParser class creates instances of a helper class named Tag; the Tag class is used to create a "linked list" of tags.</p>
<p>So to get started with WebCrawler make sure to use Python 2.7.2. Enter the code a piece at a time into IDLE in the order displayed below. This ensures that you import libs before you start using them.</p>
<p>Once you have entered all the code into IDLE, you can start crawling the 'interwebs' by entering the following:</p>
<pre class="prettyprint"><code>import re
cnn_url_regex = re.compile('(?<=[.]cnn)[.]com')
w = WebCrawler(cnn_url_regex)
w.start_crawling(['http://www.cnn.com/2012/02/24/world/americas/haiti-pm-resigns/index.html?hpt=hp_t3'], 1)
</code></pre>
<p>Of course you can enter any page you want. But the regular expression object is already setup to filter on <a href="http://cnn.com" rel="nofollow">cnn.com</a>. Remember the second parameter passed into the start_crawling function is the recursion depth.</p>
<p>Happy Crawling!</p>
Improvements of the urllib.URLopen.retrieve() method (Python)
2010-01-16T04:50:07-08:00 | Kévin Gomez | http://code.activestate.com/recipes/577009-improvements-of-the-urlliburlopenretrieve-method/
<p style="color: grey">
Python
recipe 577009
by <a href="/recipes/users/4172815/">Kévin Gomez</a>
(<a href="/recipes/tags/retrieve/">retrieve</a>, <a href="/recipes/tags/urllib/">urllib</a>).
</p>
<p>I improved the urllib.URLopen.retrieve() method so that it can restart a download if it fails and, like wget does (with wget -c), resume where it stopped.
The maximum number of tries can be changed.</p>
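<p>The recipe's code is not reproduced here, but the resume logic it describes looks roughly like this sketch, assuming the server honours HTTP Range requests (names, block size, and the use of Python 3's urllib.request rather than the recipe's urllib are this sketch's choices):</p>
<pre class="prettyprint"><code>import os
import urllib.request

def retrieve(url, path, max_tries=5):
    # On each attempt, resume from however many bytes are already on disk
    # by sending an HTTP Range header, like `wget -c`.
    for attempt in range(max_tries):
        try:
            done = os.path.getsize(path) if os.path.exists(path) else 0
            req = urllib.request.Request(url, headers={'Range': 'bytes=%d-' % done})
            with urllib.request.urlopen(req) as resp, open(path, 'ab') as out:
                while True:
                    block = resp.read(8192)
                    if not block:
                        return path
                    out.write(block)
        except IOError:
            continue  # try again, resuming from the current file size
    raise IOError('download failed after %d tries' % max_tries)
</code></pre>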
Fetch all (new) xkcd strips (Python)
2009-08-15T12:15:17-07:00 | xipe totec | http://code.activestate.com/recipes/576881-fetch-all-new-xkcd-strips/
<p style="color: grey">
Python
recipe 576881
by <a href="/recipes/users/4171453/">xipe totec</a>
(<a href="/recipes/tags/file_download/">file_download</a>, <a href="/recipes/tags/urllib/">urllib</a>).
</p>
<p>Downloads and saves all xkcd strips (with the exception of #404, as it's intentionally left 404...)</p>
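<p>The recipe itself scrapes the site's HTML; here is a sketch of the same task using xkcd's JSON API instead (the skip-#404 rule is the recipe's, while the API calls, names, and file layout are this sketch's choices):</p>
<pre class="prettyprint"><code>import json
import os
import urllib.request

def _get_json(url):
    return json.loads(urllib.request.urlopen(url).read().decode('utf-8'))

def fetch_all_strips(dest='xkcd'):
    # Walk every strip number up to the latest, skipping the deliberate 404.
    if not os.path.isdir(dest):
        os.mkdir(dest)
    latest = _get_json('https://xkcd.com/info.0.json')['num']
    for num in range(1, latest + 1):
        if num == 404:  # intentionally left 404
            continue
        info = _get_json('https://xkcd.com/%d/info.0.json' % num)
        target = os.path.join(dest, os.path.basename(info['img']))
        if not os.path.exists(target):  # only fetch strips we don't have yet
            urllib.request.urlretrieve(info['img'], target)
</code></pre>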