
Downloads and saves all xkcd strips (with the exception of #404, which is intentionally left as a 404 page...)

Python, 32 lines
import urllib
import re
import os

path = "set your path or '.'"
i = 1
content = True
while content:
    # Resume support: start just past the highest-numbered strip
    # already saved in the download directory.
    dir = os.listdir(path)
    for file in dir:
        if file[:-4].isdigit():
            if int(file[:-4]) >= i:
                i = int(file[:-4]) + 1
    while content:
        url = "http://www.xkcd.com/" + str(i) + "/"
        rd = urllib.urlopen(url)
        data = rd.read()
        # The dot before the extension is escaped so it matches a
        # literal "." rather than any character.
        res = re.search(r"/comics/[a-z0-9_()]*\.(jpg|png)", data)
        if res:
            imgurl = "http://imgs.xkcd.com" + res.group()
            image = urllib.URLopener()
            image.retrieve(imgurl, os.path.join(path, str(i) + imgurl[-4:]))
        else:
            # A real 404 means we are past the newest strip; #404 itself
            # is intentionally missing, so skip it rather than stopping.
            if re.search("Not Found", data) and i != 404:
                content = False
        i += 1

4 comments

Charlie Clark 14 years, 8 months ago

Have you looked at wget?

This isn't really a recipe, and because it doesn't document anything, it's not very helpful for others.

A couple of comments:

  • best to put module-level code in a function which can be called using if __name__ == '__main__'

  • use explicit path manipulation.

  • the initial while loop could make better use of the os.path module for dealing with filenames (see the sketch after this list). You can always count how many files you have if you want to restart.

  • it's not a good idea to shadow Python's built-ins such as dir or file

  • the main loop can be better expressed with a for/continue/break construction

  • you can catch a 404 directly by reading the response headers, without downloading the page (also sketched below)

  • the regular expression will be compiled each time; better to compile a pattern once and search with it. You might also want to look at SGMLParser or lxml for parsing HTML without having to worry about writing your own regular expressions.

  • urllib.urlretrieve() is simpler
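For instance, a minimal sketch of the header check and the os.path-based resume scan (comic_exists and next_strip are hypothetical names; the host comes from the recipe, and a redirect would need extra handling):

import os
import httplib

def comic_exists(number):
    # A HEAD request returns only the status line and headers, so we
    # learn whether the strip exists without downloading the page body.
    conn = httplib.HTTPConnection("www.xkcd.com")
    conn.request("HEAD", "/%d/" % number)
    status = conn.getresponse().status
    conn.close()
    return status == 200

def next_strip(path):
    # os.path.splitext makes the resume scan clearer than slicing file[:-4].
    numbers = [int(os.path.splitext(name)[0])
               for name in os.listdir(path)
               if os.path.splitext(name)[0].isdigit()]
    return max(numbers) + 1 if numbers else 1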

My version

import os
import re
import urllib

PAGE_URL = "http://www.xkcd.com/"
COMIC_URL = "http://imgs.xkcd.com"  # no trailing slash: the matched path starts with "/"
DOWNLOAD_PATH = "."  # set your download directory

def download():
    # Start just past the strips already on disk.
    count = len(os.listdir(DOWNLOAD_PATH))
    pattern = re.compile(r"/comics/[a-z0-9_()]*\.(jpg|png)")
    while True:
        count += 1
        if count == 404:  # strip #404 is intentionally missing
            count += 1
        print count
        page = urllib.urlopen("%s%s" % (PAGE_URL, count))
        if "Last-Modified" not in page.headers:
            # no comic
            break
        res = pattern.search(page.read())
        if res:
            match = res.group()
            filename = os.path.basename(match)
            urllib.urlretrieve(COMIC_URL + match,
                               os.path.join(DOWNLOAD_PATH, filename))

if __name__ == '__main__':
    download()
Chris Jones 14 years, 8 months ago

A couple comments not really related to Python:

  1. This is really rude behavior on the net. Introduce a sleep between fetches when mirroring someone's content, at the very least; you can do this in your script with time.sleep(SLEEP_TIME), and even a second or two helps. This is known as the Not Being A Dick design pattern. Without it, your script will slam the XKCD website while not viewing any of the ads that pay for the bandwidth you are consuming, and it will place unnecessary load on the webserver.

  2. Your script misses the alt-text. I'm not sure what you're doing with a mirror of XKCD comics (if you're putting them up with your own ads somewhere, you're just an a-hole, seriously), but half the punchline is contained in the alt-text of this site. I would parse out the alt-text and dump it in a text file that indexes image# to alt-text.

  3. This is better done with wget. I love Python, but it's not the best solution for everything. Why re-invent the wheel?

  4. Use BeautifulSoup to parse HTML. I realize this is a very simple example, making a DOM parser overkill. However, this rapidly ceases to be the case for anything more complicated. I feel stuff posted on a site like ActiveState should at least demonstrate best practices; otherwise I'm not sure what the point is. Using regular expressions to scrape HTML is decidedly not best practice. (A combined sketch of points 1, 2 and 4 follows.)
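To illustrate points 1, 2 and 4 together, here is a minimal sketch assuming the BeautifulSoup 3 import style of the era; fetch_comic is a hypothetical helper and SLEEP_TIME is the name suggested above:

import re
import time
import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

SLEEP_TIME = 2  # seconds between fetches: be polite to the server

def fetch_comic(number):
    # Parse the page with a real HTML parser instead of regular expressions.
    html = urllib.urlopen("http://www.xkcd.com/%d/" % number).read()
    soup = BeautifulSoup(html)
    img = soup.find("img", src=re.compile("/comics/"))
    if img is None:
        return None
    # xkcd keeps the hover-text punchline in the img tag's title attribute.
    return img["src"], img.get("title", "")

for n in (1, 2, 3):
    found = fetch_comic(n)
    if found:
        src, alt = found
        print n, src, alt
    time.sleep(SLEEP_TIME)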

xipe totec (author) 14 years, 8 months ago

Your comments are appreciated. I'm moving over to Python from C# and PHP (which you can probably tell from the awkward while nesting). I'm also new to the ActiveState recipes, and I'm sorry that I didn't comment the code as well as I should have.

Your comments were very helpful anyway.

@Charlie Clark: All very good suggestions; most of them I should have figured out myself or were due to laziness, but especially the Python specifics I will try to incorporate in the future.

@Chris Jones: Agreed about taking it easy on the server. It's just for my personal pleasure (not even sure I will use it anymore; I just wanted to code up some quick Python).

Charlie Clark 14 years, 8 months ago

@Lord of the Flayed Hide: this website cannot replace mailing lists, newsgroups and forums for finding out about Python. Laziness has its place in Python: good Python programming lets us be lazy. I heartily recommend you get hold of the print copy of this site, as the chapter introductions and discussions contain the distilled wisdom of many Python luminaries; it will save you lots of time and give you that smug, Pythonic feeling when looking at other newbie code.