
Quickly find out whether a web file exists.

Python, 40 lines
"""
httpExists.py

A quick and dirty way to check whether a web file is there.

Usage:
>>> from httpExists import *
>>> httpExists('http://www.python.org/')
1
>>> httpExists('http://www.python.org/PenguinOnTheTelly')
Status 404 Not Found : http://www.python.org/PenguinOnTheTelly
0
"""

import httplib
import urlparse

def httpExists(url):
    host, path = urlparse.urlsplit(url)[1:3]
    path = path or "/"                         ## an empty path is not a valid request target
    found = 0
    try:
        connection = httplib.HTTPConnection(host)  ## Make HTTPConnection Object
        connection.request("HEAD", path)
        responseOb = connection.getresponse()      ## Grab HTTPResponse Object

        if responseOb.status == 200:
            found = 1
        else:
            print "Status %d %s : %s" % (responseOb.status, responseOb.reason, url)
    except Exception, e:
        print e.__class__,  e, url
    return found

def _test():
    import doctest, httpExists
    return doctest.testmod(httpExists)

if __name__ == "__main__":
    _test()

I needed to check whether some URLs were valid, and I didn't need all the functionality of webchecker.py, so I wrote this little recipe.

The URL must start with 'http://' due to the way urlparse.urlsplit() interprets URLs.
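
To see why, compare what urlparse.urlsplit() returns with and without the scheme (a quick interactive sketch; the host and path are items 1 and 2 of the result):

>>> import urlparse
>>> urlparse.urlsplit('http://www.python.org/about')[1:3]
('www.python.org', '/about')
>>> urlparse.urlsplit('www.python.org/about')[1:3]
('', 'www.python.org/about')

Without the scheme, the host ends up in the path slot and httpExists tries to connect to an empty host.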

2 comments

Rogier Steehouder 19 years, 9 months ago

Catch "302: Moved temporarily" I added this to allow for 302 responses, which are processed automatically by most (all?) browsers.

elif responseOb.status == 302:
    found = httpExists(urlparse.urljoin(url, responseOb.getheader('location', '')))
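
For context, here is a sketch of how that clause slots into the recipe's status check (note that the recursion has no depth limit, a point the next comment picks up):

if responseOb.status == 200:
    found = 1
elif responseOb.status == 302:
    # follow the redirect; urljoin copes with relative Location headers
    found = httpExists(urlparse.urljoin(url, responseOb.getheader('location', '')))
else:
    print "Status %d %s : %s" % (responseOb.status, responseOb.reason, url)
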
Sam Peterson 19 years, 1 month ago

Hmmm... Part of me thinks that either httplib should have stock code for handling 300 status code redirections, or urllib should handle HEAD requests. The fact that this isn't provided by the standard libraries is crappy in my opinion.

Here's what I use to handle redirection. Recursion's a bad idea; it should be an iterative loop with a limit to avoid infinite redirection.

def head_url(url):
    """Perform HEAD, may throw socket errors"""

    import httplib, urlparse

    def _head(url):
        """Returns a http response object"""

        host, path = urlparse.urlparse(url)[1:3]

        connection = httplib.HTTPConnection(host)
        connection.request("HEAD", path)
        return connection.getresponse()

    # redirection limit, default of 10
    redirect = 10

    # Perform HEAD
    resp = _head(url)

    # check for redirection
    while (resp.status >= 300) and (resp.status <= 399):
        # tick the redirect
        redirect -= 1

        # if redirect is 0, we tried :-(
        if redirect == 0:
            # we hit our redirection limit, raise exception
            raise IOError, (0, "Hit redirection limit")

        # Perform HEAD on the redirect target; urljoin copes with
        # relative Location headers
        url = urlparse.urljoin(url, resp.getheader('location', ''))
        resp = _head(url)

    if resp.status >= 200 and resp.status <= 299:
        # hooray!  We found what we were looking for.
        return (resp.status, url, resp.reason)

    else:
        # Status unsure; might be 404, 500, 401, 403.  Raise an error
        # with the actual status code.
        raise IOError, (resp.status, url, resp.reason)
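
A usage sketch (the calls are hypothetical; head_url returns a (status, url, reason) tuple on success and raises IOError otherwise, and plain socket errors from _head can also propagate):

try:
    status, final_url, reason = head_url('http://www.python.org/')
    print "Found %d %s : %s" % (status, reason, final_url)
except IOError, e:
    print "Lookup failed:", e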