ActiveState Code

Recipe 391929: Access password-protected web applications for scraping.


Use John J. Lee's ClientCookie and ClientForm modules to easily access password-protected web applications. A group on http://yahoo.com is used as an example.

Python
import sys
sys.path.append('ClientCookie-1.0.3')
import ClientCookie
sys.path.append('ClientForm-0.1.17')
import ClientForm

# Create a cookie jar, plus a URL opener that stores cookies in it
# and sends a browser-like User-Agent header
cookieJar = ClientCookie.CookieJar()

opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cookieJar))
opener.addheaders = [("User-agent","Mozilla/5.0 (compatible)")]
ClientCookie.install_opener(opener)
fp = ClientCookie.urlopen("http://login.yahoo.com")
forms = ClientForm.ParseResponse(fp)
fp.close()

# print forms on this page
for form in forms: 
    print "***************************"
    print form

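# The first form on the page is the login form
form = forms[0]
form["login"]  = "yahoo-user-id" # use your userid
form["passwd"] = "password"      # use your password
# form.click() builds the request that submits the form;
# cookies set by the response are kept in cookieJar
fp = ClientCookie.urlopen(form.click())
fp.close()
# The installed opener sends those cookies back, so the protected page loads
fp = ClientCookie.urlopen("http://groups.yahoo.com/group/mygroup") # use your group
lines = fp.readlines()  # page content, ready for scraping
fp.close()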

Discussion

Many web applications require the user to fill out a login form. This recipe shows a very easy way to do it in Python so that you can get data from the site for scraping purposes.
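To make the form-handling step more concrete, here is a minimal sketch of what a form parser has to recover from a login page: the form's action, its method, and every named input, including hidden ones. It uses only `html.parser` from the Python 3 standard library rather than ClientForm, and the class name, helper name, and sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect each <form>'s action, method and named <input> values."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form>, in document order
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").upper(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            # Keep hidden fields too: login forms often depend on them.
            self._current["fields"][attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def parse_forms(html):
    """Return a list of form descriptions found in the HTML text."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.forms


# Hypothetical login page, invented for this example.
SAMPLE_PAGE = """
<form action="/login" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="passwd">
  <input type="submit" value="Sign in">
</form>
"""

forms = parse_forms(SAMPLE_PAGE)
form = forms[0]
form["fields"]["login"] = "yahoo-user-id"  # use your userid
form["fields"]["passwd"] = "password"      # use your password
print(form)
```

Filling in only the two credential fields and leaving the rest untouched mirrors what the recipe does with `form["login"]` and `form["passwd"]`. A real parser such as ClientForm also handles select lists, checkboxes, and other controls that this sketch ignores.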

I simply establish a logged-in session with a site (http://groups.yahoo.com) that requires you to fill out a form; the cookie jar carries the login from one request to the next. The recipe should be easily adaptable to other sites such as eBay or PayPal. The task is easy using John J. Lee's ClientCookie and ClientForm modules.

I downloaded the modules from:

http://wwwsearch.sourceforge.net/ClientCookie/src/ClientCookie-1.0.3.tar.gz
http://wwwsearch.sourceforge.net/ClientForm/src/ClientForm-0.1.17.tar.gz

After untarring the tar.gz files, I used the Python code above to access my Yahoo account (Look, Ma! No browser!)

I am using Python version 2.3.4 on Fedora Core 3.
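On later Python versions the cookie handling that ClientCookie provides is available in the standard library (`cookielib` from Python 2.4, `http.cookiejar` in Python 3). The following is a rough Python 3 sketch of the same opener setup, not a drop-in replacement for the recipe. The helper names, the example.com URL, and the field names are assumptions for illustration, and nothing is sent over the network.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session_opener(user_agent="Mozilla/5.0 (compatible)"):
    """Return (opener, cookie_jar); the jar keeps cookies between requests."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    opener.addheaders = [("User-agent", user_agent)]
    return opener, cookie_jar


def build_login_request(action_url, fields):
    """Encode the form fields as a POST body, as a browser does on submit."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(action_url, data=body)


opener, cookie_jar = build_session_opener()

# Hypothetical URL and field names; substitute the values of the real form.
request = build_login_request(
    "https://example.com/login",
    {"login": "yahoo-user-id", "passwd": "password"})

# Calling opener.open(request) would submit the login. Cookies set by the
# response are stored in cookie_jar and sent again on later opener.open()
# calls, which is what keeps the session logged in.
```

Passing `data` to `Request` is what turns the request into a POST. Any hidden fields on the real login form would need to be included in `fields` as well.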

Note that this kind of form-based authentication is nothing like HTTP basic authentication. Therefore, you can't simply put the username and password in the URL as in:

http://username:password@login.yahoo.com # This does not work.

Refer to Mike Foord's recipe at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305288 to find out how to access sites that use HTTP basic authentication.
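For contrast, the sketch below shows what HTTP basic authentication puts on the wire: the credentials are base64-encoded into an `Authorization` header that accompanies each request, with no login form or cookie involved. It is written for Python 3, and the helper name is an assumption for illustration; the credentials are the well-known sample pair from the HTTP authentication specification.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value for HTTP basic authentication."""
    credentials = "%s:%s" % (username, password)
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token


print(basic_auth_header("Aladdin", "open sesame"))
```

A site that uses form-based login ignores this header, which is why the recipe has to submit the form and keep the resulting cookies instead.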

All kudos to John J. Lee

Comments

  1. At 6:45 a.m. on 26 Mar 2005, George Geller (the author) said:

    SSL support is required. This recipe requires the socket library to be compiled with SSL support. See http://docs.python.org/lib/module-httplib.html. SSL support is already in the Python install on Fedora Core 3 (and, I am guessing, in other Linux installs as well). Based on a report by a Python user under Windows, at least one version of Python for Windows does not have SSL support.

    Without SSL support you get an exception that looks, in part, like:

    File "C:\Python24\lib\urllib2.py", line 1053, in unknown_open
      raise URLError('unknown url type: %s' % type)
    URLError: urlopen error unknown url type: https

    George

  2. At 9:09 a.m. on 4 Mar 2009, Lorne Walker said:

    This worked like a charm for logging in to Yahoo! with Python. Thanks!
