A very simple script that fetches a webpage for you. Useful as a CGI proxy in situations where web access is restricted - and to illustrate the basic use of urllib2.
#!/usr/bin/python
# v0.01
# cgiproxy.py
# Copyright Michael Foord
# Not for use in commercial projects without permission. (Although permission will probably be given).
# If you use this code in a project then please credit me and include a link back.
# If you release the project then let me know (and include this message with my code !)
# No warranty express or implied for the accuracy, fitness for purpose or otherwise of this code....
# Use at your own risk !!!
# E-mail michael AT foord DOT me DOT uk
# Maintained at www.voidspace.org.uk/atlantibots/pythonutils.html
import sys
import cgi
import urllib2
sys.stderr = sys.stdout
HOMEPAGE = 'www.google.co.uk'
######################################################
def getform(valuelist, theform, notpresent=''):
    """This function, given a CGI form, extracts the data from it, based on
    the valuelist passed in. Any non-present values are set to '' - although
    this can be changed. (e.g. to return None so you can test for missing
    keywords - where '' is a valid answer but to have the field missing
    isn't.)"""
    data = {}
    for field in valuelist:
        if not theform.has_key(field):
            data[field] = notpresent
        else:
            if type(theform[field]) != type([]):
                data[field] = theform[field].value
            else:
                values = map(lambda x: x.value, theform[field])   # allows for list type values
                data[field] = values
    return data
def pagefetch(thepage):
    req = urllib2.Request(thepage)
    u = urllib2.urlopen(req)
    data = u.read()
    return data
###################################################
if __name__ == '__main__':
    form = cgi.FieldStorage()
    data = getform(['url'], form)
    if not data['url']: data['url'] = HOMEPAGE
    print "Content-type: text/html"       # this is the header to the server
    print                                 # so is this blank line
    test = pagefetch('http://' + data['url'])
    print test
Call it with:
http://www.pathtoyourserver.com/cgi-bin/cgiproxy.py?url=www.urltofetch.com
Leave the http:// off the URL to fetch !!
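One thing to watch: if the fetch fails, pagefetch just raises and the browser gets a bare traceback. A minimal sketch of a more forgiving version - pagefetch_safe is a name I've made up, and the wording of the error pages is my own, but urllib2.HTTPError and urllib2.URLError are the real exceptions urlopen raises:

def pagefetch_safe(thepage):
    # hypothetical variant of pagefetch that traps fetch failures,
    # so the CGI script always sends back *some* page instead of a traceback
    try:
        u = urllib2.urlopen(urllib2.Request(thepage))
        data = u.read()
        u.close()
        return data
    except urllib2.HTTPError, e:
        # the server answered, but with an error status (404, 403, etc.)
        return '<html><body>HTTP error %s fetching %s</body></html>' % (e.code, thepage)
    except urllib2.URLError, e:
        # never reached the server (bad hostname, refused connection, etc.)
        return '<html><body>Failed to reach %s : %s</body></html>' % (thepage, e.reason)

(HTTPError is a subclass of URLError, so it has to be caught first.)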
The next stage will involve using regular expressions to parse the page and change any locations (images, links, etc.) so that they go through the proxy as well.
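A rough sketch of how that rewriting might look - PROXY and rewrite_links are my own illustration, and a real rewriter would also have to cope with relative links, unquoted attributes, CSS and JavaScript URLs:

import re

PROXY = '/cgi-bin/cgiproxy.py?url='     # hypothetical path to this script on your server

link_re = re.compile(r'(href|src)\s*=\s*["\']http://([^"\']+)["\']', re.IGNORECASE)

def rewrite_links(page):
    # point absolute http:// links (and image sources) back through the proxy -
    # the url parameter is given without the http://, matching what the script expects
    return link_re.sub(lambda m: '%s="%s%s"' % (m.group(1), PROXY, m.group(2)), page)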
Not a patch on the James Marshall cgiproxy... but it's a start :-)
One advantage of this one: the James Marshall proxy modifies the pages it fetches (using rewriting like that sketched above), which is fine, but it puts your proxy location into every URL. This one fetches an unmodified copy, which can be more useful.