This recipe is about using previously identified information in a web page (i.e., state information) to decide how to use newly identified information. To view a page of the kind that can be scraped with the code below, visit http://www.archives.ca/02/02012202_e.html, select "Ontario", enter "Cornwall" in the Geographic Location box, and select "MAX" in the Number of References per page list.
Each page served by the census site offers two kinds of document images: schedule 1 and schedule 2. Only the schedule 1 documents provide the information I am interested in at present (surnames, birthdates, and so on). I would therefore like to extract the information that identifies schedule 1 images and ignore the rest.
Put in terms of state: when my script has most recently seen HTML indicating schedule 1, I want it to extract the URLs in the "option" tags; when it has most recently seen schedule 2, I want it to ignore them. One of the simplest ways to do this is to build a regular expression (RE) that alternates an RE recognising schedule numbers with an RE recognising the URLs, then pass the combined RE to a "sub" function so that each match can be processed in a purpose-built handler.
Incidentally, I have found that Phil Schwartz's "Kodos" Python Regex Debugger makes it a lot faster to create and check REs. Many thanks, Phil!
from re import compile, sub, DOTALL, IGNORECASE

RE = (
    r'(?:Schedule: </B></TD><TD><font face="Arial, Helvetica, sans-serif" size="2">(?P<schedule>\d+))'
    r'|'
    r'(?:http://data6\.archives\.ca/exec/getSID\.pl\?f=(?P<URL>.{15}))'
)
# The flags must be combined with "|", not "or":
# "DOTALL or IGNORECASE" evaluates to DOTALL alone.
compiledRE = compile(RE, DOTALL | IGNORECASE)

schedule = None

def handleMatch(match):
    # Remember the most recently seen schedule number: this is the state.
    global schedule
    if match.group('schedule'):
        schedule = match.group('schedule')
    # Emit the URL only while the current state is schedule 1.
    elif match.group('URL') and schedule == '1':
        print(match.group('URL'))
    return ''

htm = open('cornwall.htm').read()
sub(compiledRE, handleMatch, htm)
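To see the state-based alternation at work without fetching the census site or saving cornwall.htm, here is a minimal self-contained sketch. The sample text and the names SAMPLE, PATTERN, and schedule1_urls are invented for illustration; they only loosely mirror the structure of the real pages:

```python
import re

# Invented sample mimicking the page structure the recipe targets:
# a schedule marker followed by the image URLs that belong to it.
SAMPLE = """
Schedule: 1
http://example.com/getSID.pl?f=image_0001.jpg
http://example.com/getSID.pl?f=image_0002.jpg
Schedule: 2
http://example.com/getSID.pl?f=image_0003.jpg
Schedule: 1
http://example.com/getSID.pl?f=image_0004.jpg
"""

# One RE alternating a state marker with the data of interest.
PATTERN = re.compile(
    r"(?:Schedule: (?P<schedule>\d+))"
    r"|"
    r"(?:getSID\.pl\?f=(?P<url>\S+))"
)

def schedule1_urls(text):
    """Collect URLs seen while the most recent schedule marker was '1'."""
    schedule = None          # the state
    found = []
    def handle(match):
        nonlocal schedule
        if match.group("schedule"):
            schedule = match.group("schedule")   # update the state
        elif schedule == "1":
            found.append(match.group("url"))     # use the state
        return ""
    PATTERN.sub(handle, text)
    return found

print(schedule1_urls(SAMPLE))
# → ['image_0001.jpg', 'image_0002.jpg', 'image_0004.jpg']
```

The sketch collects results in a list instead of printing inside the handler, which makes the technique easier to test, but the mechanism is the same: the handler updates the state when the first alternative matches and consults it when the second one does.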
Scraping pages is something some of us do fairly often. Simple patterns that match URLs or email addresses in isolation are usually insufficient: they ignore contextual information and so pick up many items that are not useful.