Used together with Mozilla's "DOM Inspector", Mozilla's "View | Page Source" and (say) PythonWin, the parser component of IE can make scraping fairly easy, in large part because the parse tree that it produces is rigorously self-similar from root to leaf. Here's an example that parallels the one offered for ScrapeNFeed.
from win32com.client import Dispatch
from PyRSS2Gen import RSSItem, Guid
import ScrapeNFeed

class ContactPointEvents(ScrapeNFeed.ScrapedFeed):

    def HTML2RSS(self, unused_headers, body):
        # Hand the page to the IE parser component.
        html = Dispatch('htmlfile')
        html.writeln(body)
        items = []
        # Stop at the fourth UL element in the body.
        count = 0
        for item in html.body.all:
            if item.tagName == 'UL':
                count += 1
                if count == 4:
                    break
        # 'item' is the element the loop stopped on; follow that branch.
        theUL = item.all
        for item in theUL:
            if item.tagName == 'LI':
                title = item.childNodes[0].innerText
                # outerHTML is the full markup of the first child node.
                link = item.childNodes[0].outerHTML
                if item.childNodes.length >= 2:
                    description = item.innerText
                else:
                    description = ''
                items.append(RSSItem(title=title, description=description, link=link))
        self.addRSSItems(items)

ContactPointEvents.load("New O'Reilly releases",
                        'http://www.oreilly.com/catalog/new.html',
                        "New O'Reillys",
                        r'new.xml', r'new.pickle',
                        managingEditor='wbell@vex.net (Bill Bell)')
What do I do?
I develop the finished scraping product in SciTE, using PythonWin to make trial forays into the parse tree. I usually find that the coloured source provided by Mozilla helps me most in navigating the parse tree. However, I sometimes also use the search features in the DOM Inspector, especially when it appears that 'name' and 'class' attributes might be helpful for navigating the parse tree.
The code above shows the general pattern. Dispatch the IE parser and fill it with the HTML to be parsed using its 'writeln' method. You can then iterate through the entire collection of elements in the body of the page as html.body.all. When you find a branch that you want to follow, name it, as I have done with 'theUL', and then iterate through that collection.
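The "find the branch you want" step can be pulled out into a small helper. What follows is only a sketch, not part of the recipe: the name nth_element_by_tag is hypothetical, and the stand-in objects mimic nothing of the IE elements beyond their tagName attribute, so that the example runs without Windows.

```python
from types import SimpleNamespace

def nth_element_by_tag(elements, tag_name, n):
    """Return the n-th (1-based) element whose tagName equals tag_name.

    Returns None when the collection holds fewer than n such elements.
    """
    count = 0
    for element in elements:
        if element.tagName == tag_name:
            count += 1
            if count == n:
                return element
    return None

# Stand-in objects (an assumption for illustration): only tagName is mimicked.
fake_all = [SimpleNamespace(tagName=t) for t in ('DIV', 'UL', 'P', 'UL', 'UL')]

second_ul = nth_element_by_tag(fake_all, 'UL', 2)
missing_ul = nth_element_by_tag(fake_all, 'UL', 9)
```

Against the real parser the call would presumably be nth_element_by_tag(html.body.all, 'UL', 4). Returning None when the page has fewer lists than expected lets the caller notice that the page layout has changed, instead of carrying on with whatever element the loop happened to end on.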
Since the branches are self-similar, learning to cope with an element at one level means you've learned to cope with it at any lower level. Furthermore, the same style works in other languages and systems that involve IE.
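One way to see what self-similarity buys you is that a single recursive function can handle every level of the tree. The sketch below is illustrative only: outline and make are hypothetical names, and the stand-in nodes assume nothing more than a tagName attribute and an iterable children collection.

```python
from types import SimpleNamespace

def outline(node, depth=0):
    """Return (depth, tagName) pairs for node and all of its descendants."""
    rows = [(depth, node.tagName)]
    for child in node.children:
        # The same code handles a child exactly as it handled its parent.
        rows.extend(outline(child, depth + 1))
    return rows

def make(tag, *children):
    """Build a stand-in node (an assumption for illustration, not an IE object)."""
    return SimpleNamespace(tagName=tag, children=list(children))

tree = make('UL', make('LI', make('A')), make('LI'))
rows = outline(tree)
```

Nothing in outline depends on how deep in the tree it was called, which is the property the recipe relies on when it treats 'theUL' the same way it treated html.body.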
Beautifulsoup?? Check out BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/ No dependencies, cross-platform, handles bad HTML.
code style. Look for this: http://www.python.org/dev/peps/pep-0008
Thanks. Yes, I know about BeautifulSoup. I never work on anything other than Windows though, for one thing. Beyond that, it _appeared to me_ that the parse tree for BeautifulSoup might not be self-similar in the way that the one that IE produces is. Can someone set me straight?
Thanks. I usually take the extra blanks out in recipes. Forgot this time. My eyesight is poor.