Used together with Mozilla's "DOM Inspector", Mozilla's "View | Page Source" and an interactive shell such as PythonWin, the parser component of IE can make scraping fairly easy, in large part because the parse tree that it produces is rigorously self-similar from root to leaf. Here's an example that parallels the one offered for ScrapeNFeed.
```python
from win32com.client import Dispatch
from PyRSS2Gen import RSSItem, Guid
import ScrapeNFeed


class ContactPointEvents(ScrapeNFeed.ScrapedFeed):

    def HTML2RSS(self, unused_headers, body):
        # Dispatch the IE parser and hand it the page to parse.
        html = Dispatch('htmlfile')
        html.writeln(body)
        items = []

        # Walk every element in the body and stop at the fourth UL.
        count = 0
        for item in html.body.all:
            if item.tagName == 'UL':
                count += 1
                if count == 4:
                    break

        # Name the branch to follow, then iterate through it the same way.
        theUL = item.all
        for item in theUL:
            if item.tagName == 'LI':
                title = item.childNodes[0].innerText
                link = item.childNodes[0].outerHTML
                if item.childNodes.length >= 2:
                    description = item.innerText
                else:
                    description = ''
                items.append(RSSItem(title=title,
                                     description=description,
                                     link=link))
        self.addRSSItems(items)


ContactPointEvents.load("New O'Reilly releases",
                        'http://www.oreilly.com/catalog/new.html',
                        "New O'Reillys",
                        r'new.xml', r'new.pickle',
                        managingEditor='email@example.com (Bill Bell)')
```
What do I do?
I develop the finished scraping product in SciTE, using PythonWin to make trial forays into the parse tree. I usually find that the coloured source provided by Mozilla helps me most in navigating the parse tree. However, I sometimes also use the search features in the DOM Inspector, especially when it appears that 'name' and 'class' attributes might be helpful for navigating the parse tree.
The code above shows the general pattern. Dispatch the IE parser and fill it with the HTML to be parsed using its 'writeln' method. You can then iterate through the entire collection of elements in the body of the page as html.body.all. When you find a branch that you want to follow, give it a name, as I have done with 'theUL', and then iterate through that collection.
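The "find the fourth UL" step in the code above can be factored into a small helper. What follows is only a minimal sketch and not part of ScrapeNFeed or pywin32: `nth_by_tag` is a hypothetical name, and `FakeElement` is a stand-in so that the example runs without IE. It assumes only what the code above already relies on, namely that a collection can be iterated and that each element has a `tagName`.

```python
from collections import namedtuple

# Stand-in for an IE element, present only so this sketch runs anywhere.
# Real elements come from Dispatch('htmlfile'), as in the code above.
FakeElement = namedtuple('FakeElement', ['tagName', 'innerText'])


def nth_by_tag(collection, tag_name, n):
    """Return the n-th element (counting from 1) whose tagName matches, or None."""
    count = 0
    for element in collection:
        if element.tagName == tag_name:
            count += 1
            if count == n:
                return element
    return None


elements = [
    FakeElement('UL', 'first list'),
    FakeElement('P', 'a paragraph'),
    FakeElement('UL', 'second list'),
]
second_ul = nth_by_tag(elements, 'UL', 2)   # the second UL in the collection
missing = nth_by_tag(elements, 'UL', 4)     # fewer than four ULs: None
```

With the real parser the call would be `nth_by_tag(html.body.all, 'UL', 4)`. Returning None when the page has fewer lists than expected makes that case explicit; the loop in the code above would instead carry on with whichever element it happened to reach last.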
Since the branches are self-similar, learning to cope with an element at one level means you've learned to cope with it at every lower level. Furthermore, the same style works in other languages and systems that involve IE.
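One way to see the self-similarity is that a single recursive function can handle a node at any depth. The sketch below is illustrative only: `outline` and `FakeNode` are hypothetical names, and `FakeNode` is a stand-in so that the example runs without IE. It assumes that each node exposes `nodeName` and that `childNodes` can be iterated in the same way as `all`.

```python
class FakeNode:
    """Stand-in for an IE DOM node, present only so this sketch runs without IE."""

    def __init__(self, nodeName, childNodes=()):
        self.nodeName = nodeName
        self.childNodes = list(childNodes)


def outline(node, depth=0):
    """Return a list of indented node names for node and all its descendants."""
    lines = ['  ' * depth + node.nodeName]
    for child in node.childNodes:
        # The same function handles every level of the tree.
        lines.extend(outline(child, depth + 1))
    return lines


tree = FakeNode('UL', [
    FakeNode('LI', [FakeNode('A')]),
    FakeNode('LI'),
])
tree_outline = outline(tree)
```

Printing `'\n'.join(outline(html.body))` during a trial session in PythonWin would, under the same assumptions, give a quick indented view of the page to compare against the DOM Inspector.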