Used together with Mozilla's "DOM Inspector", Mozilla's "View | Page Source" and (say) PythonWin, the parser component of IE can make scraping fairly easy, in large part because the parse tree it produces is rigorously self-similar from root to leaf. Here's an example that parallels the one offered for ScrapeNFeed.

Python
from win32com.client import Dispatch
from PyRSS2Gen import RSSItem
import ScrapeNFeed


class ContactPointEvents(ScrapeNFeed.ScrapedFeed):

    def HTML2RSS(self, unused_headers, body):
        # Hand the page to IE's parser through its 'htmlfile' COM object.
        html = Dispatch('htmlfile')
        html.writeln(body)
        items = []
        # Find the fourth UL on the page; theUL stays empty if there are fewer.
        theUL = []
        count = 0
        for element in html.body.all:
            if element.tagName == 'UL':
                count += 1
                if count == 4:
                    theUL = element.all
                    break
        # Each LI beneath that UL becomes one RSS item.
        for item in theUL:
            if item.tagName == 'LI':
                first = item.childNodes[0]
                title = first.innerText
                # outerHTML is the first child's full markup, not a bare URL.
                link = first.outerHTML
                if item.childNodes.length >= 2:
                    description = item.innerText
                else:
                    description = ''
                items.append(RSSItem(title=title,
                                     description=description,
                                     link=link))
        self.addRSSItems(items)


ContactPointEvents.load("New O'Reilly releases",
                        'http://www.oreilly.com/catalog/new.html',
                        "New O'Reillys",
                        'new.xml', 'new.pickle',
                        managingEditor='wbell@vex.net (Bill Bell)')

What do I do?

I develop the finished scraping product in SciTE, using PythonWin to make trial forays into the parse tree. I usually find that the coloured source provided by Mozilla helps me most in navigating the parse tree. However, I sometimes also use the search features in the DOM Inspector, especially when it appears that 'name' and 'class' attributes might be helpful for navigating the parse tree.

The code above shows the general pattern. Dispatch the IE parser and fill it with the HTML to be parsed using its 'writeln' method. You can then iterate through the entire collection of elements in the body of the page as html.body.all. When you find a branch that you want to follow, name it, as I have done with 'theUL', and then iterate through that collection.
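The "find a branch, name it, iterate it" step can be tried without IE at hand. What follows is only a sketch under stated assumptions: FakeElement is a hypothetical stand-in that offers just the 'tagName' and 'all' members the recipe relies on, and nth_with_tag is a helper name made up for this illustration, not part of any library. With the real parser you would pass html.body.all instead of body.all.

```python
class FakeElement:
    """Hypothetical stand-in for an IE DOM element: a tagName plus a flat 'all'."""

    def __init__(self, tagName, children=()):
        self.tagName = tagName
        self.children = list(children)

    @property
    def all(self):
        # Mimic IE's 'all' collection: every descendant, in document order.
        found = []
        for child in self.children:
            found.append(child)
            found.extend(child.all)
        return found


def nth_with_tag(elements, tag, n):
    """Return the n-th element (1-based) whose tagName equals tag, or None."""
    count = 0
    for element in elements:
        if element.tagName == tag:
            count += 1
            if count == n:
                return element
    return None


# A tiny page body: two ULs, the second one holding two LIs.
body = FakeElement('BODY', [
    FakeElement('UL', [FakeElement('LI')]),
    FakeElement('P'),
    FakeElement('UL', [FakeElement('LI'), FakeElement('LI')]),
])

# Name the branch of interest, then iterate through its own collection.
the_ul = nth_with_tag(body.all, 'UL', 2)
li_count = sum(1 for element in the_ul.all if element.tagName == 'LI')
print(li_count)
```

Returning None when the page has fewer matching elements than expected makes a changed page layout easy to detect before the second loop runs.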

Since the branches are self-similar, learning to cope with an element at one level means you've learned to cope with it at every lower level. Furthermore, the same style works in other languages and systems that involve IE.
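To make the self-similarity point concrete, here is a rough sketch that swaps in the standard library's xml.dom.minidom for the IE parser, purely so that it runs anywhere. That substitution is an assumption of this example: minidom only accepts well-formed markup and keeps tag names in their original case, whereas the recipe's code compares against upper-case names such as 'UL'. The function name outline is made up for the illustration. The same function is applied unchanged to the root and to a branch.

```python
from xml.dom.minidom import parseString


def outline(node, depth=0):
    """Return (depth, tagName) pairs for node and every element beneath it."""
    rows = [(depth, node.tagName)]
    for child in node.childNodes:
        # Skip text nodes; only element nodes carry a tagName.
        if child.nodeType == child.ELEMENT_NODE:
            rows.extend(outline(child, depth + 1))
    return rows


doc = parseString(
    '<body><ul><li><a href="x">T</a> text</li><li>U</li></ul></body>'
)
root = doc.documentElement

# The same call works on the whole tree...
whole_tree = outline(root)

# ...and on any branch picked out of it.
first_ul = root.getElementsByTagName('ul')[0]
subtree = outline(first_ul)

print(whole_tree)
print(subtree)
```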

4 comments

Chris Clarke 17 years, 10 months ago

Beautifulsoup?? Check out BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/ No dependencies, cross-platform, handles bad HTML.

sgilja 17 years, 10 months ago

Code style. Look at this: http://www.python.org/dev/peps/pep-0008

Bill Bell (author) 17 years, 10 months ago

Thanks. Yes, I know about BeautifulSoup. I never work on anything other than Windows though, for one thing. Beyond that, it _appeared to me_ that the parse tree for BeautifulSoup might not be self-similar in the way that the one that IE produces is. Can someone set me straight?

Bill Bell (author) 17 years, 10 months ago

Thanks. I usually take the extra blanks out in recipes. Forgot this time. My eyesight is poor.