A simple HTML parser, subclassing HTMLParser, that collects a dictionary of 'id': 'text' entries, where 'id' is the value of an element's id attribute and 'text' is the raw content between that element's start and end tags. A nice demonstration of using the HTMLParser class.

Python
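try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class IdParser(HTMLParser):
    ''' Parses HTML and places the content of any element with an ID
    attribute in a dictionary (self.idd) for later access.

    Feed the whole document in a single feed() call: the recorded
    offsets index into self.rawdata, which is trimmed between calls. '''

    def __init__(self, desired='id'):
        ''' Change desired to something other than 'id'
            to key on a different attribute. '''
        HTMLParser.__init__(self)
        self.desired = desired
        # Per-instance state; class-level dicts would be shared by
        # every IdParser instance.
        self.stacks = dict()    # tag name -> stack of content start offsets
        self.elements = dict()  # content start offset -> attribute value
        self.idd = dict()       # attribute value -> raw content of the element
        self.abspos = 0
        self.abspos2 = 0

    def updatepos(self, i, j):
        # overridden to keep track of our absolute position in rawdata;
        # line number / offset doesn't help too much
        self.abspos = i   # end of the previous token, text may follow
        self.abspos2 = j  # the next tag starts here
        # the base class still maintains lineno and offset
        return HTMLParser.updatepos(self, i, j)

    def handle_starttag(self, tag, attrs):
        ''' Remember where the content of this element begins. '''

        # offset just past the start tag
        end = self.abspos2 + len(self.get_starttag_text())
        self.stacks.setdefault(tag, []).append(end)

        for key, value in attrs:
            if key == self.desired:
                self.elements[end] = value

    def handle_endtag(self, tag):
        ''' Pop an element from the desired stack and
            extract the data. '''

        stack = self.stacks.get(tag)
        if not stack:
            return  # stray end tag with no matching start tag

        o = stack.pop()
        if o in self.elements:
            # abspos2 is where this end tag starts. Slicing up to abspos
            # instead would drop any text directly before the end tag.
            self.idd[self.elements[o]] = self.rawdata[o:self.abspos2]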

There are already a few solutions out there for this type of thing, but I think it is worth finding the one that fits exactly what you want to do. I didn't want to extract the element information into a series of 'subelement' classes. All I wanted was the content of each element that has an id, and slicing it straight out of the parser's raw buffer is pretty much the easiest way to get that.

Simplicity, speed, and suitability... All there.
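
To make the behaviour concrete, here is a minimal, self-contained Python 3 sketch of the same offset-slicing idea, followed by a short usage example. The class name CompactIdParser and the attribute names tag_start and ids are hypothetical, chosen only for this illustration. The sketch assumes the whole document is passed in one feed() call, and it slices up to the start of the end tag so that text directly before the end tag is included.

```python
from html.parser import HTMLParser


class CompactIdParser(HTMLParser):
    """Condensed illustration of the recipe's idea (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.tag_start = 0  # offset in rawdata where the current tag begins
        self.stacks = {}    # tag name -> stack of content start offsets
        self.ids = {}       # content start offset -> id value
        self.idd = {}       # id value -> raw markup inside the element

    def updatepos(self, i, j):
        # The base parser calls this with j at the start of the next tag.
        self.tag_start = j
        return super().updatepos(i, j)

    def handle_starttag(self, tag, attrs):
        # Content begins right after the text of the start tag.
        content_start = self.tag_start + len(self.get_starttag_text())
        self.stacks.setdefault(tag, []).append(content_start)
        for key, value in attrs:
            if key == "id":
                self.ids[content_start] = value

    def handle_endtag(self, tag):
        stack = self.stacks.get(tag)
        if not stack:
            return  # unmatched end tag, nothing to extract
        content_start = stack.pop()
        if content_start in self.ids:
            content = self.rawdata[content_start:self.tag_start]
            self.idd[self.ids[content_start]] = content


html = '<div id="box"><p id="greeting">Hello, <b>world</b>!</p></div>'
parser = CompactIdParser()
parser.feed(html)  # a single feed() call keeps the offsets valid
parser.close()

print(parser.idd["greeting"])  # Hello, <b>world</b>!
print(parser.idd["box"])       # <p id="greeting">Hello, <b>world</b>!</p>
```

Note that the stored value is the raw markup between the tags, nested tags included, not just the visible text.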

1 comment

Foo Bear 15 years, 6 months ago

Use BeautifulSoup. I think with BeautifulSoup this is a matter of three lines of code. http://www.crummy.com/software/BeautifulSoup/

Created by James Kassemi on Thu, 8 Jun 2006 (PSF)