A simple HTML parser, subclassing HTMLParser, that collects a dictionary of 'id': 'text' entries, where 'id' is the value of an element's id attribute and 'text' is the raw content between that element's start and end tags (nested markup included). A nice demonstration of using the HTMLParser class.
from HTMLParser import HTMLParser  # Python 2 module name

class IdParser(HTMLParser):
    ''' Parses HTML and places any elements with an ID attribute in a
    dictionary for later access... '''

    def __init__(self):
        HTMLParser.__init__(self)
        # per-instance state; class-level dicts would be shared by every parser
        self.stacks = dict()    # tag name -> content start offsets of open elements
        self.elements = dict()  # content start offset -> id value
        self.idd = dict()       # id value -> raw content of the element
        self.abspos = self.abspos2 = 0

    def updatepos(self, i, j):
        # overridden to keep track of our pos
        # line number / offset doesn't help too much
        self.abspos = i   # can contain ws
        self.abspos2 = j  # element starts here
        if i >= j:
            return j
        rawdata = self.rawdata
        nlines = rawdata.count("\n", i, j)
        if nlines:
            self.lineno = self.lineno + nlines
            pos = rawdata.rindex("\n", i, j)
            self.offset = j-(pos+1)
        else:
            self.offset = self.offset + j-i
        return j

    def handle_starttag(self, tag, attrs, desired='id'):
        ''' Change desired to something other than 'id'
        to get other unique elements. '''
        # offset of the first character after the start tag
        end = self.abspos2 + len(self.get_starttag_text())
        if tag not in self.stacks:
            self.stacks[tag] = [end]
        else:
            self.stacks[tag].append(end)
        for key, value in attrs:
            if key == desired:
                self.elements[end] = value

    def handle_endtag(self, tag):
        ''' Pop an element from the desired stack and
        extract the data. '''
        if not self.stacks.get(tag):
            return  # stray end tag with no matching start tag
        o = self.stacks[tag].pop()
        if o in self.elements:
            # slice up to abspos2 (where the end tag starts); abspos is the
            # start of any text just before the end tag and would drop that text
            self.idd[self.elements[o]] = self.rawdata[o:self.abspos2]
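The recipe above is written for Python 2. Below is a hedged sketch of the same idea for Python 3, assuming the html.parser module and that the whole document is passed in a single feed() call (the stored offsets index into the parser's current buffer). The desired constructor argument, the tag_start attribute, and the guard against unmatched end tags are choices made for this sketch and are not part of the original recipe.

```python
from html.parser import HTMLParser


class IdParser(HTMLParser):
    """Python 3 sketch: map each id value to the raw markup inside that element."""

    def __init__(self, desired="id"):
        super().__init__()
        self.desired = desired  # attribute to collect (assumption: passed here)
        self.tag_start = 0      # offset where the tag being parsed begins
        self.stacks = {}        # tag name -> content start offsets of open elements
        self.elements = {}      # content start offset -> attribute value
        self.idd = {}           # attribute value -> raw content

    def updatepos(self, i, j):
        # The base parser calls this just before it parses each tag,
        # with j pointing at the '<' that opens the tag.
        self.tag_start = j
        return super().updatepos(i, j)

    def handle_starttag(self, tag, attrs):
        # First offset after the start tag, i.e. where the content begins.
        end = self.tag_start + len(self.get_starttag_text())
        self.stacks.setdefault(tag, []).append(end)
        for key, value in attrs:
            if key == self.desired:
                self.elements[end] = value

    def handle_endtag(self, tag):
        if not self.stacks.get(tag):
            return  # unmatched end tag: nothing to pop
        start = self.stacks[tag].pop()
        if start in self.elements:
            # Content runs from the end of the start tag to the '<' of the end tag.
            self.idd[self.elements[start]] = self.rawdata[start:self.tag_start]


markup = '<div id="main"><p id="a">hello</p> <b>world</b></div>'
parser = IdParser()
parser.feed(markup)
print(parser.idd["a"])
print(parser.idd["main"])
```

Note that the value stored for an outer element is its raw content, so any nested tags are kept as markup rather than stripped.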
There are already a few solutions out there for this kind of task, but I think it is worth finding the one that fits exactly what you want to do. I didn't want to extract the element information into a series of 'subelement' classes. All I wanted was the text contained within each element, and slicing it straight out of the parser's raw data is pretty much the easiest way to get it.
Simplicity, speed, and suitability... All there.
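For comparison, if only the visible text is wanted, with nested tags stripped instead of kept as raw markup, a variant can collect it through handle_data. This is a hypothetical sketch and not part of the recipe: the class name IdTextParser, its attributes, and the result() helper are made up here, and it assumes Python 3's html.parser module.

```python
from html.parser import HTMLParser


class IdTextParser(HTMLParser):
    """Hypothetical variant: map each id to the text inside it, tags stripped."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # one (tag, id or None) entry per open element
        self.text = {}       # id -> list of text chunks seen inside it

    def handle_starttag(self, tag, attrs):
        ident = dict(attrs).get("id")
        self.open_tags.append((tag, ident))
        if ident is not None:
            self.text[ident] = []

    def handle_endtag(self, tag):
        # Unwind to the nearest matching open tag; stray end tags are ignored.
        # Unclosed void elements such as <br> are dropped when a parent closes.
        for index in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[index][0] == tag:
                del self.open_tags[index:]
                break

    def handle_data(self, data):
        # Credit the text to every enclosing element that carries an id.
        for _, ident in self.open_tags:
            if ident is not None:
                self.text[ident].append(data)

    def result(self):
        return {ident: "".join(chunks) for ident, chunks in self.text.items()}


parser = IdTextParser()
parser.feed('<div id="main"><p id="a">hello</p> <b>world</b></div>')
print(parser.result())
```

The trade-off is that this version visits every text chunk once per enclosing id, while the offset-slicing approach does a single slice per element but returns markup along with the text.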
Use BeautifulSoup. I think with BeautifulSoup this is a matter of 3 lines of code. http://www.crummy.com/software/BeautifulSoup/