A simple HTML parser, subclassing HTMLParser, that collects a dictionary of 'id': 'text' pairs, where 'id' is the value of an element's id attribute and 'text' is the content found inside that element. A nice demonstration of using the HTMLParser class.
from HTMLParser import HTMLParser

class IdParser(HTMLParser):
    '''
    Parses HTML and places any elements with an ID attribute
    in a dictionary for later access...
    '''
    stacks = dict()
    elements = dict()
    idd = dict()

    def updatepos(self, i, j):
        # overridden to keep track of our pos
        # line number / offset doesn't help too much
        self.abspos = i   #can contain ws
        self.abspos2 = j  #element starts here
        if i >= j:
            return j
        rawdata = self.rawdata
        nlines = rawdata.count("\n", i, j)
        if nlines:
            self.lineno = self.lineno + nlines
            pos = rawdata.rindex("\n", i, j)
            self.offset = j-(pos+1)
        else:
            self.offset = self.offset + j-i
        return j

    def handle_starttag(self, tag, attrs, desired='id'):
        '''
        Change desired to something other than 'id'
        to get other unique elements.
        '''
        end = self.abspos2 + len(self.get_starttag_text())
        if not self.stacks.has_key(tag):
            self.stacks[tag] = [end]
        else:
            self.stacks[tag].append(end)
        for key, value in attrs:
            if key == desired:
                self.elements[end] = value

    def handle_endtag(self, tag):
        '''
        Pop an element from the desired stack and extract
        the data.
        '''
        o = self.stacks[tag].pop()
        if self.elements.has_key(o):
            self.idd[self.elements[o]] = self.rawdata[o:self.abspos]
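The listing above targets Python 2 (the HTMLParser module and dict.has_key). For anyone on Python 3, here is a rough sketch of how the same offset-slicing idea might look there. The names IdTextParser and tag_start, and passing desired to the constructor, are my own choices for illustration and not part of the recipe. The sketch also makes a few assumptions worth stating: it keeps its dictionaries per instance rather than on the class, it ignores closing tags that have no matching opener, it slices up to the start of the closing tag, and it expects the whole document in a single feed() call because rawdata is trimmed between calls. It still leans on updatepos, which is an undocumented hook, so treat it as a starting point rather than a finished tool.

```python
from html.parser import HTMLParser


class IdTextParser(HTMLParser):
    """Map each id attribute value to the raw markup inside that element."""

    def __init__(self, desired="id"):
        super().__init__()
        self.desired = desired  # attribute whose values become the keys
        self.tag_start = 0      # absolute index of the tag being handled
        self.stacks = {}        # tag name -> stack of content start offsets
        self.elements = {}      # content start offset -> attribute value
        self.idd = {}           # attribute value -> inner markup

    def updatepos(self, i, j):
        # The parser calls this just before it handles a tag; j is then
        # the absolute index in rawdata where that tag begins.
        self.tag_start = j
        return super().updatepos(i, j)

    def handle_starttag(self, tag, attrs):
        # The element's content starts right after the start tag text.
        content_start = self.tag_start + len(self.get_starttag_text())
        self.stacks.setdefault(tag, []).append(content_start)
        for key, value in attrs:
            if key == self.desired:
                self.elements[content_start] = value

    def handle_endtag(self, tag):
        stack = self.stacks.get(tag)
        if not stack:
            return  # stray closing tag with no matching opener
        content_start = stack.pop()
        if content_start in self.elements:
            name = self.elements[content_start]
            # Slice from the end of the start tag to the start of the end tag.
            self.idd[name] = self.rawdata[content_start:self.tag_start]


html = '<div id="main"><p id="intro">Hello <b>world</b></p> bye</div>'
parser = IdTextParser()
parser.feed(html)  # the whole document in one call
parser.close()
print(parser.idd)
```

With this input, parser.idd['intro'] comes out as 'Hello <b>world</b>', so nested markup is kept as-is rather than flattened to plain text.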
There are a few solutions out there for this type of thing already, but I think it is best to pick the one that fits exactly what you want to do. I didn't want to extract the element information into a series of 'subelement' classes... All I wanted was the text contained within each element, and this is pretty much the easiest way to get it, and to get it properly.
Simplicity, speed, and suitability... All there.
Use BeautifulSoup. I think with BeautifulSoup this is a matter of 3 lines of code. http://www.crummy.com/software/BeautifulSoup/