Extract elements with id attributes from HTML (Python recipe) by James Kassemi
ActiveState Code (http://code.activestate.com/recipes/496787/)

A simple HTML parser, subclassing HTMLParser, that collects a dictionary of 'id': 'text' pairs, where 'id' is the value of an element's id attribute and 'text' is the raw content between that element's start and end tags. A nice demonstration of using the HTMLParser class.

from html.parser import HTMLParser  # Python 3; on Python 2 use "from HTMLParser import HTMLParser"


class IdParser(HTMLParser):
    ''' Parses HTML and places any elements with an ID attribute in a
    dictionary for later access... '''

    def __init__(self, desired='id'):
        ''' Change desired to something other than 'id'
            to get other unique elements. '''
        super().__init__()
        self.desired = desired
        # Per-instance state; class-level dicts would be shared by every parser.
        self.stacks = {}    # tag name -> offsets where each open element's content begins
        self.elements = {}  # content start offset -> value of the desired attribute
        self.idd = {}       # attribute value -> raw content between the tags
        self.abspos = 0
        self.abspos2 = 0

    def updatepos(self, i, j):
        # overridden to keep track of our absolute position;
        # line number / offset alone doesn't help much
        self.abspos = i   # where the chunk just consumed started
        self.abspos2 = j  # where the construct about to be handled starts
        return super().updatepos(i, j)

    def handle_starttag(self, tag, attrs):
        # The element's content begins right after its start tag.
        end = self.abspos2 + len(self.get_starttag_text())
        self.stacks.setdefault(tag, []).append(end)

        for key, value in attrs:
            if key == self.desired:
                self.elements[end] = value

    def handle_endtag(self, tag):
        ''' Pop an element from the desired stack and
            extract the data. '''

        if not self.stacks.get(tag):
            return  # stray end tag with no matching start tag
        o = self.stacks[tag].pop()
        if o in self.elements:
            # abspos2 is the offset of this end tag's '<', so the slice covers
            # everything between the tags. Offsets index into self.rawdata,
            # so pass the whole document in a single feed() call.
            self.idd[self.elements[o]] = self.rawdata[o:self.abspos2]
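
The recipe depends on knowing the absolute offset at which each tag starts. The sketch below is a minimal illustration of that mechanism in Python 3, not part of the original recipe: PositionTracer is a hypothetical helper, and it assumes (as in CPython's implementation) that the parser calls the undocumented updatepos(i, j) hook with j set to the offset of the tag it is about to handle.

```python
from html.parser import HTMLParser


class PositionTracer(HTMLParser):
    """Hypothetical helper: records the offset at which each tag starts."""

    def __init__(self):
        super().__init__()
        self.tag_start = 0  # offset of the construct about to be handled
        self.seen = []      # (kind, tag, start offset) tuples

    def updatepos(self, i, j):
        # Assumption: the base parser advances from offset i to offset j,
        # and j is where the next tag's '<' sits when a tag handler runs.
        self.tag_start = j
        return super().updatepos(i, j)

    def handle_starttag(self, tag, attrs):
        self.seen.append(("start", tag, self.tag_start))

    def handle_endtag(self, tag):
        self.seen.append(("end", tag, self.tag_start))


html = '<p id="a">hello</p>'
tracer = PositionTracer()
tracer.feed(html)
tracer.close()

print(tracer.seen)  # [('start', 'p', 0), ('end', 'p', 15)]

# Content runs from the end of the start tag to the start of the end tag.
content_start = tracer.seen[0][2] + len('<p id="a">')
content_end = tracer.seen[1][2]
print(html[content_start:content_end])  # hello
```

Slicing the raw input between those two offsets is what yields the content stored for each id.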

      

There are already a few solutions out there for this kind of thing, but I think it pays to find the one that fits exactly what you want to do. I didn't want to extract the element information into a series of 'subelement' classes. All I wanted was the text contained within each element, and this is pretty much the easiest way to get it.

Simplicity, speed, and suitability... All there.
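
If only the plain text is wanted, with nested markup stripped, a variant that relies solely on the documented handler methods may be enough. This is a rough sketch under that assumption rather than the recipe's own method; IdTextCollector, VOID_TAGS, and id_text are names made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never get an end tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IdTextCollector(HTMLParser):
    """Hypothetical variant: maps each id to the plain text inside that element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of (tag, id or None) for open elements
        self.texts = {}          # id -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        element_id = dict(attrs).get("id")
        if element_id is not None:
            self.texts[element_id] = []
        self.open_elements.append((tag, element_id))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for index in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[index][0] == tag:
                del self.open_elements[index:]
                break

    def handle_data(self, data):
        # Text belongs to every currently open element that has an id.
        for _tag, element_id in self.open_elements:
            if element_id is not None:
                self.texts[element_id].append(data)


def id_text(html):
    """Return a dict of id -> concatenated text for the given HTML string."""
    parser = IdTextCollector()
    parser.feed(html)
    parser.close()
    return {key: "".join(parts) for key, parts in parser.texts.items()}


result = id_text('<div id="box"><b id="name">Ada</b> Lovelace</div>')
print(result)  # {'box': 'Ada Lovelace', 'name': 'Ada'}
```

The trade-off is that inner tags are discarded, whereas slicing the raw input keeps them.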

1 comment

Foo Bear 17 years, 10 months ago

Use BeautifulSoup. I think with BeautifulSoup this is a matter of three lines of code. http://www.crummy.com/software/BeautifulSoup/

Created by James Kassemi on Thu, 8 Jun 2006 (PSF)


Required Modules

htmlparser

Other Information and Tasks

Licensed under the PSF License
Viewed 24901 times
Revision 1
