This recipe shows how to derive a class from NullWriter that accumulates the text in the body of an HTML page, how to derive a class from HTMLParser that retains meta tag information, and how to instantiate these classes and display a typical result of using them.
>>> from formatter import AbstractFormatter, NullWriter
>>> from htmllib import HTMLParser
>>> from string import join
>>> class myWriter(NullWriter):
...     def send_flowing_data(self, str):
...         self._bodyText.append(str)
...     def __init__(self):
...         NullWriter.__init__(self)
...         self._bodyText = []
...     def _get_bodyText(self):
...         return join(self._bodyText, " ")
...     bodyText = property(_get_bodyText, None, None, "plain text from body")
...
>>> class MilenaHTMLParser(HTMLParser):
...     def do_meta(self, attrs):
...         self.metas = attrs
...
>>> mywriter = myWriter()
>>> abstractformatter = AbstractFormatter(mywriter)
>>> parser = MilenaHTMLParser(abstractformatter)
>>> parser.feed(open(r'c:\temp.htm').read())
>>> parser.title
'Astronomical Summary: Hamilton, Ontario'
>>> parser.metas
[('http-equiv', 'REFRESH'), ('content', '1800')]
>>> parser.formatter.writer.bodyText
'Hamilton, Ontario Picture of Earth Local Date 31 October 2001 Temperature 10.8 \xb0C Observation Location 43.17 N, 79.93 W Sunrise: Sunset: 6:53 am 5:13 pm The Moon is Waxing Gibbous (98% of Full) Image created with xearth  . Page inspired by design at AUSOM.'
This recipe uses properties, a language feature that became available in Python 2.2. To use it under earlier versions, leave out the property line and call _get_bodyText() in place of reading bodyText to access the plain text from the body of the page.
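The two ways of reading the accumulated text can be compared in isolation. The sketch below uses a hypothetical stand-in class, TextAccumulator, which is not part of the recipe and does not inherit from NullWriter, so that it runs without the formatter module. It only illustrates that the property and the getter method return the same string.

```python
class TextAccumulator(object):
    """Hypothetical stand-in for myWriter, without the formatter dependency."""

    def __init__(self):
        self._bodyText = []

    def send_flowing_data(self, data):
        # Collect each chunk of text as it arrives.
        self._bodyText.append(data)

    def _get_bodyText(self):
        # Join the chunks with single spaces, as the recipe does.
        return " ".join(self._bodyText)

    # Where properties are available, expose the getter as a read-only attribute.
    bodyText = property(_get_bodyText, None, None, "plain text from body")


writer = TextAccumulator()
writer.send_flowing_data("Hamilton,")
writer.send_flowing_data("Ontario")

via_property = writer.bodyText        # attribute-style access
via_method = writer._get_bodyText()   # explicit call, no property needed
```

Both names end up bound to the same joined string; the property is a convenience rather than a requirement.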
This recipe cannot cope with every page: (a) poorly formed comments in the HTML make htmllib fail, and (b) for some pages the 'plain text' will contain debris from the page. Still, the code above may help as a leg-up in using these facilities.
The results shown were obtained using PythonWin.
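The htmllib and formatter modules used in this recipe are not available in current Python 3 releases. As a rough sketch only, and not the recipe's own method, a comparable result can be approached with html.parser.HTMLParser from the Python 3 standard library. The class name PageSummaryParser, its attributes, and the sample markup are assumptions made for illustration. Unlike the recipe's do_meta, which keeps only the attributes of the last meta tag seen, this version keeps one entry per meta tag, and it skips the content of script and style elements as one guess at a source of debris in the plain text.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Hypothetical parser: collects title, meta attributes, and visible text."""

    SKIPPED = ("script", "style")  # elements whose content is not page text

    def __init__(self):
        HTMLParser.__init__(self)
        self.title = ""
        self.metas = []          # one list of (name, value) pairs per meta tag
        self._chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.metas.append(attrs)
        elif tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            # Normalize runs of whitespace before storing the chunk.
            self._chunks.append(" ".join(data.split()))

    @property
    def bodyText(self):
        return " ".join(self._chunks)


# Made-up sample markup, used only to exercise the sketch.
sample = (
    "<html><head><title>Astronomical Summary</title>"
    '<meta http-equiv="REFRESH" content="1800">'
    "<style>p { color: red }</style></head>"
    "<body><p>Local Date</p><p>31 October 2001</p></body></html>"
)
parser = PageSummaryParser()
parser.feed(sample)
parser.close()
```

After feeding the sample, parser.title, parser.metas, and parser.bodyText hold the title, the meta attributes, and the text of the two paragraphs, with the style rule left out.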