MSHTML is the COM component used by Internet Explorer to parse HTML pages (since version 4 of IE). It can be used independently of IE as shown here.
    from win32com.client import Dispatch

    # 'htmlfile' is the ProgID under which MSHTML is exposed as a COM server
    html = Dispatch('htmlfile')

    # Hand the markup to MSHTML for parsing
    html.writeln(
        "<html><head><title>A title</title>"
        "<meta name='a name' content='page description'></head>"
        "<body>This is some of it. <span>And this is the rest.</span></body></html>"
    )

    print("Title: %s" % (html.title,))
    print("Bag of words from body of the page: %s" % (html.body.innerText,))
    print("URL associated with the page: %s" % (html.url,))

    print("Display of name:content pairs from meta tags: ")
    metas = html.getElementsByTagName("meta")
    for m in range(metas.length):
        print("\t%s: %s" % (metas[m].name, metas[m].content))
This may be the easiest way to parse HTML, at least on Windows. Pass the HTML page to MSHTML with the 'writeln' method, then read out pieces of the page through the component's properties and methods (for example 'title', 'body.innerText' and 'getElementsByTagName').
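As a rough sketch of reading several pieces out at once, the helper below copies the title, body text and named meta tags into an ordinary dictionary, so the rest of a program does not have to touch COM objects. The name `summarize_document` is hypothetical and not part of MSHTML; the function assumes only the attributes already used in this recipe.

```python
def summarize_document(doc):
    """Copy a few fields of an MSHTML document into a plain dict.

    `doc` is assumed to be an object returned by Dispatch('htmlfile')
    after the markup has been written to it. This helper is an
    illustration, not part of MSHTML or pywin32.
    """
    metas = doc.getElementsByTagName("meta")
    pairs = {}
    for i in range(metas.length):
        tag = metas[i]
        # Skip meta tags that carry no name attribute
        if tag.name:
            pairs[tag.name] = tag.content
    return {
        "title": doc.title,
        "text": doc.body.innerText,
        "meta": pairs,
    }
```

Called on the document built above, this should return the title, the body text and a single meta entry keyed by 'a name'.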
Normally one would obtain the content of the page to be parsed with 'open' (for a local file) or 'urlopen' (for a URL), followed by 'read()', and then pass that content to MSHTML with 'writeln'.
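The fetch-then-write step might look like the following minimal sketch. It assumes Python 3, where 'urlopen' lives in `urllib.request`. The function names `read_page` and `parse_with_mshtml` are hypothetical, and falling back to UTF-8 when the response declares no charset is an assumption rather than anything MSHTML requires.

```python
from urllib.request import urlopen


def read_page(url, default_encoding="utf-8"):
    """Fetch `url` and return its content decoded to text."""
    with urlopen(url) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when present;
        # the UTF-8 fallback is an assumption.
        charset = response.headers.get_content_charset() or default_encoding
    return raw.decode(charset, errors="replace")


def parse_with_mshtml(markup):
    """Write `markup` to a new MSHTML document and return it.

    Windows only: requires pywin32. The import is done here so that
    read_page() stays usable on other platforms.
    """
    from win32com.client import Dispatch

    doc = Dispatch("htmlfile")
    doc.writeln(markup)
    return doc
```

On Windows with pywin32 installed, `parse_with_mshtml(read_page(url)).title` would then give the title of the fetched page.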
The object model exposed by MSHTML is quite complex, so it can be very helpful to explore it in PythonWin, which displays the list of properties and methods available via each interface.
MSHTML can also be invoked in ways that change its behaviour, for instance so that scripts in the page are not executed. The details of doing this are shown in a Microsoft example called 'walkall'. I have not worked out the corresponding Python code.