Using MSHTML to Parse HTML (Python recipe) by Bill Bell
ActiveState Code (http://code.activestate.com/recipes/135702/)

MSHTML is the COM component that Internet Explorer has used to parse HTML pages since version 4. It can also be used independently of IE, as shown here.

from win32com.client import Dispatch

# 'htmlfile' is the ProgID under which MSHTML is registered as a COM server
html = Dispatch('htmlfile')

# Hand the markup to the parser
html.writeln("<html><head><title>A title</title><meta name='a name' content='page description'></head><body>This is some of it. <span>And this is the rest.</span></body></html>")

print("Title: %s" % (html.title,))
print("Bag of words from body of the page: %s" % (html.body.innerText,))
print("URL associated with the page: %s" % (html.url,))
print("Display of name:content pairs from meta tags: ")
metas = html.getElementsByTagName("meta")
for i in range(metas.length):
    print("\t%s: %s" % (metas[i].name, metas[i].content))

This may be the easiest way to parse HTML, at least on Windows. Use the 'writeln' method to hand the HTML page to MSHTML, then read out pieces of the page through the component's methods and properties.
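
Reading values out of the parsed page can be wrapped in a small helper. The sketch below is not part of the original recipe: 'meta_pairs' is a hypothetical name, and it assumes 'document' is an MSHTML document that has already been filled via 'writeln'. It relies only on the 'getElementsByTagName', 'length', 'name' and 'content' members used in the recipe.

```python
def meta_pairs(document):
    """Return {name: content} for the meta tags of a parsed document.

    `document` is assumed to be an MSHTML document, such as the object
    returned by Dispatch('htmlfile') after writeln() has been called.
    """
    metas = document.getElementsByTagName("meta")
    pairs = {}
    for i in range(metas.length):
        meta = metas[i]
        # Skip meta tags that carry no name (assumed to be empty or None)
        if meta.name:
            pairs[meta.name] = meta.content
    return pairs
```

Returning a dictionary rather than printing makes the values easy to reuse, at the cost of keeping only the last 'content' when two meta tags share a name.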

Normally one would use 'open' or 'urlopen', followed by 'read()', to obtain the content of the page to be parsed, and then pass that content to 'writeln'.

Because the structure that MSHTML exposes is quite complex, PythonWin can be very helpful: it displays the properties and methods available through each interface.
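
The fetch-then-'writeln' flow described above might look like the following. This is a minimal sketch under stated assumptions, not the author's code: 'new_document' and 'load_page' are hypothetical helper names, the page is assumed to decode as UTF-8 unless told otherwise, and the win32com import is deferred so the module can be loaded on machines without pywin32.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def new_document():
    """Create an empty MSHTML document (Windows with pywin32 only)."""
    from win32com.client import Dispatch  # deferred: Windows-only dependency
    return Dispatch('htmlfile')


def load_page(document, url, encoding='utf-8'):
    """Download `url` and write its markup into `document`.

    `encoding` is an assumption; the real encoding of the page is not
    detected here.
    """
    response = urlopen(url)
    try:
        raw = response.read()
    finally:
        response.close()
    # 'replace' keeps badly encoded bytes from raising an exception
    document.writeln(raw.decode(encoding, 'replace'))
    return document


# Example use on Windows (the URL is a placeholder):
# html = load_page(new_document(), 'http://www.example.com/')
# print(html.title)
```

Passing the document in as a parameter keeps the download step separate from COM, so it can be exercised with any object that has a 'writeln' method.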

MSHTML can also be invoked in ways that change its behavior, for instance so that scripts in the page are not executed. The details are shown in a Microsoft example called 'walkall'. I have not worked out the corresponding Python code.

Tags: web

Created by Bill Bell on Wed, 26 Jun 2002 (PSF)

Licensed under the PSF License