
This recipe shows how to dump the structure of an HTML5 document by parsing it with the html5lib Python library and walking the resulting element tree recursively.

Python, 21 lines
# Demo program to show how to dump the structure of 
# an HTML5 document to text, using html5lib.
# Author: Vasudev Ram.
# Copyright 2015 Vasudev Ram - http://www.dancingbison.com

import html5lib

# Define a function to dump HTML5 element info recursively, 
# given a top-level element.
def print_element(elem, indent, level):
    for sub_elem in elem:
        print("{}{}".format(indent * level, sub_elem))
        # Recursive call to print_element().
        print_element(sub_elem, indent, level + 1)

# Parse the HTML document; binary mode lets html5lib detect the encoding.
with open("html5doc.html", "rb") as f:
    tree = html5lib.parse(f)
indent = '----'
level = 0
print_element(tree, indent, level)

This recipe is useful when you want to print the element tree of an HTML5 document to standard output, so that you can see how the document is organized before modifying or improving it.
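
With html5lib's default tree builder, the parsed tree is made of xml.etree.ElementTree elements, so printing an element shows its repr, including the XHTML namespace in the tag. The sketch below is one possible variant of the same recursion, not part of the original recipe: it collects bare tag names into a list instead of printing element reprs. The helper names local_name and dump_lines are hypothetical, and the small tree is built by hand as a stand-in for the result of html5lib.parse(), so the example runs without html5lib or an input file.

```python
import xml.etree.ElementTree as ET

# Namespace that html5lib puts on HTML element tags by default.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def local_name(elem):
    """Return the tag name without its '{namespace}' prefix (hypothetical helper)."""
    tag = elem.tag
    # Comment nodes have a non-string tag, so check the type before slicing.
    if isinstance(tag, str) and tag.startswith("{"):
        return tag.split("}", 1)[1]
    return str(tag)


def dump_lines(elem, indent="----", level=0):
    """Collect one line per descendant of elem, indented by depth.

    Like the recipe's print_element(), this skips elem itself and
    starts with its children.
    """
    lines = []
    for sub_elem in elem:
        lines.append("{}{}".format(indent * level, local_name(sub_elem)))
        # Recurse into the child, one level deeper.
        lines.extend(dump_lines(sub_elem, indent, level + 1))
    return lines


# Hand-built stand-in (an assumption) for the tree html5lib.parse() returns.
root = ET.Element(XHTML_NS + "html")
head = ET.SubElement(root, XHTML_NS + "head")
ET.SubElement(head, XHTML_NS + "title")
body = ET.SubElement(root, XHTML_NS + "body")
ET.SubElement(body, XHTML_NS + "p")

for line in dump_lines(root):
    print(line)
```

Returning a list rather than printing directly makes the output easy to write to a file or to compare across two versions of a document. To try it on a real page, replace the hand-built tree with the tree returned by html5lib.parse().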

More details and sample input and output in this blog post:

http://jugad2.blogspot.in/2015/02/recursively-dumping-structure-of-html5.html