XML parsers consider several conditions when deciding which whitespace-only text nodes should be preserved during DOM construction. Unfortunately, those conditions are controlled by the document's DTD or by the content of document itself. Since it is often difficult to modify the DTD or the XML, this recipe simple removes all whitespace-only text nodes from a DOM node.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
def remove_whilespace_nodes(node, unlink=False): """Removes all of the whitespace-only text decendants of a DOM node. When creating a DOM from an XML source, XML parsers are required to consider several conditions when deciding whether to include whitespace-only text nodes. This function ignores all of those conditions and removes all whitespace-only text decendants of the specified node. If the unlink flag is specified, the removed text nodes are unlinked so that their storage can be reclaimed. If the specified node is a whitespace-only text node then it is left unmodified.""" remove_list =  for child in node.childNodes: if child.nodeType == dom.Node.TEXT_NODE and \ not child.data.strip(): remove_list.append(child) elif child.hasChildNodes(): remove_whilespace_nodes(child, unlink) for node in remove_list: node.parentNode.removeChild(node) if unlink: node.unlink()
This code should work with any correctly implemented Python-DOM.