ActiveState Code

Recipe 116539: turn the structure of an XML document into a combination of dictionaries and lists


I decided not to customize the XML parser to fit the structure of one particular XML document, but to write a parser that adapts to the structure of whatever document it reads. Once the document is converted this way, its elements are simple to access and very little code has to be customized.

Python
==================================================
xmlreader.py:
==================================================
from xml.dom.minidom import parse


class NotTextNodeError(Exception):
    """raised when a node has a child that is not a text node."""


def getTextFromNode(node):
    """
    scans through all children of node and gathers the
    text. if node has non-text child-nodes, then
    NotTextNodeError is raised.
    """
    t = ""
    for n in node.childNodes:
        if n.nodeType == n.TEXT_NODE:
            t += n.nodeValue
        else:
            raise NotTextNodeError
    return t


def nodeToDic(node):
    """
    nodeToDic() scans through the children of node and makes a
    dictionary from the content.
    three cases are differentiated:
        - if the node contains no other nodes, it is a text-node
          and {nodeName:text} is merged into the dictionary.
        - if the node has the attribute "multiple" set to "true",
          then its children will be appended to a list and this
          list is merged into the dictionary in the form: {nodeName:list}.
        - else, nodeToDic() will call itself recursively on
          the node's children (merging {nodeName:nodeToDic()} into
          the dictionary).
    """
    dic = {}
    for n in node.childNodes:
        if n.nodeType != n.ELEMENT_NODE:
            continue
        if n.getAttribute("multiple") == "true":
            # node with multiple children:
            # put them in a list
            l = []
            for c in n.childNodes:
                if c.nodeType != c.ELEMENT_NODE:
                    continue
                l.append(nodeToDic(c))
            # merge the list once, after all children are collected
            dic.update({n.nodeName:l})
            continue

        try:
            text = getTextFromNode(n)
        except NotTextNodeError:
            # 'normal' node
            dic.update({n.nodeName:nodeToDic(n)})
            continue

        # text node
        dic.update({n.nodeName:text})
    return dic


def readConfig(filename):
    # parse the file into a DOM tree and convert the whole document
    dom = parse(filename)
    return nodeToDic(dom)


def test():
    dic = readConfig("sample.xml")

    print(dic["Config"]["Name"])
    print()
    for item in dic["Config"]["Items"]:
        print("Item's Name:", item["Name"])
        print("Item's Value:", item["Value"])


if __name__ == "__main__":
    test()



==================================================
sample.xml:
==================================================
<?xml version="1.0" encoding="UTF-8"?>

<Config>
    <Name>My Config File</Name>
    
    <Items multiple="true">
	<Item>
	    <Name>First Item</Name>
	    <Value>Value 1</Value>
	</Item>
	<Item>
	    <Name>Second Item</Name>
	    <Value>Value 2</Value>
	</Item>
    </Items>

</Config>



==================================================
output:
==================================================
My Config File

Item's Name: First Item
Item's Value: Value 1
Item's Name: Second Item
Item's Value: Value 2

Discussion

The big advantage of this recipe is that you never define the structure of the XML document; you just use it.
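
To make "just use it" concrete, here is a minimal sketch of the nested structure the recipe produces for the sample document. It is a condensed restatement of the recipe's functions, not a replacement for them. Reading the XML from an inline string with `parseString` and the names `SAMPLE` and `config` are assumptions made only to keep the example self-contained.

```python
from xml.dom.minidom import parseString

# Assumption: the sample document is inlined here instead of read from sample.xml.
SAMPLE = """<Config>
    <Name>My Config File</Name>
    <Items multiple="true">
        <Item><Name>First Item</Name><Value>Value 1</Value></Item>
        <Item><Name>Second Item</Name><Value>Value 2</Value></Item>
    </Items>
</Config>"""


class NotTextNodeError(Exception):
    pass


def getTextFromNode(node):
    # Concatenate text children; any other child means "not a text node".
    text = ""
    for n in node.childNodes:
        if n.nodeType != n.TEXT_NODE:
            raise NotTextNodeError
        text += n.nodeValue
    return text


def nodeToDic(node):
    dic = {}
    for n in node.childNodes:
        if n.nodeType != n.ELEMENT_NODE:
            continue
        if n.getAttribute("multiple") == "true":
            # Children of a multiple="true" element become a list of dicts.
            dic[n.nodeName] = [nodeToDic(c) for c in n.childNodes
                               if c.nodeType == c.ELEMENT_NODE]
            continue
        try:
            dic[n.nodeName] = getTextFromNode(n)
        except NotTextNodeError:
            dic[n.nodeName] = nodeToDic(n)
    return dic


# No schema or class definitions: the keys simply mirror the element names.
config = nodeToDic(parseString(SAMPLE))["Config"]
print(config["Name"])
print([item["Value"] for item in config["Items"]])
```

Element names map straight to dictionary keys, so adding a new element to the XML file needs no change in the reader, only in the code that looks the new key up.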

One thing that bothers me is that you must set the attribute multiple="true" on an element if you want its children to be put in a list.
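
One possible way around the attribute, sketched here under stated assumptions rather than as part of the recipe: treat element names that repeat among the direct children of a node as a list. The function name `nodeToDicAuto` and the inline test document are hypothetical. Note that the resulting shape differs from the recipe's: the list hangs under the repeated child's name, not under the parent's.

```python
from collections import Counter
from xml.dom.minidom import parseString


def nodeToDicAuto(node):
    """Hypothetical variant: sibling elements sharing a name become a list."""
    children = [n for n in node.childNodes if n.nodeType == n.ELEMENT_NODE]
    if not children:
        # Leaf element: return its concatenated text content.
        return "".join(n.nodeValue for n in node.childNodes
                       if n.nodeType == n.TEXT_NODE)

    # Count names among direct children only, not deeper descendants.
    counts = Counter(n.nodeName for n in children)
    dic = {}
    for n in children:
        value = nodeToDicAuto(n)
        if counts[n.nodeName] > 1:
            dic.setdefault(n.nodeName, []).append(value)
        else:
            dic[n.nodeName] = value
    return dic


doc = parseString(
    "<Shared><Folder>c:/Mp3</Folder><Folder>d:/Tmp</Folder></Shared>"
)
result = nodeToDicAuto(doc)
print(result["Shared"]["Folder"])
```

The trade-off is that a list which happens to contain a single element is indistinguishable from a plain child, so the caller gets a value instead of a one-item list. The explicit multiple="true" attribute does not have that ambiguity.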

Comments

  1. At 9:54 p.m. on 11 sep 2002, John Bair said:

    An alternate solution. Good idea. See my xml2obj recipe which is a variation on the theme that uses the expat parser for lower overhead and a stack to keep track of parents.

  2. At 8:32 a.m. on 19 dec 2002, Kevin Manley said:

    Check out pyRXP. pyRXP from Reportlab turns XML into a python tuple tree and is extremely fast. Check it out (http://www.reportlab.com/xml/pyrxp.html)

  3. At 8:20 p.m. on 14 may 2003, Chris Ryland said:

    buglet? Shouldn't the first

    dic.update({n.nodeName:l})
    

    be outdented one level? Otherwise, it's adding the partially-built list to the dictionary every time through the loop. (Or maybe it's late and I'm seeing double. ;-)

  4. At 11:15 a.m. on 5 apr 2004, Pawel Zdziechowicz said:

    Improvement? Great idea! Very nice for small config files. It is also good to add something like this:

    tmp = nodeToDic(c)
    if tmp != {}:
      l.append(tmp)
    else:
      l.append(getTextFromNode(c))
    
    e.g. a piece of an XML file:
    
    <Shared multiple="true">
      <Folder>c:\Mp3</Folder>
      <Folder>d:\Tmp</Folder>
    </Shared>
    
    without:  {.. u'Shared': [{}, {}] ..} ,
    with: {.. u'Shared': [u'c:\\Mp3', u'd:\\Tmp'] ..}
    
  5. At 11:21 p.m. on 14 apr 2005, Peter Neish said:

    Another improvement? How about this as an alternative to allow it to work without specifying the multiple attribute?

    def nodeToDic(node):
    
        dic = {}
        multlist = {} # holds temporary lists where there are multiple children
        for n in node.childNodes:
            if n.nodeType != n.ELEMENT_NODE:
                continue
    
            # find out if there are multiple records; this is decided anew
            # for every child, so one repeated name does not affect the rest
            # (getElementsByTagName also counts deeper descendants)
            multiple = len(node.getElementsByTagName(n.nodeName)) > 1
            if multiple and n.nodeName not in multlist:
                # and set up the list to hold the values
                multlist[n.nodeName] = []
    
            try:
                #text node
                text = getTextFromNode(n)
            except NotTextNodeError:
                if multiple:
                    # append to our list
                    multlist[n.nodeName].append(nodeToDic(n))
                    dic.update({n.nodeName:multlist[n.nodeName]})
                    continue
                else:
                    # 'normal' node
                    dic.update({n.nodeName:nodeToDic(n)})
                    continue
    
            # text node
            if multiple:
                multlist[n.nodeName].append(text)
                dic.update({n.nodeName:multlist[n.nodeName]})
            else:
                dic.update({n.nodeName:text})
        return dic
    
