
Reads an XML file into a Python dictionary of dictionaries (repeated elements are read in as lists). Modified from xmlreader.py by Christoph Dietze; it differs in that repeated elements do not need to be tagged in the XML file.

Python, 134 lines
"""
==================================================
xmlreader2.py:
Modified from: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/116539
contributed by Christoph Dietze.

Modified to allow it to work with repeating elements without having to specify the multiple attribute.
==================================================
"""
from xml.dom.minidom import parse


class NotTextNodeError(Exception):
    """Raised when a node has children that are not text nodes."""
    pass


def getTextFromNode(node):
    """
    scans through all children of node and gathers the
    text. if node has non-text child-nodes, then
    NotTextNodeError is raised.
    """
    t = ""
    for n in node.childNodes:
        if n.nodeType == n.TEXT_NODE:
            t += n.nodeValue
        else:
            raise NotTextNodeError
    return t


def nodeToDic(node):
    """
    nodeToDic() scans through the children of node and makes a
    dictionary from the content.
    three cases are differentiated:
    - if the node contains no other nodes, it is a text-node
    and {nodeName:text} is merged into the dictionary.
    - if there is more than one child with the same name
    then these children will be appended to a list and this
    list is merged to the dictionary in the form: {nodeName:list}.
    - else, nodeToDic() will call itself recursively on
    the nodes children (merging {nodeName:nodeToDic()} to
    the dictionary).
    """
    dic = {} 
    multlist = {} # holds temporary lists where there are multiple children
    for n in node.childNodes:
        multiple = False 
        if n.nodeType != n.ELEMENT_NODE:
            continue
        # find out if there are multiple records    
        if len(node.getElementsByTagName(n.nodeName)) > 1:
            multiple = True 
            # and set up the list to hold the values
            if n.nodeName not in multlist:
                multlist[n.nodeName] = []
        
        try:
            #text node
            text = getTextFromNode(n)
        except NotTextNodeError:
            if multiple:
                # append to our list
                multlist[n.nodeName].append(nodeToDic(n))
                dic.update({n.nodeName:multlist[n.nodeName]})
                continue
            else: 
                # 'normal' node
                dic.update({n.nodeName:nodeToDic(n)})
                continue

        # text node
        if multiple:
            multlist[n.nodeName].append(text)
            dic.update({n.nodeName:multlist[n.nodeName]})
        else:
            dic.update({n.nodeName:text})
    return dic


def readConfig(filename):
    dom = parse(filename)
    return nodeToDic(dom)





def test():
    dic = readConfig("sample.xml")
    
    print dic["Config"]["Name"]
    print
    print "Item Type:", dic["Config"]["Items"]["Type"]
    for item in dic["Config"]["Items"]["Item"]:
        print "Item's Name:", item["Name"]
        print "Item's Value:", item["Value"]
    
    """
    ==================================================
    sample.xml:
    ==================================================
    <?xml version="1.0" encoding="UTF-8"?>

    <Config>
        <Name>My Config File</Name>

        <Items>
            <Type>Item type</Type>
            <Item>
                <Name>First Item</Name>
                <Value>Value 1</Value>
            </Item>
            <Item>
                <Name>Second Item</Name>
                <Value>Value 2</Value>
            </Item>
        </Items>

    </Config>

    
    ==================================================
    output:
    ==================================================
    [u'My Config File']

    Item Type: Item type
    Item's Name: First Item
    Item's Value: Value 1
    Item's Name: Second Item
    Item's Value: Value 2
    """
        

Modified from a very useful script by Christoph Dietze. I've fixed the thing that troubled him: having to mark repeating elements with a 'multiple' attribute. The script reads an XML file into a Python dictionary. Any XML file can be used, and repeating elements are handled as lists. Repeating elements can be mixed with single elements.

No doubt there are faster ways to do this, but this works with the standard library and should be useful for small XML files such as config info.
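
One thing to watch: getElementsByTagName matches descendants at any depth, not just direct children, which is why Name comes back as a one-element list in the sample output above (the Item elements further down also contain a Name). Below is a minimal Python 3 sketch, not the recipe itself, that counts only direct children when deciding whether an element repeats. The names node_text and node_to_dict and the inline SAMPLE string are made up for illustration, and unlike getTextFromNode the helper skips comments instead of raising.

```python
from collections import Counter
from xml.dom.minidom import parseString


def node_text(node):
    """Return the joined text of node, or None if it has element children."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.nodeValue)
        elif child.nodeType == child.ELEMENT_NODE:
            return None
    return "".join(parts)


def node_to_dict(node):
    """Build a dict from node's element children; repeated names become lists."""
    elements = [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]
    # Count direct children only, so a same-named descendant deeper in the
    # tree does not turn a single element into a list.
    counts = Counter(c.nodeName for c in elements)
    result = {}
    for child in elements:
        text = node_text(child)
        value = node_to_dict(child) if text is None else text
        if counts[child.nodeName] > 1:
            result.setdefault(child.nodeName, []).append(value)
        else:
            result[child.nodeName] = value
    return result


# Illustrative input modelled on sample.xml (XML declaration omitted).
SAMPLE = """<Config>
    <Name>My Config File</Name>
    <Items>
        <Type>Item type</Type>
        <Item><Name>First Item</Name><Value>Value 1</Value></Item>
        <Item><Name>Second Item</Name><Value>Value 2</Value></Item>
    </Items>
</Config>"""

config = node_to_dict(parseString(SAMPLE))["Config"]
print(config["Name"])
for item in config["Items"]["Item"]:
    print(item["Name"], item["Value"])
```

With this variant Name is a plain string, while Item is still a list because it occurs twice under Items.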

See the original script and discussion at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/116539

4 comments

Duncan McGreggor 19 years ago

Another alternative... I've also written one that handles lists:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/410469

erwan h 18 years, 12 months ago

Interesting but incomplete. Pardon my remark, but your code, when confronted with NewsML, doesn't work well with repeated tags that carry properties (attributes). Here is an example extracted from the Business Wire NewsML press release system:

<test>
 <Metadata>
   <MetadataType FormalName="BWKeywords"/>
   <Property FormalName="BWCountryKeywords" Value="United States"/>
   <Property FormalName="BWIndustryKeywords" Value="Technology"/>
   <Property FormalName="BWRegionKeywords" Value="North America"/>
   <Property FormalName="BWIndustryKeywords" Value="Hardware"/>
   <Property FormalName="BWIndustryKeywords" Value="Networks"/>
   <Property FormalName="BWIndustryKeywords" Value="Software"/>
   <Property FormalName="BWStateKeywords" Value="California"/>
   <Property FormalName="BWCategoryKeywords" Value="Personnel"/>
 </Metadata>
</test>

By adding a few tests here and there, I managed to produce more interesting output, but it is not yet what I expected, and the code is probably not very Pythonic.

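# ElementTree has shipped in the standard library since Python 2.5;
# the original used the standalone package: from elementtree import ElementTree
from xml.etree import ElementTree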

class XmlListConfig(list):
    def __init__(self, aList):
        for element in aList:
            childs = list(element)  # getchildren() was removed in Python 3.9
            if childs:
                # treat like dict
                tagslist=[]
                for c in childs :
                    if not (c.tag in tagslist) :
                        tagslist.append(c.tag)
                if len(childs) == 1 or len(childs)==len(tagslist):
                    self.append(XmlDictConfig(element))
                # treat like list
                else:
                    self.append(XmlListConfig(element))
            else:
                # added handling of tag attributes
                if element.items():
                    self.append({element.tag:dict(element.items())})
                if element.text :
                    self.append({element.tag:element.text})


class XmlDictConfig(dict):
    '''
    Example usage:

    >>> tree = ElementTree.parse('your_file.xml')
    >>> root = tree.getroot()
    >>> xmldict = XmlDictConfig(root)

    Or, if you want to use an XML string:

    >>> root = ElementTree.XML(xml_string)
    >>> xmldict = XmlDictConfig(root)

(comment continued...)

erwan h 18 years, 12 months ago

(...continued from previous comment)

    And then use xmldict for what it is... a dict.
    '''
    def __init__(self, parent_element):
        if parent_element.items():
            self.update(dict(parent_element.items()))
        for element in parent_element:
            childs = list(element)  # getchildren() was removed in Python 3.9
            if childs:
                # treat like dict when there is a single child or every
                # child tag is distinct.
                tagslist=[]
                for c in childs :
                    if not (c.tag in tagslist) :
                        tagslist.append(c.tag)
                if len(childs) == 1 or len(childs)==len(tagslist):
                    aDict = XmlDictConfig(element)
                    # if the tag has attributes, add those to the dict
                    if element.items():
                        aDict.update(dict(element.items()))
                # treat like list when at least one child tag occurs
                # more than once.
                else:
                    # here, we put the list in dictionary; the key is the
                    # tag name the list elements all share in common, and
                    # the value is the list itself
                    aDict = XmlListConfig(element)
                    # if the tag has attributes, add those to the list
                    if element.items():
                        aDict.append(element.items())
                self.update({element.tag: aDict})
            # this assumes that if you've got an attribute in a tag,
            # you won't be having any text. This may or may not be a
            # good idea -- time will tell. It works for the way we are
            # currently doing XML configuration files...
            elif element.items():
                self.update({element.tag: dict(element.items())})
            # finally, if there are no child tags and no attributes, extract
            # the text
            else:
                self.update({element.tag: element.text})

erwan h 18 years, 12 months ago

Sorry for my misplaced comment; I confused this page with http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/410469. I also tried your method, but had problems with tags that have no text, like those in the example above.
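
For tags that carry only attributes, like the Property elements in the example above, here is one possible way to keep them. It is a rough Python 3 sketch using xml.etree.ElementTree, not code from either recipe. The function name element_to_value, the "@" prefix on attribute keys, and the "#text" key are all assumptions chosen for illustration (the prefix only keeps an attribute from colliding with a child tag of the same name), and the sample is a shortened copy of the NewsML fragment.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_to_value(element):
    """Convert an element to a string (text only) or a dict (anything else)."""
    children = list(element)
    text = (element.text or "").strip()
    if not children and not element.attrib:
        # Plain leaf: just the text, possibly empty.
        return text
    # Attributes are stored under "@name" keys (an illustrative convention).
    result = {"@" + name: value for name, value in element.attrib.items()}
    counts = Counter(child.tag for child in children)
    for child in children:
        value = element_to_value(child)
        if counts[child.tag] > 1:
            # Repeated tag among direct children: collect values in a list.
            result.setdefault(child.tag, []).append(value)
        else:
            result[child.tag] = value
    if text:
        result["#text"] = text  # hypothetical key for text next to attributes
    return result


# Shortened copy of the NewsML fragment quoted above.
SAMPLE = """<test>
 <Metadata>
   <MetadataType FormalName="BWKeywords"/>
   <Property FormalName="BWCountryKeywords" Value="United States"/>
   <Property FormalName="BWIndustryKeywords" Value="Technology"/>
   <Property FormalName="BWStateKeywords" Value="California"/>
 </Metadata>
</test>"""

metadata = element_to_value(ET.fromstring(SAMPLE))["Metadata"]
for prop in metadata["Property"]:
    print(prop["@FormalName"], "=", prop["@Value"])
```

Here each Property becomes a dict of its attributes and the three of them are gathered into a list, while MetadataType stays a single dict.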