ActiveState Code

Recipe 534109: XML to Python data structure


This simple method constructs a Python data structure from XML in one step. Data is accessed using the Pythonic "object.attribute" notation. See the discussion below for usage examples.

Python
import re
import xml.sax.handler

try:
    string_types = (basestring,)    # Python 2
except NameError:
    string_types = (str, bytes)     # Python 3

def xml2obj(src):
    """
    A simple function to convert XML data into a native Python object.
    """

    non_id_char = re.compile('[^_0-9a-zA-Z]')
    def _name_mangle(name):
        # replace characters that are not valid in a Python identifier
        return non_id_char.sub('_', name)

    class DataNode(object):
        def __init__(self):
            self._attrs = {}    # XML attributes and child elements
            self.data = None    # child text data
        def __len__(self):
            # treat single element as a list of 1
            return 1
        def __getitem__(self, key):
            if isinstance(key, string_types):
                return self._attrs.get(key,None)
            else:
                return [self][key]
        def __contains__(self, name):
            return name in self._attrs
        def __nonzero__(self):
            return bool(self._attrs or self.data)
        __bool__ = __nonzero__    # Python 3 name for the truth-value hook
        def __getattr__(self, name):
            if name.startswith('__'):
                # let lookups of special (double-underscore) names fail
                # normally instead of returning None
                raise AttributeError(name)
            return self._attrs.get(name,None)
        def _add_xml_attr(self, name, value):
            if name in self._attrs:
                # multiple attributes or children of the same name are
                # represented by a list
                children = self._attrs[name]
                if not isinstance(children, list):
                    children = [children]
                    self._attrs[name] = children
                children.append(value)
            else:
                self._attrs[name] = value
        def __str__(self):
            return self.data or ''
        def __repr__(self):
            items = sorted(self._attrs.items())
            if self.data:
                items.append(('data', self.data))
            return u'{%s}' % ', '.join([u'%s:%s' % (k,repr(v)) for k,v in items])

    class TreeBuilder(xml.sax.handler.ContentHandler):
        def __init__(self):
            self.stack = []
            self.root = DataNode()
            self.current = self.root
            self.text_parts = []
        def startElement(self, name, attrs):
            self.stack.append((self.current, self.text_parts))
            self.current = DataNode()
            self.text_parts = []
            # xml attributes --> python attributes
            for k, v in attrs.items():
                self.current._add_xml_attr(_name_mangle(k), v)
        def endElement(self, name):
            text = ''.join(self.text_parts).strip()
            if text:
                self.current.data = text
            if self.current._attrs:
                obj = self.current
            else:
                # a text only node is simply represented by the string
                obj = text or ''
            self.current, self.text_parts = self.stack.pop()
            self.current._add_xml_attr(_name_mangle(name), obj)
        def characters(self, content):
            self.text_parts.append(content)

    builder = TreeBuilder()
    if isinstance(src, string_types):
        xml.sax.parseString(src, builder)
    else:
        xml.sax.parse(src, builder)
    # the synthetic root holds exactly one child: the document element
    return list(builder.root._attrs.values())[0]

Discussion

XML is a popular means of encoding data to share between systems. Despite its ubiquity, there is no straightforward way to translate XML into a Python data structure. Traditional APIs like DOM and SAX often require an undue amount of work to access even the simplest piece of data.
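
To make that point concrete, here is a minimal sketch of pulling a name and two phone numbers out of a small document with the standard library's xml.dom.minidom. The sample document, the `element_text` helper, and the variable names are assumptions made for illustration; they are not part of the recipe.

```python
from xml.dom import minidom, Node

# Illustrative sample document (an assumption for this sketch).
DOC = """<address_book>
  <person gender="m">
    <name>fred</name>
    <phone type="home">54321</phone>
    <phone type="cell">12345</phone>
  </person>
</address_book>"""

def element_text(element):
    """Concatenate the text and CDATA children of a DOM element."""
    return ''.join(
        child.data for child in element.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    ).strip()

dom = minidom.parseString(DOC)
person = dom.getElementsByTagName('person')[0]

gender = person.getAttribute('gender')
name = element_text(person.getElementsByTagName('name')[0])
phones = [
    (phone.getAttribute('type'), element_text(phone))
    for phone in person.getElementsByTagName('phone')
]
```

Every value needs an explicit element lookup plus a separate step to collect its text, which is the overhead this recipe tries to remove.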

This method converts XML data into a natural Pythonic data structure. For example:

>>> SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
... <address_book>
...   <person gender='m'>
...     <name>fred</name>
...     <phone type='home'>54321</phone>
...     <phone type='cell'>12345</phone>
...     <note>&quot;A<!-- comment --><![CDATA[ <note>]]>&quot;</note>
...   </person>
... </address_book>
... """
>>> address_book = xml2obj(SAMPLE_XML)
>>> person = address_book.person

To access its data, you can do the following:

person.gender        -> 'm'     # an attribute
person['gender']     -> 'm'     # alternative dictionary syntax
person.name          -> 'fred'  # shortcut to a text node
person.phone[0].type -> 'home'  # multiple elements become a list
person.phone[0].data -> '54321' # use .data to get the text value
str(person.phone[0]) -> '54321' # alternative syntax for the text value
person[0]            -> person  # if there is only one <person>, it can still
                                # be used as if it were a list of 1 element.
'address' in person  -> False   # test for existence of an attr or child
person.address       -> None    # a nonexistent element returns None
bool(person.address) -> False   # has any 'address' data (attr, child or text)
person.note          -> '"A <note>"'

This function is inspired by David Mertz's Gnosis objectify utilities. The motivation for writing this recipe is simplicity. With fewer than 100 lines of code packaged into a single function, it can easily be embedded in other code for ease of distribution.

Comments

  1. At 2:30 p.m. on 13 Oct 2007, Wai Yip Tung (the author) said:

    Known issues.

    1. If you have an attribute named "data", it has to be accessed using "node['data']". "node.data" refers to the text value.
    
    2. The text node shortcut may not be stable.
    
    e.g. If the src is <name>fred</name>,
    
         name -> 'fred'
    
    However, if the src is <name title='mr'>fred</name>, then
    
         name -> the name node
         name.data -> 'fred'
    
    You can always use str(name) -> 'fred', however.
    
  2. At 6:26 p.m. on 13 Oct 2007, Paul Miller said:

    A small nit. It should be noted that if your XML data has an attribute which is a Python keyword, this isn't going to work. For example, using "print" as an attribute is not going to work out well.

    You could fix this with a little work, say, wrapping attributes in an XMLAttr class, or something. Or, you could simply map names like "print" to python attributes "_print". Or, you can simply accept that this is a limitation of this recipe. :-)

    Overall, I think the second and third solutions are better than the first.
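
The second suggestion above (mapping a keyword such as "print" to "_print") could be sketched as a variant of the recipe's `_name_mangle()`. The name `safe_name` is hypothetical, and the choice of a leading underscore simply follows the comment; this is an illustration under those assumptions, not the author's code.

```python
import keyword
import re

_NON_ID_CHAR = re.compile('[^_0-9a-zA-Z]')

def safe_name(name):
    """Hypothetical variant of the recipe's _name_mangle().

    Replaces characters that are not valid in identifiers with '_', then
    prefixes '_' when the result is a reserved word of the running
    interpreter.
    """
    mangled = _NON_ID_CHAR.sub('_', name)
    if keyword.iskeyword(mangled):
        mangled = '_' + mangled
    return mangled
```

Which names count as keywords depends on the interpreter: "print" is a keyword only in Python 2, while "class" is reserved in both Python 2 and 3. One trade-off of this approach is that dictionary-style lookups would then also have to use the mangled key.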

  3. At 5:17 p.m. on 16 Oct 2007, Wai Yip Tung (the author) said:

    use dictionary syntax.

    Use dictionary syntax to get around the keyword issue.
    
    e.g.
      node['print']
    
  4. At 11:29 a.m. on 19 Oct 2007, Wai Yip Tung (the author) said:

    Support iteration. Fixed __getitem__() to better support iteration

  5. At 1:12 p.m. on 24 Oct 2007, Adam Atlas said:

    Multiple items. One thing about this that I find concerning is the possibility of having a schema (just in the abstract sense -- some structure in mind) where some element can have multiple children of the same name, but where that number could just as easily be one. It seems like in this situation, any code that uses this recipe will have to check whether or not the value is a list every time it accesses such a structure.

    Like, in your example -- the phone tag. If I were using this to insert into a database, I'd always want to get the phone numbers as a list, even if there were only one. (And it seems pretty silly to assume that everyone will have at least two.) Also, what about the reverse -- you're only expecting one value for some element, but it's an improperly constructed file that gives multiple. I suppose you could solve both of these with isinstance() idioms on a case-by-case basis, but it seems like that would get tedious.

    Can you think of an elegant, Pythonic solution to this? Because I actually encounter this problem all the time parsing similar data structures (GET query-strings, INI-style configuration files, etc.) and I have yet to find a solution I'm completely happy with.
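
One common way to avoid repeating the isinstance() check at every access is a small normalizing helper. The sketch below assumes the three shapes the recipe can return for a child: None when it is missing, a list when it is repeated, and a single object otherwise. The name `as_list` is hypothetical.

```python
def as_list(value):
    """Normalize a lookup result to a list.

    None (missing child) becomes [], a list is returned unchanged, and any
    other single value is wrapped in a one-element list.
    """
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]
```

Wrapping matters most for text-only children, which the recipe stores as plain strings: looping over such a string directly would walk its characters, whereas `as_list(person.phone)` would yield one item per phone element.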

  6. At 2:03 p.m. on 24 Oct 2007, Wai Yip Tung (the author) said:

    It becomes a list of 1. Hi Adam. I hear you. That's why it has some magic to treat a single element as a list of 1. For example, there is only 1 person in this XML message. But you can do:

    >>> len(address_book.person)
    1
    >>> for p in address_book.person:
    ...   print p.name
    ...
    fred
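
The "list of 1" behavior relies on Python's sequence protocol: an object that defines `__len__` and an integer `__getitem__` can be measured and iterated. The class below is a hypothetical stand-in that isolates just that part of DataNode, as a sketch rather than the recipe's actual class.

```python
class Single(object):
    """Hypothetical stand-in for DataNode: an object that acts as a list of itself."""

    def __init__(self, name):
        self.name = name

    def __len__(self):
        return 1

    def __getitem__(self, index):
        # Delegate to a real one-element list so that 0 and -1 return self
        # and any other index raises IndexError, which ends a for loop.
        return [self][index]

fred = Single('fred')
names = [p.name for p in fred]
```

Because no `__iter__` is defined, the loop falls back to calling `__getitem__` with 0, 1, 2, ... until IndexError is raised, so it visits the object exactly once.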
    
  7. At 9:34 a.m. on 24 Jan 2008, Jeremy Bunn said:

    If you get the error: TypeError: 'DataNode' object does not support item assignment. A simple fix...

    In rare cases, you may want to set an item back into the data structure.

    This worked for me (add to DataNode and fix indentation problems)

        def __setitem__(self, key, value):
            self._attrs[key] = value

    BTW, this is one of the best xml to object mapping snippets I have found. The array handling is particularly nice.

    If you are a Perl programmer looking for a Python equivalent of XML::Simple, this is the closest I have seen.

  8. At 4:34 a.m. on 6 Jan 2009, david SHI said:

    I wish to read in an xml, extract data and put data in a .dbf. Can anyone help?

    Regards. davidgshi@yahoo.co.uk
