
This simple function constructs a Python data structure from XML in one step. Data is accessed using the Pythonic "object.attribute" notation. See the discussion below for usage examples.

Python, 84 lines
import re
import xml.sax.handler

def xml2obj(src):
    """
    A simple function that converts XML data into a native Python object.
    """

    non_id_char = re.compile('[^_0-9a-zA-Z]')
    def _name_mangle(name):
        return non_id_char.sub('_', name)

    class DataNode(object):
        def __init__(self):
            self._attrs = {}    # XML attributes and child elements
            self.data = None    # child text data
        def __len__(self):
            # treat single element as a list of 1
            return 1
        def __getitem__(self, key):
            if isinstance(key, basestring):
                return self._attrs.get(key,None)
            else:
                return [self][key]
        def __contains__(self, name):
            return self._attrs.has_key(name)
        def __nonzero__(self):
            return bool(self._attrs or self.data)
        def __getattr__(self, name):
            if name.startswith('__'):
                # need to do this for Python special methods???
                raise AttributeError(name)
            return self._attrs.get(name,None)
        def _add_xml_attr(self, name, value):
            if name in self._attrs:
                # multiple attributes of the same name are represented by a list
                children = self._attrs[name]
                if not isinstance(children, list):
                    children = [children]
                    self._attrs[name] = children
                children.append(value)
            else:
                self._attrs[name] = value
        def __str__(self):
            return self.data or ''
        def __repr__(self):
            items = sorted(self._attrs.items())
            if self.data:
                items.append(('data', self.data))
            return u'{%s}' % ', '.join([u'%s:%s' % (k,repr(v)) for k,v in items])

    class TreeBuilder(xml.sax.handler.ContentHandler):
        def __init__(self):
            self.stack = []
            self.root = DataNode()
            self.current = self.root
            self.text_parts = []
        def startElement(self, name, attrs):
            self.stack.append((self.current, self.text_parts))
            self.current = DataNode()
            self.text_parts = []
            # xml attributes --> python attributes
            for k, v in attrs.items():
                self.current._add_xml_attr(_name_mangle(k), v)
        def endElement(self, name):
            text = ''.join(self.text_parts).strip()
            if text:
                self.current.data = text
            if self.current._attrs:
                obj = self.current
            else:
                # a text only node is simply represented by the string
                obj = text or ''
            self.current, self.text_parts = self.stack.pop()
            self.current._add_xml_attr(_name_mangle(name), obj)
        def characters(self, content):
            self.text_parts.append(content)

    builder = TreeBuilder()
    if isinstance(src,basestring):
        xml.sax.parseString(src, builder)
    else:
        xml.sax.parse(src, builder)
    return builder.root._attrs.values()[0]

XML is a popular means of encoding data to share between systems. Despite its ubiquity, there is no straightforward way to translate XML into a Python data structure. Traditional APIs like DOM and SAX often require an undue amount of work to access the simplest piece of data.

This function converts XML data into a natural Pythonic data structure. For example:

>>> SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
... <address_book>
...   <person gender='m'>
...     <name>fred</name>
...     <phone type='home'>54321</phone>
...     <phone type='cell'>12345</phone>
...     <note>&quot;A<!-- comment --><![CDATA[ <note>]]>&quot;</note>
...   </person>
... </address_book>
... """
>>> address_book = xml2obj(SAMPLE_XML)
>>> person = address_book.person

To access its data, you can do the following:

person.gender        -> 'm'     # an attribute
person['gender']     -> 'm'     # alternative dictionary syntax
person.name          -> 'fred'  # shortcut to a text node
person.phone[0].type -> 'home'  # multiple elements become a list
person.phone[0].data -> '54321' # use .data to get the text value
str(person.phone[0]) -> '54321' # alternative syntax for the text value
person[0]            -> person  # if there is only one <person>, it can still
                                # be used as if it were a list of 1 element
'address' in person  -> False   # test for existence of an attr or child
person.address       -> None    # a nonexistent element returns None
bool(person.address) -> False   # has any 'address' data (attr, child or text)
person.note          -> '"A <note>"'

This function is inspired by David Mertz's Gnosis objectify utilities. The motivation for writing this recipe is simplicity: at under 100 lines of code packaged into a single function, it can easily be embedded in other code for ease of distribution.

25 comments

Wai Yip Tung (author) 16 years, 6 months ago  # | flag

known issues.

1. If you have an attribute named "data", it has to be accessed using "node['data']". "node.data" refers to the text value.

2. The text node shortcut may not be stable.

e.g. If the src is <name>fred</name>,

     name -> 'fred'

However, if the src is <name title='mr'>fred</name>, then

     name -> the name node
     name.data -> 'fred'

You can always use str(name) -> 'fred', however.
Paul Miller 16 years, 6 months ago  # | flag

A small nit. It should be noted that if your XML data has an attribute which is a Python keyword, this isn't going to work. For example, using "print" as an attribute is not going to work out well.

You could fix this with a little work, say, wrapping attributes in an XMLAttr class, or something. Or, you could simply map names like "print" to python attributes "_print". Or, you can simply accept that this is a limitation of this recipe. :-)

Overall, I think the second and third solutions are better than the first.
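
The second suggestion above (mapping a keyword such as "print" to "_print") could be sketched as a variant of the recipe's _name_mangle. This is only a sketch under assumptions: the name name_mangle_keyword_safe is hypothetical and not part of the recipe, and the keyword check uses the standard library's keyword.iskeyword. Note that "print" is a keyword only in Python 2; under Python 3 the same problem shows up with names like "class" or "import".

```python
import keyword
import re

# Same character class the recipe uses to turn XML names into identifiers.
_NON_ID_CHAR = re.compile(r'[^_0-9a-zA-Z]')

def name_mangle_keyword_safe(name):
    """Hypothetical variant of the recipe's _name_mangle.

    Replaces characters that are not valid in an identifier with '_',
    then prefixes '_' when the result is a Python keyword.
    """
    mangled = _NON_ID_CHAR.sub('_', name)
    if keyword.iskeyword(mangled):
        mangled = '_' + mangled
    return mangled
```

With this in place an XML attribute named "class" would be reachable as node._class, while node['_class'] would be the matching dictionary-style access.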

Wai Yip Tung (author) 16 years, 6 months ago  # | flag

use dictionary syntax.

Use dictionary syntax to get around the keyword issue.

e.g.
  node['print']
Wai Yip Tung (author) 16 years, 6 months ago  # | flag

Support iteration. Fixed __getitem__() to better support iteration

Adam Atlas 16 years, 6 months ago  # | flag

Multiple items. One thing about this that I find concerning is the possibility of having a schema (just in the abstract sense -- some structure in mind) where some element can have multiple children of the same name, but where that number could just as easily be one. It seems like in this situation, any code that uses this recipe will have to check whether or not the value is a list every time it accesses such a structure.

Like, in your example -- the phone tag. If I were using this to insert into a database, I'd always want to get the phone numbers as a list, even if there were only one. (And it seems pretty silly to assume that everyone will have at least two.) Also, what about the reverse -- you're only expecting one value for some element, but it's an improperly constructed file that gives multiple. I suppose you could solve both of these with isinstance() idioms on a case-by-case basis, but it seems like that would get tedious.

Can you think of an elegant, Pythonic solution to this? Because I actually encounter this problem all the time parsing similar data structures (GET query-strings, INI-style configuration files, etc.) and I have yet to find a solution I'm completely happy with.

Wai Yip Tung (author) 16 years, 6 months ago  # | flag

It becomes a list of 1. Hi Adam. I hear you. That's why it has some magic to treat a single element as a list of 1. For example, there is only 1 person in this XML message, but you can do:

>>> len(address_book.person)
1
>>> for p in address_book.person:
...   print p.name
...
fred
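
The "list of 1" behaviour comes from two methods on DataNode, __len__ and __getitem__. The sketch below isolates them in a hypothetical stand-in class (Node is not the recipe's class, and it omits the string-key branch of __getitem__) to show why a plain for loop works on a single object: Python falls back to calling __getitem__ with 0, 1, 2, ... until IndexError is raised.

```python
class Node(object):
    """Hypothetical stand-in copying only the list-of-1 methods of DataNode."""

    def __init__(self, name):
        self.name = name

    def __len__(self):
        # A single element reports itself as a list of length 1.
        return 1

    def __getitem__(self, key):
        # Index into a one-element list holding this object, so
        # node[0] is node and node[1] raises IndexError, which ends iteration.
        return [self][key]


person = Node('fred')
names = [p.name for p in person]   # iterates exactly once
```

Note that this only applies to DataNode objects. In the recipe, a text-only element is stored as a plain string, and iterating over a string yields its characters.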
Jeremy Bunn 16 years, 3 months ago  # | flag

If you get the error: TypeError: 'DataNode' object does not support item assignment. A simple fix...

In rare cases, you may want to set an item back into the data structure.

This worked for me (add to DataNode, indented to match its other methods):

        def __setitem__(self, key, value):
            self._attrs[key] = value

BTW, this is one of the best xml to object mapping snippets I have found. The array handling is particularly nice.

If you are a Perl programmer looking for a Python equivalent of XML::Simple, this is the closest I have seen.

david SHI 15 years, 3 months ago  # | flag

I wish to read in an xml, extract data and put data in a .dbf. Can anyone help?

Regards. davidgshi@yahoo.co.uk

Jake Hulme 14 years, 7 months ago  # | flag

Very neat and much cleaner than anything else I found - thank you!

sammy 13 years, 11 months ago  # | flag

I found this useful. However, I want to edit the python object and then translate it back to xml. But in this case, although editing the python object is easy, there is no way to translate it back to xml form. Do you have something like this? cheers

Wai Yip Tung (author) 13 years, 11 months ago  # | flag

Sorry Sammy, there is no XML generation. One of the 'features' of this recipe is that it has only 84 lines. Reading is all it does.

Matt 13 years, 2 months ago  # | flag

Whenever I try this I get the following error:

  xml.sax.parseString(src, builder)
  File "/usr/lib/python2.6/xml/sax/__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "/usr/lib/python2.6/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.6/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.6/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.6/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:2: not well-formed (invalid token)

Am I missing something? The XML file is UTF-8 encoded. Other than that, I simply downloaded the script and ran it.

Ming-Chih Kao 13 years, 2 months ago  # | flag

Is there a way to convert the DataNode object back into a standard python Dictionary object? Would like to plug the object back into pymongo. Thanks!

enricostano 13 years, 1 month ago  # | flag

Hi, thank you for the great code!

I need to read the XML from a file, like nodes.xml, and pass the path or filename as a command-line argument, just like:

python myscript.py nodes.xml

but I don't know how to change your code to reach it.

Could you help me please?

Thank you in advance!

Bye,

enrico

Will Stevens 12 years, 11 months ago  # | flag

I had to do a bunch of troubleshooting because this was not working out of the box on the XML I was working with.

I found that I had to change the following:

    return builder.root._attrs.values()[0]

to this:

    return builder.root._attrs

in order to get the function to return the object.

I hope this helps other people. Thank you for this code...

Wai Yip Tung (author) 12 years, 11 months ago  # | flag

Will, can you provide a snippet of XML that this script does not handle well?

Wai Yip Tung (author) 12 years, 11 months ago  # | flag

enricostano, you can do this

obj = xml2obj(open(filename))
Wai Yip Tung (author) 12 years, 11 months ago  # | flag

Ming-Chih Kao,

You can walk the XML tree via its _attrs attribute. This makes me think it may be a good idea to add an iterator function to the script.
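
Walking the tree via _attrs to build plain dictionaries might look like the sketch below. It is not part of the recipe: to_dict is a hypothetical helper, and the DataNode here is a minimal stand-in carrying only the _attrs and data fields the recipe's class defines. The choice to store node text under the key 'data' is an assumption, and it would collide with an XML attribute that is itself named "data".

```python
class DataNode(object):
    """Minimal stand-in for the recipe's DataNode (only _attrs and data)."""

    def __init__(self, attrs=None, data=None):
        self._attrs = attrs or {}   # XML attributes and child elements
        self.data = data            # child text data


def to_dict(value):
    """Recursively convert DataNode trees into dicts, lists and strings."""
    if isinstance(value, DataNode):
        result = dict((name, to_dict(child))
                      for name, child in value._attrs.items())
        if value.data:
            # Assumption: keep the text value under the key 'data'.
            result['data'] = value.data
        return result
    if isinstance(value, list):
        # Repeated elements are stored as a list of nodes or strings.
        return [to_dict(item) for item in value]
    # Text-only elements and attribute values are already plain strings.
    return value


person = DataNode({
    'gender': 'm',
    'name': 'fred',
    'phone': [DataNode({'type': 'home'}, '54321'),
              DataNode({'type': 'cell'}, '12345')],
})
as_plain = to_dict(person)
```

The result contains only dicts, lists and strings, which is the shape a document store or a JSON encoder would typically expect.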

Will Stevens 12 years, 11 months ago  # | flag

Hi Wai Yip Tung, Sorry, I spoke too soon. I have been working with it this morning again and I have found that what is returned is usable. My problem was that I was trying to visualize the output while I was developing and it was not showing anything.

In my GAE dev sandbox I was trying:

    result = xml2obj(SAMPLE_XML)
    import logging
    logging.debug(result)

If I made the change I mentioned previously, I was able to see the structure (not pretty, but viewable) because it was of type 'dict', whereas everything else is of type 'DataNode' (which does not display).

I tried to change 'class DataNode(object):' to 'class DataNode(dict):' to see if that would allow logging.debug to view the data, but I did not troubleshoot very much when it did not work.

I will be working with this script quite a bit in the next little while, so I will probably add some more functionality. The first thing that comes to mind is support for '__keys__'.

I apologize for assuming your code was wrong when it did not react the way I expected it to. This is a great piece of code and I appreciate the work you have done on it. I will share any features I add...

Will Stevens 12 years, 11 months ago  # | flag

So far all I have added to DataNode is:

def keys(self): return self._attrs.keys()

Mat 12 years, 7 months ago  # | flag

Hi Wai Yip Tung, Thanks for this excellent program. How can I save the resulting data structure in a file to restore it later? This would prevent the program from parsing the XML at each startup. If I use pickle I get this error message (DataNode is not found):

>>> pickle.dump(ab1,myfile)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/pickle.py", line 1362, in dump
    Pickler(file, protocol).dump(obj)
  File "/usr/lib/python2.6/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.6/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python2.6/pickle.py", line 401, in save_reduce
    save(args)
  File "/usr/lib/python2.6/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.6/pickle.py", line 562, in save_tuple
    save(element)
  File "/usr/lib/python2.6/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.6/pickle.py", line 748, in save_global
    (obj, module, name))
pickle.PicklingError: Can't pickle <class 'xml_reader.DataNode'>: it's not found as xml_reader.DataNode

where xml_reader contains the xml2obj() function. Is there any other way to save the resulting structure? Thanks, Mat

Mat 12 years, 7 months ago  # | flag

Dear Wai Yip Tung, I think I sorted the problem out. I included an instance of the DataNode class in the file containing pickle.load(myfile).

Mat ps: I'm quite newbie in Python, sorry!

Rakesh 12 years, 6 months ago  # | flag

Hi Wai Yip Tung, great program. I'm using xmlParser with Django. I want to store these XML objects as part of the Django session (it pickles the data).

I'm getting the following error: Can't pickle <class 'kontikiLogparser.xmlParse.DataNode'>: attribute lookup kontikiLogparser.xmlParse.DataNode failed

From the Django views I'm using a module which imports another module which uses xmlParser.

How do I fix this? It looks similar to the problem "Mat" faced in the previous comment, but I'm not quite sure how he fixed it.

Rakesh 12 years, 6 months ago  # | flag

Dear Wai Yip Tung, I think I found the problem. I moved class DataNode(object) out of the function xml2obj(src). It works now.
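
This fix is consistent with how pickle works: it records a class by its importable module-level name, so instances of a class defined inside a function (as DataNode is inside xml2obj) cannot be pickled. The sketch below illustrates the difference. The names ModuleLevelNode, make_local_node and can_pickle are hypothetical, and the exact exception type raised for a local class varies between Python versions, so both are caught.

```python
import pickle


class ModuleLevelNode(object):
    """Defined at module level, so pickle can find it again by name."""

    def __init__(self):
        self._attrs = {}
        self.data = None


def make_local_node():
    """Return an instance of a class that exists only inside this function."""
    class LocalNode(object):
        pass
    return LocalNode()


def can_pickle(obj):
    """Report whether pickle.dumps succeeds for obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError):
        # Raised when the object's class cannot be looked up by name.
        return False
    return True
```

When this is saved as an importable module, can_pickle(ModuleLevelNode()) is expected to succeed while can_pickle(make_local_node()) is not. The module that calls pickle.load also needs to be able to import the class.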

Andy 9 years, 8 months ago  # | flag

I ended up just using lxml.objectify since that solves the problem below. Or maybe somebody could explain what I was doing wrong.

TEST A: One title. XML result: <titles><title>test</title></titles>

Objectified:

(Pdb) p d['obj']['titles']['title']
u'test'

TEST B: Two titles. XML result: <titles><title>test01</title><title>test02</title></titles>

Objectified:

(Pdb) p d['obj']['titles']['title'] 
[{data:u'test01'}, {data:u'test02'}]

TEMPLATE

{% for title in obj.titles.title %}
  <tr><td>Title:</td><td>{{ title }}</td></tr>
{% endfor %}

Template output for TEST A:

Title:     t
Title:     e
Title:     s
Title:     t

Template output for TEST B:

Title:     test01
Title:     test02

I tried {% for title in obj.titles %} and this just returned nothing.

Andy
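
A likely explanation for TEST A, going by the recipe's endElement: an element with only text and no attributes is stored as a plain string, so a single <title> comes back as u'test', and a template loop over a string visits one character at a time. One way to get uniform behaviour is to normalise the value before iterating. The helper below is a hypothetical sketch (as_list is not part of the recipe) that would be applied in the view before the value reaches the template.

```python
def as_list(value):
    """Normalise a recipe value so it can always be iterated as a list.

    None (missing element) -> []
    list (repeated element) -> returned unchanged
    anything else (single string or single node) -> [value]
    """
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


single = as_list('test')                  # one <title>
several = as_list(['test01', 'test02'])   # two <title> elements
missing = as_list(None)                   # no <title> at all
```

Passing as_list(obj.titles.title) to the template would give the loop one item per title in both test cases.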