This simple method constructs a Python data structure from XML in one step. Data is accessed using the Pythonic "object.attribute" notation. See the discussion below for usage examples.
import re
import xml.sax.handler

def xml2obj(src):
    """
    A simple function to convert XML data into a native Python object.
    """

    non_id_char = re.compile('[^_0-9a-zA-Z]')
    def _name_mangle(name):
        return non_id_char.sub('_', name)

    class DataNode(object):
        def __init__(self):
            self._attrs = {}    # XML attributes and child elements
            self.data = None    # child text data
        def __len__(self):
            # treat single element as a list of 1
            return 1
        def __getitem__(self, key):
            if isinstance(key, basestring):
                return self._attrs.get(key,None)
            else:
                return [self][key]
        def __contains__(self, name):
            return self._attrs.has_key(name)
        def __nonzero__(self):
            return bool(self._attrs or self.data)
        def __getattr__(self, name):
            if name.startswith('__'):
                # need to do this for Python special methods???
                raise AttributeError(name)
            return self._attrs.get(name,None)
        def _add_xml_attr(self, name, value):
            if name in self._attrs:
                # multiple attributes of the same name are represented by a list
                children = self._attrs[name]
                if not isinstance(children, list):
                    children = [children]
                    self._attrs[name] = children
                children.append(value)
            else:
                self._attrs[name] = value
        def __str__(self):
            return self.data or ''
        def __repr__(self):
            items = sorted(self._attrs.items())
            if self.data:
                items.append(('data', self.data))
            return u'{%s}' % ', '.join([u'%s:%s' % (k,repr(v)) for k,v in items])

    class TreeBuilder(xml.sax.handler.ContentHandler):
        def __init__(self):
            self.stack = []
            self.root = DataNode()
            self.current = self.root
            self.text_parts = []
        def startElement(self, name, attrs):
            self.stack.append((self.current, self.text_parts))
            self.current = DataNode()
            self.text_parts = []
            # xml attributes --> python attributes
            for k, v in attrs.items():
                self.current._add_xml_attr(_name_mangle(k), v)
        def endElement(self, name):
            text = ''.join(self.text_parts).strip()
            if text:
                self.current.data = text
            if self.current._attrs:
                obj = self.current
            else:
                # a text only node is simply represented by the string
                obj = text or ''
            self.current, self.text_parts = self.stack.pop()
            self.current._add_xml_attr(_name_mangle(name), obj)
        def characters(self, content):
            self.text_parts.append(content)

    builder = TreeBuilder()
    if isinstance(src,basestring):
        xml.sax.parseString(src, builder)
    else:
        xml.sax.parse(src, builder)
    return builder.root._attrs.values()[0]
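The recipe as posted targets Python 2 (it relies on basestring, has_key(), __nonzero__ and indexing the result of dict.values()). For readers on Python 3, the following is a minimal sketch of a port, not the author's code. It assumes Python 3.5 or later, where xml.sax.parseString() accepts str as well as bytes, and it moves DataNode and TreeBuilder to module level, which is a change from the recipe.

```python
import re
import xml.sax
import xml.sax.handler

_NON_ID_CHAR = re.compile(r'[^_0-9a-zA-Z]')


def _name_mangle(name):
    # Replace characters that cannot appear in a Python identifier.
    return _NON_ID_CHAR.sub('_', name)


class DataNode(object):
    def __init__(self):
        self._attrs = {}    # XML attributes and child elements
        self.data = None    # child text data

    def __len__(self):
        return 1            # a single element acts as a list of 1

    def __getitem__(self, key):
        if isinstance(key, str):            # Python 2 version: basestring
            return self._attrs.get(key)
        return [self][key]

    def __contains__(self, name):
        return name in self._attrs          # Python 2 version: has_key()

    def __bool__(self):                     # Python 2 version: __nonzero__
        return bool(self._attrs or self.data)

    def __getattr__(self, name):
        if name.startswith('__'):
            raise AttributeError(name)
        return self._attrs.get(name)

    def _add_xml_attr(self, name, value):
        if name in self._attrs:
            # repeated names are collected into a list
            children = self._attrs[name]
            if not isinstance(children, list):
                children = [children]
                self._attrs[name] = children
            children.append(value)
        else:
            self._attrs[name] = value

    def __str__(self):
        return self.data or ''

    def __repr__(self):
        items = sorted(self._attrs.items())
        if self.data:
            items.append(('data', self.data))
        return '{%s}' % ', '.join('%s:%r' % (k, v) for k, v in items)


class TreeBuilder(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.root = DataNode()
        self.current = self.root
        self.text_parts = []

    def startElement(self, name, attrs):
        self.stack.append((self.current, self.text_parts))
        self.current = DataNode()
        self.text_parts = []
        for k, v in attrs.items():
            self.current._add_xml_attr(_name_mangle(k), v)

    def endElement(self, name):
        text = ''.join(self.text_parts).strip()
        if text:
            self.current.data = text
        # a text-only node is represented by the plain string
        obj = self.current if self.current._attrs else text
        self.current, self.text_parts = self.stack.pop()
        self.current._add_xml_attr(_name_mangle(name), obj)

    def characters(self, content):
        self.text_parts.append(content)


def xml2obj(src):
    builder = TreeBuilder()
    if isinstance(src, (str, bytes)):
        xml.sax.parseString(src, builder)
    else:
        xml.sax.parse(src, builder)
    # dict views cannot be indexed in Python 3, so build a list first
    return list(builder.root._attrs.values())[0]
```

Apart from the Python 3 spellings noted in the comments, the logic follows the listing above line by line.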
XML is a popular means of encoding data shared between systems. Despite its ubiquity, there is no straightforward way to translate XML into a Python data structure. Traditional APIs like DOM and SAX often require an undue amount of work to access the simplest piece of data.
This method converts XML data into a natural Pythonic data structure. For example:
>>> SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
... <address_book>
... <person gender='m'>
... <name>fred</name>
... <phone type='home'>54321</phone>
... <phone type='cell'>12345</phone>
... <note>"A<!-- comment --><![CDATA[ <note>]]>"</note>
... </person>
... </address_book>
... """
>>> address_book = xml2obj(SAMPLE_XML)
>>> person = address_book.person
To access its data, you can do the following:
person.gender         -> 'm'         # an attribute
person['gender']      -> 'm'         # alternative dictionary syntax
person.name           -> 'fred'      # shortcut to a text node
person.phone[0].type  -> 'home'      # multiple elements become a list
person.phone[0].data  -> '54321'     # use .data to get the text value
str(person.phone[0])  -> '54321'     # alternative syntax for the text value
person[0]             -> person      # if there is only one <person>, it can still
                                     # be used as if it were a list of 1 element
'address' in person   -> False       # test for existence of an attribute or child
person.address        -> None        # a nonexistent element returns None
bool(person.address)  -> False       # has any 'address' data (attr, child or text)
person.note           -> '"A <note>"'
This function is inspired by David Mertz's Gnosis objectify utilities. The motivation for writing this recipe is simplicity. With fewer than 100 lines of code packaged into a single function, it can easily be embedded in other code for ease of distribution.
known issues.
A small nit. If your XML data has an attribute whose name is a Python keyword, attribute access is not going to work. For example, an attribute named "print" cannot be read as obj.print, because print is a keyword in Python 2.
You could fix this with a little work, say, by wrapping attributes in an XMLAttr class or something similar. Or you could map names like "print" to Python attributes like "_print". Or you can simply accept that this is a limitation of this recipe. :-)
Overall, I think the second and third solutions are better than the first.
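The second option could be sketched as follows. This is an illustration rather than part of the recipe: name_mangle is a hypothetical replacement for the recipe's _name_mangle, and it uses the standard library's keyword.iskeyword() to decide which names need a leading underscore. The example uses "class" because print is no longer a keyword in Python 3.

```python
import keyword
import re

_NON_ID_CHAR = re.compile(r'[^_0-9a-zA-Z]')


def name_mangle(name):
    """Turn an XML name into something usable as a Python attribute name."""
    mangled = _NON_ID_CHAR.sub('_', name)
    if keyword.iskeyword(mangled):
        # e.g. 'class' -> '_class', so that obj._class is valid syntax
        mangled = '_' + mangled
    return mangled


print(name_mangle('class'))      # a keyword, gets the underscore prefix
print(name_mangle('xml:lang'))   # the colon is replaced as in the recipe
```

With this change an XML attribute named class would be read as obj._class. The dictionary syntax obj['class'] would then also need the mangled key, which is one argument for the third option.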
use dictionary syntax.
Support iteration. Fixed __getitem__() to better support iteration
Multiple items. One thing about this that I find concerning is the possibility of having a schema (just in the abstract sense, some structure in mind) where some element can have multiple children of the same name, but where that number could just as easily be one. It seems that in this situation, any code that uses this recipe will have to check whether or not the value is a list every time it accesses such a structure.
Take the phone tag in your example. If I were using this to insert into a database, I'd always want to get the phone numbers as a list, even if there were only one. (And it seems pretty silly to assume that everyone will have at least two.) Also, what about the reverse: you're only expecting one value for some element, but an improperly constructed file gives multiple. I suppose you could solve both of these with isinstance() idioms on a case-by-case basis, but it seems like that would get tedious.
Can you think of an elegant, Pythonic solution to this? I actually encounter this problem all the time when parsing similar data structures (GET query strings, INI-style configuration files, etc.) and I have yet to find a solution I'm completely happy with.
It becomes a list of 1. Hi Adam. I hear you. That's why it has some magic to treat a single element as a list of 1. For example, there is only 1 person in this XML message. But you can do:
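A minimal sketch of that behaviour, using a hypothetical stand-in class (Node) that keeps only the two DataNode methods involved. Note that the recipe represents a text-only child as a plain string, so the list-of-1 magic does not cover it, and iterating over a string yields characters. The as_list helper below is an assumed workaround for that case, not part of the recipe.

```python
class Node(object):
    """Hypothetical stand-in keeping only DataNode's sequence behaviour."""
    def __init__(self, name):
        self.name = name

    def __len__(self):
        return 1

    def __getitem__(self, index):
        # Index a one-element list holding this node; [self][1] raises
        # IndexError, which is what ends a for-loop over the node.
        return [self][index]


def as_list(value):
    """Normalise a child value to a list (hypothetical helper)."""
    if value is None:
        return []           # missing element
    if isinstance(value, list):
        return value        # several elements of the same name
    return [value]          # single node or text-only string


person = Node('fred')
names = [p.name for p in person]    # the loop body runs exactly once
phones = as_list('54321')           # a text-only child wrapped in a list
```

Wrapping every multi-valued access in as_list() gives the caller a list in all three cases (missing, single, repeated) without case-by-case isinstance() checks.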
If you get the error: TypeError: 'DataNode' object does not support item assignment. A simple fix...
In rare cases, you may want to set an item back into the data structure.
This worked for me (add it to DataNode and fix the indentation):

def __setitem__(self, key, value):
    self._attrs[key] = value

BTW, this is one of the best XML-to-object mapping snippets I have found. The array handling is particularly nice.
If you are a Perl programmer looking for a Python equivalent of XML::Simple, this is the closest I have seen.
I wish to read in an xml, extract data and put data in a .dbf. Can anyone help?
Regards. davidgshi@yahoo.co.uk
Very neat and much cleaner than anything else I found - thank you!
I found this useful. However, I want to edit the python object and then translate it back to xml. But in this case, although editing the python object is easy, there is no way to translate it back to xml form. Do you have something like this? cheers
Sorry Sammy, there is no XML generation. One of the 'features' of this recipe is that it has only 84 lines. Reading is all it does.
Whenever I try this I get the following error:

    xml.sax.parseString(src, builder)
  File "/usr/lib/python2.6/xml/sax/__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "/usr/lib/python2.6/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.6/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.6/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.6/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:2: not well-formed (invalid token)

Am I missing something? The XML file is UTF-8 encoded. Other than that, I simply downloaded the script and ran it.
Is there a way to convert the DataNode object back into a standard python Dictionary object? Would like to plug the object back into pymongo. Thanks!
Hi, thank you for the great code!
I need to read the XML from a file, like nodes.xml, and pass the path or filename as a command-line argument, like this:
python myscript.py nodes.xml
but I don't know how to change your code to do that.
Could you help me please?
Thank you in advance!
Bye,
enrico
I had to do a bunch of troubleshooting because this was not working out of the box on the XML I was working with.
I found that I had to change the following: return builder.root._attrs.values()[0] to this: return builder.root._attrs in order to get the function to return the object.
I hope this helps other people. Thank you for this code...
Will, can you provide a snippet of XML that this script does not handle well?
enricostano, you can do this
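A minimal sketch of one way to do it, assuming the recipe is used unchanged. Because xml2obj treats every string as XML text, a file name has to be opened first and the file object passed in. The names load_xml_file and main, and the parse parameter (which stands for xml2obj), are hypothetical.

```python
import sys


def load_xml_file(path, parse):
    """Open `path` in binary mode and hand the file object to `parse`.

    `parse` is expected to be the recipe's xml2obj; it is passed in as a
    parameter only so that this sketch does not depend on the recipe.
    """
    with open(path, 'rb') as f:
        return parse(f)


def main(argv, parse):
    # argv is expected to look like ['myscript.py', 'nodes.xml']
    if len(argv) != 2:
        sys.stderr.write('usage: python myscript.py nodes.xml\n')
        return None
    return load_xml_file(argv[1], parse)
```

With the recipe in the same file, the script would end with something like address_book = main(sys.argv, xml2obj).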
Ming-Chih Kao,
You can walk the XML tree via its _attrs attribute. This makes me think it may be a good idea to add an iterator function to the script.
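Walking _attrs recursively could look like the sketch below, which turns a node tree into plain dictionaries, lists and strings. The to_plain function is hypothetical, and the DataNode class here is only a stand-in with the same two fields (_attrs and data) as the recipe's class. The check uses hasattr() because the recipe defines DataNode inside xml2obj, so it cannot be imported for an isinstance() test.

```python
class DataNode(object):
    """Hypothetical stand-in with the two fields the recipe's DataNode has."""
    def __init__(self, attrs=None, data=None):
        self._attrs = attrs or {}
        self.data = data


def to_plain(value):
    """Recursively convert nodes to dicts; lists and strings pass through."""
    if isinstance(value, list):
        return [to_plain(item) for item in value]
    if hasattr(value, '_attrs'):
        result = dict((k, to_plain(v)) for k, v in value._attrs.items())
        if value.data:
            # Text content goes under 'data'; this would overwrite an XML
            # attribute or child that is itself named 'data'.
            result['data'] = value.data
        return result
    return value    # text-only nodes are already plain strings


person = DataNode({
    'gender': 'm',
    'name': 'fred',
    'phone': [DataNode({'type': 'home'}, '54321'),
              DataNode({'type': 'cell'}, '12345')],
})
plain = to_plain(person)
```

The result contains only built-in types, so it can be passed to code that expects a standard dictionary.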
Hi Wai Yip Tung, Sorry, I spoke too soon. I have been working with it again this morning and I have found that what is returned is usable. My problem was that I was trying to visualize the output while I was developing and it was not showing anything.
In my GAE dev sandbox I was trying:

result = xml2obj(SAMPLE_XML)
import logging
logging.debug(result)

If I made the change I mentioned previously, I was able to see the structure (not pretty, but viewable) because it was of type 'dict', whereas everything else is of type 'DataNode' (which does not display).
I tried to change 'class DataNode(object):' to 'class DataNode(dict):' to see if that would allow logging.debug to view the data, but I did not troubleshoot very much when it did not work.
I will be working with this script quite a bit in the next little while, so I will probably add some more functionality. The first thing that comes to mind is support for '__keys__'.
I apologize for assuming your code was wrong when it did not react the way I expected it to. This is a great piece of code and I appreciate the work you have done on it. I will share any features I add...
So far all I have added to DataNode is:

def keys(self):
    return self._attrs.keys()

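A likely reason the logged node looked empty, judging from the recipe's code: logging.debug(result) formats its argument with str(), and DataNode.__str__ returns only the node's own text, which is empty for an element that holds nothing but attributes and children. The sketch below reproduces this with a hypothetical stand-in (Node) that copies those two methods. Logging with the %r format shows the structure instead.

```python
import logging


class Node(object):
    """Hypothetical stand-in mirroring DataNode.__str__ and __repr__."""
    def __init__(self, attrs, data=None):
        self._attrs = attrs
        self.data = data

    def __str__(self):
        return self.data or ''

    def __repr__(self):
        items = sorted(self._attrs.items())
        if self.data:
            items.append(('data', self.data))
        return '{%s}' % ', '.join('%s:%r' % (k, v) for k, v in items)


node = Node({'gender': 'm', 'name': 'fred'})
as_text = str(node)         # empty: the node has no text of its own
as_repr = repr(node)        # shows attributes and children
logging.debug('%r', node)   # logs the repr rather than the empty str
```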
Hi Wai Yip Tung, Thanks for this excellent program. How can I save the resulting data structure in a file and restore it later? This would prevent the program from parsing the XML at each startup. If I use pickle I get this error message (DataNode is not found):
as xml_reader.DataNode
where xml_reader contains the xml2obj() function. Is there any other way to save the resulting structure? Thanks, Mat
Dear Wai Yip Tung, I think I sorted the problem out. I included an instance of the DataNode class in the file containing pickle.load(myfile).
Mat. PS: I'm quite a newbie in Python, sorry!
Hi Wai Yip Tung, great program. I'm using xmlParser with Django. I want to store these XML objects as part of the Django session (which pickles the data).
I'm getting the following error: Can't pickle <class 'kontikiLogparser.xmlParse.DataNode'>: attribute lookup kontikiLogparser.xmlParse.DataNode failed
From the Django views I'm using a module which imports another module which uses xmlParser.
How do I fix this? It looks similar to the problem "Mat" faced in the previous comment, but I'm not quite sure how he fixed it.
Dear Wai Yip Tung, I think I found the problem. I moved the class DataNode(object) definition out of the function xml2obj(src). It works now.
I ended up just using lxml.objectify since that solves the problem below. Or maybe somebody could explain what I was doing wrong.
TEST A: One title. XML result:
<titles><title>test</title></titles>
Objectified:
TEST B: Two titles. XML result:
<titles><title>test01</title><title>test02</title></titles>
Objectified:
TEMPLATE
Template output for TEST A:
Template output for TEST B:
I tried
{% for title in obj.titles %}
and this just returned nothing.
Andy