This code de-serializes XML into a Python data structure.
This is one part of a trio of recipes:
For more information
'''
XML2Py - XML to Python de-serialization

This code transforms an XML document into a Python data structure

Usage:
    deserializer = XML2Py()
    python_object = deserializer.parse( xml_string )
    print( xml_string )
    print( python_object )
'''

from lxml import etree


class XML2Py():

    def __init__( self ):
        # drop whitespace-only text nodes so indentation is not treated as content
        self._parser = etree.XMLParser( remove_blank_text=True )
        self._root = None  # root of etree structure
        self.data = None   # where we store the processed Python structure

    def parse( self, xmlString ):
        '''
        processes XML string into Python data structure
        '''
        self._root = etree.fromstring( xmlString, self._parser )
        self.data = self._parseXMLRoot()
        return self.data

    def tostring( self ):
        '''
        creates a string representation using our etree object
        '''
        if self._root is not None:
            return etree.tostring( self._root )

    def _parseXMLRoot( self ):
        '''
        starts processing, takes care of first level idiosyncrasies
        '''
        childDict = self._parseXMLNode( self._root )
        return { self._root.tag : childDict["children"] }

    def _parseXMLNode( self, element ):
        '''
        rest of the processing
        '''
        childContainer = None  # either Dict or List

        # process any tag attributes
        # if we have attributes then the child container is a Dict
        #   otherwise a List
        if element.items():
            childContainer = {}
            childContainer.update( dict( element.items() ) )
        else:
            childContainer = []

        if isinstance( childContainer, list ) and element.text:
            # tag with no attributes and one that contains text
            childContainer.append( element.text )

        else:
            # tag might have children, let's process them
            # ( iterating an element yields its direct children )
            for child_elem in element:

                childDict = self._parseXMLNode( child_elem )

                # let's store our child based on container type
                #
                if isinstance( childContainer, dict ):
                    # these children are lone tag entities ( eg, 'copyright' )
                    childContainer.update( { childDict["tag"] : childDict["children"] } )

                else:
                    # these children are repeated tag entities ( eg, 'format' )
                    childContainer.append( childDict["children"] )

        return { "tag": element.tag, "children": childContainer }


def main():

    xml_string = '''
    <documents>
        <document date="June 6, 2009" title="The Newness of Python" author="John Doe">
            <copyright type="CC" url="http://www.creativecommons.org/" date="June 24, 2009" />
            <text>Python is very nice. Very, very nice.</text>
            <formats>
                <format type="pdf">
                    <info uri="http://www.python.org/newness-of-python.pdf" pages="245" />
                </format>
                <format type="web">
                    <info uri="http://www.python.org/newness-of-python.html" />
                </format>
            </formats>
        </document>
    </documents>
    '''

    deserializer = XML2Py()
    python_object = deserializer.parse( xml_string )
    print( xml_string )
    print( python_object )


if __name__ == '__main__':
    main()
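To make the container rules concrete, here is a minimal sketch of the same traversal that uses only the standard library's xml.etree.ElementTree instead of lxml. The names parse_node and xml_to_py are hypothetical, and stripping whitespace-only text is an assumption that stands in for lxml's remove_blank_text=True. Treat it as an approximation of the recipe for experimenting, not as a replacement for it.

```python
import xml.etree.ElementTree as ET


def parse_node(element):
    """Return (tag, container) for one element (hypothetical helper)."""
    # Attributes present -> dict container; otherwise -> list container.
    container = dict(element.attrib) if element.attrib else []

    # Assumption: ignore whitespace-only text, roughly like remove_blank_text.
    text = (element.text or "").strip()

    if isinstance(container, list) and text:
        # Tag with no attributes that contains text.
        container.append(text)
    else:
        for child in element:
            child_tag, child_container = parse_node(child)
            if isinstance(container, dict):
                # Lone child entities are stored under their tag name.
                container[child_tag] = child_container
            else:
                # Repeated child entities are appended; their tag is not kept.
                container.append(child_container)

    return element.tag, container


def xml_to_py(xml_string):
    """Deserialize an XML string into nested dicts and lists."""
    root_tag, children = parse_node(ET.fromstring(xml_string))
    return {root_tag: children}


SAMPLE = """
<documents>
  <document date="June 6, 2009" title="The Newness of Python" author="John Doe">
    <text>Python is very nice. Very, very nice.</text>
    <formats>
      <format type="pdf">
        <info uri="http://www.python.org/newness-of-python.pdf" pages="245" />
      </format>
      <format type="web">
        <info uri="http://www.python.org/newness-of-python.html" />
      </format>
    </formats>
  </document>
</documents>
"""

if __name__ == "__main__":
    result = xml_to_py(SAMPLE)
    print(result["documents"][0]["text"])  # ['Python is very nice. Very, very nice.']
```

Elements with attributes (such as 'document' and 'format') come back as dicts, while attribute-less elements (such as 'documents' and 'formats') come back as lists, so the tag names of list members are not stored in the result.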
Tags: deserialize, xml
Hello, I am a newbie to Python, so please pardon my dumb questions. I tried using your script to deserialize XML into a Python dictionary, but it didn't work for me.
I used your recipe for serializing a dict to XML and got the following XML string as its output:
'<a a="1" /><b b="2" />'
Now, when I try to deserialize the above XML string, I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "xml2py2.py", line 20, in parse
    self._root = etree.fromstring( xmlString, self._parser )
  File "lxml.etree.pyx", line 2385, in lxml.etree.fromstring (src/lxml/lxml.etree.c:23581)
  File "parser.pxi", line 1359, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:57605)
  File "parser.pxi", line 1246, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:56594)
  File "parser.pxi", line 798, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:53794)
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:50830)
  File "parser.pxi", line 536, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:51661)
  File "parser.pxi", line 478, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:51079)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 1, column 7
I tried looking at lines 20 and 1 but couldn't make much sense of them. Could you please help me understand the error and/or bug?
Thank you,
Ankit
this solution doesn't work for just any arbitrary Python DICT. it must follow a certain structure. two things to mention about the one you have supplied.
the XML created from your input DICT isn't well-formed XML. a well-formed XML document must have a single node at the root which wraps all data. here's an example with 'data' as the root node:
<data><a a="1"/><b b="2"/></data>
i'm sure that this is causing the error you're seeing.
according to my solution, any parent/child nodes share the same name except that the parent is the "plural" form of the name. using your data, this might be:
<nodes><node a="1"/><node b="2"/></nodes>
this is mentioned in the main doc page and can be seen in the example XML/DICT listed in the code: http://code.activestate.com/recipes/577266/
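The two points above can be sketched with the standard library's xml.etree.ElementTree (used here instead of lxml so the example has no dependencies). The helpers wrap_in_root, is_well_formed, and follows_plural_convention are hypothetical names, and the "plural" test is a naive assumption that the parent tag is simply the child tag plus "s". The recipe itself does not perform this check.

```python
import xml.etree.ElementTree as ET


def wrap_in_root(fragment, root_tag="data"):
    """Wrap sibling elements in a single root element (hypothetical helper)."""
    return "<{0}>{1}</{0}>".format(root_tag, fragment)


def is_well_formed(xml_string):
    """Return True if the string parses as a single XML document."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


def follows_plural_convention(parent):
    """Naive check: all children share one tag and parent tag is that tag + 's'."""
    child_tags = {child.tag for child in parent}
    if len(child_tags) != 1:
        return False
    return parent.tag == child_tags.pop() + "s"


if __name__ == "__main__":
    fragment = '<a a="1" /><b b="2" />'

    # Two top-level elements: rejected by the parser.
    print(is_well_formed(fragment))                # False

    # The same elements under one root: accepted.
    wrapped = wrap_in_root(fragment)
    print(is_well_formed(wrapped))                 # True

    # Well-formed, but the children 'a' and 'b' do not match the naming scheme.
    print(follows_plural_convention(ET.fromstring(wrapped)))   # False

    nodes = ET.fromstring('<nodes><node a="1"/><node b="2"/></nodes>')
    print(follows_plural_convention(nodes))        # True
```

Wrapping the fragment only fixes the parse error. For the deserialized structure to round-trip with the companion recipe, the element names would still need to follow the parent/child naming scheme described above.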