This code de-serializes XML into a Python data structure.

This is one part of a trio of recipes:

For more information

See XML to Python data structure Recipe #577266

Python, 109 lines
'''
XML2Py - XML to Python de-serialization

This code transforms an XML document into a Python data structure

Usage:
    deserializer = XML2Py()
    python_object = deserializer.parse( xml_string )
    print( xml_string )
    print( python_object )
'''

from lxml import etree

class XML2Py():

    def __init__( self ):

        self._parser = etree.XMLParser( remove_blank_text=True )
        self._root = None  # root of etree structure
        self.data = None   # where we store the processed Python structure

    def parse( self, xmlString ):
        '''
        processes XML string into Python data structure
        '''
        self._root = etree.fromstring( xmlString, self._parser )
        self.data = self._parseXMLRoot()
        return self.data

    def tostring( self ):
        '''
        creates a string representation using our etree object
        '''
        if self._root is not None:
            return etree.tostring( self._root )

    def _parseXMLRoot( self ):
        '''
        starts processing, takes care of first-level idiosyncrasies
        '''
        childDict = self._parseXMLNode( self._root )
        return { self._root.tag : childDict["children"] }

    def _parseXMLNode( self, element ):
        '''
        rest of the processing
        '''
        childContainer = None # either Dict or List

        # process any tag attributes
        # if we have attributes then the child container is a Dict
        #   otherwise a List
        if element.items():
            childContainer = {}
            childContainer.update( dict( element.items() ) )
        else:
            childContainer = []


        if isinstance( childContainer, list ) and element.text:
            # tag with no attributes and one that contains text
            childContainer.append( element.text )

        else:
            # tag might have children, let's process them
            # iterate the element directly; getchildren() is deprecated in lxml
            for child_elem in element:

                childDict = self._parseXMLNode( child_elem )

                # let's store our child based on container type
                #
                if isinstance( childContainer, dict ):
                    # these children are lone tag entities ( eg, 'copyright' )
                    childContainer.update( { childDict["tag"] : childDict["children"] } )

                else:
                    # these children are repeated tag entities ( eg, 'format' )
                    childContainer.append( childDict["children"] )

        return { "tag":element.tag, "children": childContainer }


def main():

    xml_string = '''
    <documents>
        <document date="June 6, 2009" title="The Newness of Python" author="John Doe">
            <copyright type="CC" url="http://www.creativecommons.org/" date="June 24, 2009" />
            <text>Python is very nice. Very, very nice.</text>
            <formats>
                <format type="pdf">
                    <info uri="http://www.python.org/newness-of-python.pdf" pages="245" />
                </format>
                <format type="web">
                    <info uri="http://www.python.org/newness-of-python.html" />
                </format>
            </formats>
        </document>
    </documents>
    '''
    deserializer = XML2Py()
    python_object = deserializer.parse( xml_string )
    print( xml_string )
    print( python_object )


if __name__ == '__main__':
    main()
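
The container rules in the listing can be sketched without lxml. What follows is a minimal sketch, not the recipe's code: it assumes the standard library's xml.etree.ElementTree as a stand-in for lxml, the names parse_node and xml_to_py are hypothetical, and the whitespace check is an assumption that approximates remove_blank_text=True, which the stdlib parser does not offer. The sample is a trimmed version of the XML in main().

```python
import xml.etree.ElementTree as ET


def parse_node(element):
    """Hypothetical stdlib mirror of the recipe's _parseXMLNode."""
    # Attributes present -> dict container; otherwise -> list container.
    container = dict(element.attrib) if element.attrib else []

    # Assumption: the stdlib parser keeps whitespace-only text, so treat it
    # as "no text" to approximate lxml's remove_blank_text=True.
    has_text = bool(element.text and element.text.strip())

    if isinstance(container, list) and has_text:
        # Attribute-less tag that holds text: store the text itself.
        container.append(element.text)
    else:
        for child in element:
            tag, value = parse_node(child)
            if isinstance(container, dict):
                container[tag] = value   # lone child, keyed by its tag name
            else:
                container.append(value)  # repeated child, tag name dropped
    return element.tag, container


def xml_to_py(xml_string):
    """Wrap the root's container in a one-key dict, as _parseXMLRoot does."""
    tag, value = parse_node(ET.fromstring(xml_string))
    return {tag: value}


sample = """
<documents>
    <document title="The Newness of Python" author="John Doe">
        <text>Python is very nice. Very, very nice.</text>
        <formats>
            <format type="pdf"><info pages="245" /></format>
            <format type="web"><info uri="http://www.python.org/newness-of-python.html" /></format>
        </formats>
    </document>
</documents>
"""

result = xml_to_py(sample)
print(result)
```

Elements with attributes come back as dicts keyed by child tag, and elements without attributes come back as lists, so here `documents` and `formats` are lists while `document` and each `format` are dicts. Note that an element with both attributes and text keeps only its attributes and children under these rules.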

2 comments

Ankit Master 13 years, 8 months ago

Hello, I am a newbie to Python, so please pardon my dumb questions. I tried using your script to deserialize XML into a Python dictionary, but it didn't work for me.

I used your recipe for serializing a dict to XML and got the following XML string as its output:

    >>> d={'a':{'a':1},'b':{'b':2}}

    '<a a="1" /><b b="2" />'

Now, when I tried to deserialize the above XML string, I got the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "xml2py2.py", line 20, in parse
        self._root = etree.fromstring( xmlString, self._parser )
      File "lxml.etree.pyx", line 2385, in lxml.etree.fromstring (src/lxml/lxml.etree.c:23581)
      File "parser.pxi", line 1359, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:57605)
      File "parser.pxi", line 1246, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:56594)
      File "parser.pxi", line 798, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:53794)
      File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:50830)
      File "parser.pxi", line 536, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:51661)
      File "parser.pxi", line 478, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:51079)
    lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 1, column 7

I tried looking at lines 20 and 1 but couldn't make much sense of them. Could you please help me understand the error and/or bug?

Thank you, Ankit

David McCuskey (author) 9 years, 7 months ago

this solution doesn't work for just any arbitrary Python dict; it must follow a certain structure. two things to mention about the one you supplied:

  1. the XML created from your input DICT isn't valid XML. a valid XML structure must have a single node at the root which wraps all data. here's an example with 'data' as the root node:

    <data><a a="1"/><b b="2"/></data>

i'm sure that this is causing the error you're seeing.

  2. according to my solution, any parent/child nodes share the same name except that the parent is the "plural" form of the name. using your data, this might be:

    <nodes><node a="1"/><node b="2"/></nodes>

this is mentioned in the main doc page and can be seen in the example XML/DICT listed in the code: http://code.activestate.com/recipes/577266/
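
The single-root point can be checked with a short sketch. It assumes the standard library's xml.etree.ElementTree rather than lxml (both reject a document with more than one top-level element, though the exception types and messages differ), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The string from the question: two top-level elements, no single root.
fragment = '<a a="1" /><b b="2" />'

try:
    ET.fromstring(fragment)
    parsed_ok = True
except ET.ParseError as err:
    # The stdlib parser refuses content after the first top-level element.
    parsed_ok = False
    print("rejected:", err)

# Wrapping the same content in one root element makes it parseable.
wrapped = "<data>" + fragment + "</data>"
root = ET.fromstring(wrapped)
child_tags = [child.tag for child in root]
print(child_tags)
```

Wrapping fixes the parse error only. Under the recipe's rules the attribute-less `data` root becomes a list, so the child tag names `a` and `b` are not kept, which is why the plural/singular naming in the second point matters.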

Created by David McCuskey on Wed, 16 Jun 2010 (MIT)