
Programs that add XML markup to a plain-text document often fail to preserve the original whitespace. This recipe computes the character offsets that each XML element would have had in the original document.

Python, 50 lines
def align(tree, text):
    """Aligns each ElementTree element with its offsets in the text.

    Returns a list of (element, start, stop) tuples.

    Keyword Arguments:
    tree -- The root Element of a parsed XML document (not an
        ElementTree wrapper object)
    text -- The text to which the XML should be aligned. The text and
        XML should only differ in the presence or absence of XML
        elements and whitespace.
    """

    def align_helper(elem, elem_start):
        # skip whitespace in the text before the element
        while text[elem_start:elem_start + 1].isspace():
            elem_start += 1

        # advance the element end past any element text            
        elem_end = elem_start
        if elem.text is not None:
            for char in elem.text:
                if not char.isspace():
                    while text[elem_end:elem_end + 1].isspace():
                        elem_end += 1
                    assert text[elem_end:elem_end + 1] == char
                    elem_end += 1

        # advance the element end past any child elements
        for child_elem in elem:
            elem_end = align_helper(child_elem, elem_end)

        # advance the start for the next element past the tail text
        next_start = elem_end
        if elem.tail is not None:
            for char in elem.tail:
                if not char.isspace():
                    while text[next_start:next_start + 1].isspace():
                        next_start += 1
                    assert text[next_start:next_start + 1] == char
                    next_start += 1

        # add the element and its start and end to the result list
        result.append((elem, elem_start, elem_end))

        # return the start of the next element        
        return next_start

    result = []
    align_helper(tree, 0)
    return result
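
The element-text and tail-text loops in align_helper follow the same pattern: ignore whitespace on both sides and require every other character to match. As a minimal sketch, that step could be factored into a helper. The name `advance` and the use of ValueError in place of assert are assumptions made for illustration; they are not part of the original recipe.

```python
def advance(text, pos, fragment):
    """Return the offset in `text` just past the characters of `fragment`.

    Whitespace is ignored on both sides; every other character must match.
    `fragment` may be None, which ElementTree uses for missing text or tail.
    (Hypothetical helper, not part of the original recipe.)
    """
    for char in fragment or '':
        if char.isspace():
            # whitespace in the XML fragment carries no alignment information
            continue
        # skip whitespace in the original text before the next character
        while text[pos:pos + 1].isspace():
            pos += 1
        if text[pos:pos + 1] != char:
            raise ValueError('expected %r at offset %d, found %r'
                             % (char, pos, text[pos:pos + 1]))
        pos += 1
    return pos
```

With such a helper, the two loops would reduce to calls like `elem_end = advance(text, elem_start, elem.text)` and `next_start = advance(text, elem_end, elem.tail)`. Raising an exception rather than asserting keeps the mismatch check active even when Python runs with assertions disabled.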

Suppose the original text looks like this:

>>> plain_text = '''
... Pacific First Financial Corp. said shareholders approved its
... acquisition.
... '''

And your XML-producing program added markup like this:

>>> xml_text = '''   <s>Pacific First Financial Corp.
... <EVENT eid="e1" class="REPORTING" > said </EVENT> shareholders
... <EVENT eid="e2" class="OCCURRENCE" >approved</EVENT> its
... <EVENT eid="e8" class="OCCURRENCE" >    acquis ition </EVENT>.
... </s>
... '''

You don't want the reformatted text with its extra spaces and newlines; you want the original text, but you need to know where the XML elements would have been placed in it. The align function gives the following result (the element addresses will differ from run to run):

>>> import xml.etree.ElementTree as etree
>>> xml_tree = etree.fromstring(xml_text)
>>> align(xml_tree, plain_text)
[(<Element 'EVENT' at 00ADA638>, 31, 35),
(<Element 'EVENT' at 00ADA590>, 49, 57),
(<Element 'EVENT' at 00ADA5F0>, 62, 73),
(<Element 's' at 00ADAAE8>, 1, 74)]

Each element has now been aligned to the location it would have occupied in the original text. Child elements appear before their parent because an element is appended only after its children have been processed:

>>> for element, start, stop in align(xml_tree, plain_text):
...     if element.tag == 'EVENT':
...         print('%r %r' % (plain_text[start:stop], element.text))
...
'said' ' said '
'approved' 'approved'
'acquisition' '    acquis ition '
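
One possible use of the (element, start, stop) tuples is to turn them into standoff annotations that refer to the original text by offset. The following is a minimal sketch under that assumption; the function name `to_standoff` and the record layout are hypothetical and not part of the original recipe.

```python
import xml.etree.ElementTree as etree


def to_standoff(aligned, text):
    """Convert (element, start, stop) tuples into plain standoff records.

    Hypothetical helper: `aligned` is a list shaped like the one returned
    by align, and `text` is the original plain text.
    """
    records = []
    for element, start, stop in aligned:
        records.append({
            'tag': element.tag,
            'attrib': dict(element.attrib),
            'start': start,
            'stop': stop,
            'text': text[start:stop],
        })
    # order by start offset, longer spans first, so parents precede children
    records.sort(key=lambda r: (r['start'], -r['stop']))
    return records


if __name__ == '__main__':
    # offsets copied from the example output; the elements are built by hand
    plain_text = ('\nPacific First Financial Corp. said shareholders '
                  'approved its\nacquisition.\n')
    aligned = [(etree.Element('EVENT', eid='e1'), 31, 35),
               (etree.Element('s'), 1, 74)]
    for record in to_standoff(aligned, plain_text):
        print(record['tag'], record['start'], record['stop'])
```

Because the records hold only offsets and attributes, they can be stored or compared without touching the original text.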
Created by Steven Bethard on Fri, 19 Jan 2007 (PSF)