Often when a program adds some XML markup to a plain-text document, it doesn't retain the original whitespace formatting. This recipe determines the character offsets the XML elements should have had in the original document.
<pre>
def align(tree, text):
    """Aligns each ElementTree element with its offsets in the text.

    Returns a list of (element, start, stop) tuples.

    Keyword Arguments:
    tree -- An ElementTree for an XML document
    text -- The text to which the XML should be aligned. The text and
            XML should only differ in the presence or absence of XML
            elements and whitespace.
    """
    def align_helper(elem, elem_start):
        # skip whitespace in the text before the element
        while text[elem_start:elem_start + 1].isspace():
            elem_start += 1

        # advance the element end past any element text
        elem_end = elem_start
        if elem.text is not None:
            for i, char in enumerate(elem.text):
                if not char.isspace():
                    while text[elem_end:elem_end + 1].isspace():
                        elem_end += 1
                    assert text[elem_end:elem_end + 1] == char
                    elem_end += 1

        # advance the element end past any child elements
        for child_elem in elem:
            elem_end = align_helper(child_elem, elem_end)

        # advance the start for the next element past the tail text
        next_start = elem_end
        if elem.tail is not None:
            for i, char in enumerate(elem.tail):
                if not char.isspace():
                    while text[next_start:next_start + 1].isspace():
                        next_start += 1
                    assert text[next_start:next_start + 1] == char
                    next_start += 1

        # add the element and its start and end to the result list
        result.append((elem, elem_start, elem_end))

        # return the start of the next element
        return next_start

    result = []
    align_helper(tree, 0)
    return result
</pre>
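The two inner loops in align_helper, one over elem.text and one over elem.tail, perform the same step: walk through the original text, skipping whitespace on both sides, until every non-whitespace character of the XML fragment has been matched. As a minimal sketch, that step could be isolated as below. The helper name `advance` and the choice to raise ValueError instead of using assert are assumptions of this sketch, not part of the recipe.

```python
def advance(text, pos, fragment):
    """Return the offset in text just past the non-whitespace
    characters of fragment, starting the search at pos.

    Hypothetical helper: mirrors the text/tail loops in align_helper.
    """
    if fragment is None:
        return pos
    for char in fragment:
        if char.isspace():
            # whitespace in the XML fragment is ignored entirely
            continue
        # skip whitespace in the original text before the next match
        while text[pos:pos + 1].isspace():
            pos += 1
        if text[pos:pos + 1] != char:
            raise ValueError('expected %r at offset %d, found %r'
                             % (char, pos, text[pos:pos + 1]))
        pos += 1
    return pos


text = ('\nPacific First Financial Corp. said shareholders approved its\n'
        'acquisition.\n')
# tail text of the first EVENT element in the example markup;
# the result is the offset just past "shareholders"
print(advance(text, 35, ' shareholders\n'))  # prints 48
```

Raising an exception keeps the mismatch check active even when Python runs with assertions disabled (the -O flag), which may matter if the inputs are not fully trusted.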
Suppose the original text looks like this:
<pre>
>>> plain_text = '''
... Pacific First Financial Corp. said shareholders approved its
... acquisition.
... '''
</pre>
Your XML-producing program then added markup like this:
<pre>
>>> xml_text = ''' <s>Pacific First Financial Corp.
... <EVENT eid="e1" class="REPORTING" > said </EVENT> shareholders
... <EVENT eid="e2" class="OCCURRENCE" >approved</EVENT> its
... <EVENT eid="e8" class="OCCURRENCE" > acquis ition </EVENT>.
... </s>
... '''
</pre>
You don't want the reformatted text with its extra spaces and newlines. You want the original text, along with the positions at which the XML elements should have been placed in it. The align function computes those positions:
<pre>
>>> import xml.etree.ElementTree as etree
>>> xml_tree = etree.fromstring(xml_text)
>>> align(xml_tree, plain_text)
[(<Element 'EVENT' at 00ADA638>, 31, 35),
(<Element 'EVENT' at 00ADA590>, 49, 57),
(<Element 'EVENT' at 00ADA5F0>, 62, 73),
(<Element 's' at 00ADAAE8>, 1, 74)]
</pre>
Each element is now aligned to the span it would have occupied in the original text:
<pre>
>>> for element, start, stop in align(xml_tree, plain_text):
...     if element.tag == 'EVENT':
...         print('%r %r' % (plain_text[start:stop], element.text))
...
'said' ' said '
'approved' 'approved'
'acquisition' ' acquis ition '
</pre>
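Once the offsets are known, one possible use is to put the markup back into the original text without disturbing its whitespace. The following is a minimal sketch under stated assumptions: the function name `insert_tags` is hypothetical, spans are passed as (tag, start, stop) tuples rather than Element objects, attributes are dropped, and the spans are assumed to be non-empty and properly nested. The offsets are the ones reported by align above.

```python
def insert_tags(text, spans):
    """Return text with <tag>...</tag> inserted at the given offsets.

    spans -- list of (tag, start, stop) tuples, assumed non-empty and
             properly nested (no partial overlaps).
    """
    inserts = []
    for tag, start, stop in spans:
        length = stop - start
        # opening tags: at equal positions, longer (outer) spans first
        inserts.append((start, 1, -length, '<%s>' % tag))
        # closing tags: placed before any opening tag at the same
        # position, shorter (inner) spans first
        inserts.append((stop, 0, length, '</%s>' % tag))

    pieces = []
    last = 0
    for pos, _, _, markup in sorted(inserts):
        pieces.append(text[last:pos])  # untouched original text
        pieces.append(markup)
        last = pos
    pieces.append(text[last:])
    return ''.join(pieces)


plain_text = ('\nPacific First Financial Corp. said shareholders approved its\n'
              'acquisition.\n')
# (tag, start, stop) triples taken from the align output shown earlier
spans = [('EVENT', 31, 35), ('EVENT', 49, 57), ('EVENT', 62, 73), ('s', 1, 74)]
print(insert_tags(plain_text, spans))
```

Because the function only inserts markup and copies every slice of the original text unchanged, stripping the tags from the result gives back plain_text exactly.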