Solves the "tail problem" for prose-oriented XML. Recursively pulls text out of elements and their sub-elements. Extracts text from DocBook, XHTML, and other nested XML markup.

Python, 92 lines
def _textlist(self, _addtail=False):
    '''Returns a list of text strings contained within an element and its sub-elements.

    Helpful for extracting text from prose-oriented XML (such as XHTML or DocBook).
    '''
    result = []
    if self.text is not None:
        result.append(self.text)
    for elem in self:
        result.extend(elem.textlist(True))
    if _addtail and self.tail is not None:
        result.append(self.tail)
    return result


# inject the new method into the ElementTree framework
from xml.etree.ElementTree import _Element
_Element.textlist = _textlist


## ---------- Sample calls -----------

from xml.etree.ElementTree import XML
from textwrap import fill

xhmtl_fragment = XML('''
<ul>
<li>XHTML documents start with an <span class="code">&lt;html&gt;</span> tag - there is no such thing as an <span class="code">&lt;xhtml&gt;</span> tag.</li>
<li>It is required that you declare the XHTML namespace inside the opening <span class="code">&lt;html&gt;</span> tag.</li>
<li>This XHTML example covered the use of XHTML transitional - for XHTML strict or frameset, use the appropriate
<a title="Declaring a DocType" href="/xhtml/doctype/" >DOCTYPE Declaration</a>.</li>
<li>Remember that declaring a DOCTYPE with a valid identifier at the top of an XHTML page puts most browers
in <i>standards</i> mode- so remember not to use old browser hacks, and non-standard tags. (Otherwise, use just use regular HTML)</li>
<li>For some browsers, including Microsoft Internet Explorer 6, if you start an XHTML page with the XML declaration,
the browser goes into <i>quirks</i> mode, an unfortunate bug. The workaround is to delete the optional 
declaration and declare the the encoding using a meta tag.</li>
<li>The DOCTYPE declaration has to be in all uppercase characters, just like in the XHTML example code.</li>
</ul>
''')

print fill(''.join(xhmtl_fragment.textlist()))


docbook_fragment = XML('''
<book id="ashortbook">
  <title>History of Computer Programming</title>
  <chapter id="afirstchapter">
    <title>Chapter 1 -- Evolution</title>
    <para>In the beginning, there was machine language.   Then, arose assember.</para>
    <para>From those humble beginnings, a thousand languages were born.</para>
  </chapter>
  <chapter id="asecondchapter">
    <title>Chapter 2 -- Consolidation </title>
    <para>Eventually, all designs converged on variants on LISP.</para>
  </chapter>
</book>
''')

print '\n'.join(map(fill, docbook_fragment.textlist()))


## ---------- Sample output -----------

'''
 XHTML documents start with an <html> tag - there is no such thing as
an <xhtml> tag. It is required that you declare the XHTML namespace
inside the opening <html> tag. This XHTML example covered the use of
XHTML transitional - for XHTML strict or frameset, use the appropriate
DOCTYPE Declaration. Remember that declaring a DOCTYPE with a valid
identifier at the top of an XHTML page puts most browers in standards
mode- so remember not to use old browser hacks, and non-standard tags.
(Otherwise, use just use regular HTML) For some browsers, including
Microsoft Internet Explorer 6, if you start an XHTML page with the XML
declaration, the browser goes into quirks mode, an unfortunate bug.
The workaround is to delete the optional  declaration and declare the
the encoding using a meta tag. The DOCTYPE declaration has to be in
all uppercase characters, just like in the XHTML example code.

History of Computer Programming


Chapter 1 -- Evolution

In the beginning, there was machine language.   Then, arose assember.

From those humble beginnings, a thousand languages were born.


Chapter 2 -- Consolidation

Eventually, all designs converged on variants on LISP.
'''

As distributed in Python 2.5, ElementTree's default text handling does not directly support extracting text from nested XML tags. Given an XHTML fragment like '[p]the [em]quick[/em] brown fox[/p]' (reading the square brackets as angle brackets), ElementTree parses the fragment into p.text='the ', em.text='quick', and em.tail=' brown fox'. Because the sentence is split across the text and tail attributes of two different elements, it is inconvenient to extract the full sentence, 'the quick brown fox'.
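The split described above can be inspected directly. This is a small sketch written for Python 3 (the recipe itself targets Python 2.5); the variable names `p` and `em` are chosen here only for illustration.

```python
from xml.etree.ElementTree import XML

# Parse the fragment from the discussion, with real angle brackets.
p = XML('<p>the <em>quick</em> brown fox</p>')
em = p[0]  # the only child of <p>

print(repr(p.text))   # 'the '        text before the first child
print(repr(em.text))  # 'quick'       text inside <em>
print(repr(em.tail))  # ' brown fox'  text after </em>, still inside <p>
print(repr(p.tail))   # None          nothing follows the outermost element
```

Note that the text following the em element is stored on em itself (as its tail), not on the enclosing p element.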

This recipe injects a new method into the ElementTree framework. The method extracts all the PCDATA inside the element and its sub-elements.
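For readers on later Python versions (2.7 and 3.2 onward), the standard library's `Element.itertext()` yields the same sequence of text pieces without any injection. The following is a minimal sketch of that alternative, not part of the original recipe; the name `fragment` is illustrative.

```python
from xml.etree.ElementTree import XML

fragment = XML('<p>the <em>quick</em> brown fox</p>')

# itertext() walks the element and its sub-elements in document order,
# yielding each text and each inner tail, but not the tail of the
# element it was called on.
pieces = list(fragment.itertext())
sentence = ''.join(pieces)

print(pieces)    # ['the ', 'quick', ' brown fox']
print(sentence)  # the quick brown fox
```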

The private _addtail argument controls whether an element's tail is included in the result. For the outermost element, we do not want the tail because it lies outside the element's end tag. However, the tails of the inner elements all fall between the start and end tags of the outermost element, so the recursive calls pass True.
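The effect of _addtail is easiest to see when the element being extracted is itself followed by trailing text. The sketch below restates the recipe's logic as a standalone function for Python 3, so it needs no injection into ElementTree. The function form and the sample `<div>` markup are assumptions made for illustration, not the recipe's own code.

```python
from xml.etree.ElementTree import XML

def textlist(elem, _addtail=False):
    """Return the text strings inside elem and its sub-elements."""
    result = []
    if elem.text is not None:
        result.append(elem.text)
    for child in elem:
        # Tails of inner elements lie inside elem, so include them.
        result.extend(textlist(child, True))
    # The tail of the outermost element lies outside it, so it is
    # only added when a recursive call asks for it.
    if _addtail and elem.tail is not None:
        result.append(elem.tail)
    return result

root = XML('<div><p>the <em>quick</em> brown fox</p> jumps</div>')
p = root[0]

print(textlist(p))     # ['the ', 'quick', ' brown fox']
print(textlist(root))  # ['the ', 'quick', ' brown fox', ' jumps']
```

Called on p, the function leaves out p's own tail (' jumps'). Called on the enclosing div, the same tail is included because it now falls inside the outermost element.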

Hopefully, this recipe will make ElementTree more appealing by addressing a long-standing concern about its applicability to prose embedded in nested XML markup.