A subclass of str that allows simple handling of Pango markup or other simple XML markup. The goal is for a slice of a marked-up Python string to "do the right thing" and preserve formatting correctly. In other words, MarkupString("<i>Hello World</i>")[6:] == "<i>World</i>".

Python, 87 lines
import xml.sax
import xml.sax.saxutils  # provides quoteattr(), used when rebuilding start tags

class simpleHandler (xml.sax.ContentHandler):
    """A simple handler that provides us with indices of marked up content."""
    def __init__ (self):        
        self.elements = [] #this will contain a list of elements and their start/end indices
        self.open_elements = [] #this holds info on open elements while we wait for their close
        self.content = ""

    def startElement (self,name,attrs):
        if name=='foobar': return # we require an outer wrapper, which we promptly ignore.
        self.open_elements.append({'name':name,
                                   'attrs':attrs.copy(),
                                   'start':len(self.content),
                                   })

    def endElement (self, name):
        if name=='foobar': return # we require an outer wrapper, which we promptly ignore.
        for i in range(len(self.open_elements)):
            e = self.open_elements[i]
            if e['name']==name:
                # append a  (start,end), name, attrs
                self.elements.append(((e['start'], #start position
                                       len(self.content)),# current (end) position
                                      e['name'],e['attrs'])
                                     )
                del self.open_elements[i]
                return

    def characters (self, chunk):
        self.content += chunk

class MarkupString (str):
    """A simple class for dealing with marked up strings. When we are sliced, we return
    valid marked up strings, preserving markup."""
    def __init__ (self, string):        
        str.__init__(self,string)
        self.handler = simpleHandler()
        xml.sax.parseString("<foobar>%s</foobar>"%string,self.handler)
        self.raw=self.handler.content

    def __getitem__ (self, n):
        return self.__getslice__(n,n+1)

    def __getslice__ (self, s, e):
        # only include relevant elements
        if not e or e > len(self.raw): e = len(self.raw)
        elements = filter(lambda tp: (tp[0][1] >= s and # end after the start...
                                      tp[0][0] <= e # and start before the end
                                      ),
                          self.handler.elements)
        ends = {}
        starts = {}
        for el in elements:
            # cycle through elements that affect our slice and keep track of
            # where their start and end tags should go.
            pos = el[0]
            name = el[1]
            attrs = el[2]
            # write our start tag <stag att="val"...>
            stag = "<%s"%name
            for k,v in attrs.items(): stag += " %s=%s"%(k,xml.sax.saxutils.quoteattr(v))
            stag += ">"
            etag = "</%s>"%name # simple end tag
            spos = pos[0]
            epos = pos[1]
            if spos < s: spos=s
            if epos > e: epos=e
            if epos != spos: # we don't care about tags that don't markup any text
                if spos not in starts: starts[spos]=[]
                starts[spos].append(stag)
                if epos not in ends: ends[epos]=[]
                ends[epos].append(etag)
        outbuf = "" # our actual output string
        for pos in range(s,e): # we move through positions
            char = self.raw[pos]
            if pos in ends:  # if there are endtags to insert...
                for et in ends[pos]: outbuf += et
            if pos in starts: # if there are start tags to insert
                mystarts = starts[pos]
                # reverse these so the order works out,e.g. <i><b><u></u></b></i>
                mystarts.reverse()
                for st in mystarts: outbuf += st
            outbuf += char
        if e in ends:
            for et in ends[e]: outbuf+= et
        return MarkupString(str(outbuf)) # the str call is necessary to avoid unicode messiness

I came up with this while working on a gnomeprint-related hack that involved cutting up and moving around strings with Pango markup. I kept breaking the markup whenever I cut up the strings, and it occurred to me that a Python string subclass that preserves XML markup correctly when sliced and diced would be useful.

MarkupString only cares about tags that apply to the content they contain (i.e. true markup). Empty tags like <br/> mark up no text, so they are dropped from slices. For Pango, this is never a problem anyway.
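
To see why tags with no content fall out, it helps to look at the character span each element covers in the plain text. The following is a minimal Python 3 sketch, not part of the recipe; SpanLogger is a hypothetical helper name, and the <p>/<br/> input is only an illustration. An empty element starts and ends at the same offset, so a check like the recipe's "epos != spos" skips it.

```python
import xml.sax


class SpanLogger(xml.sax.ContentHandler):
    """Hypothetical helper: record (name, start, end) for each element,
    with offsets counted in the plain (tag-free) text."""

    def __init__(self):
        super().__init__()
        self.text = ""
        self.starts = []  # stack of start offsets for elements still open
        self.spans = []   # closed elements, innermost first

    def startElement(self, name, attrs):
        self.starts.append(len(self.text))

    def endElement(self, name):
        self.spans.append((name, self.starts.pop(), len(self.text)))

    def characters(self, chunk):
        self.text += chunk


logger = SpanLogger()
xml.sax.parseString(b"<p>one<br/>two <i>three</i></p>", logger)

# Keep only the elements that wrap at least one character.
marking = [span for span in logger.spans if span[1] != span[2]]
print(logger.spans)
print(marking)
```

Here <br/> is recorded with identical start and end offsets, so it is the only element filtered out of the second list.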

Here's how it works:

>>> s=MarkupString('<b>hello <i>world</i></b>')
>>> s
'<b>hello <i>world</i></b>'
>>> s[6:]
'<b><i>world</i></b>'
>>> s[6:][2]
'<b><i>r</i></b>'
>>> s[0:4]
'<b>hell</b>'
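
The listing above relies on __getslice__, which Python 3 no longer has. For readers on Python 3, the same idea can be sketched as a plain function. This is an illustration under stated assumptions rather than a port of the recipe: SpanCollector, slice_markup and the <wrapper> element are hypothetical names, and unlike the recipe this sketch re-escapes characters such as & in the output.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class SpanCollector(xml.sax.ContentHandler):
    """Record (start, end, name, attrs) per element, indexed into the plain text."""

    def __init__(self):
        super().__init__()
        self.spans = []  # closed elements, innermost first
        self.stack = []  # elements still open
        self.text = ""

    def startElement(self, name, attrs):
        self.stack.append((name, dict(attrs.items()), len(self.text)))

    def endElement(self, name):
        open_name, attrs, start = self.stack.pop()
        self.spans.append((start, len(self.text), open_name, attrs))

    def characters(self, chunk):
        self.text += chunk


def slice_markup(markup, start, end=None):
    """Return markup for plain-text characters [start:end], re-wrapped in its tags."""
    handler = SpanCollector()
    # Hypothetical outer wrapper so fragments with several top-level nodes parse.
    xml.sax.parseString(("<wrapper>%s</wrapper>" % markup).encode("utf-8"), handler)
    text = handler.text
    start, end, _ = slice(start, end).indices(len(text))
    opens, closes = {}, {}
    for s, e, name, attrs in handler.spans[:-1]:  # the last span is the wrapper
        s, e = max(s, start), min(e, end)         # clamp the element to the slice
        if s >= e:
            continue  # element marks up no text inside the slice
        attr_text = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in attrs.items())
        opens.setdefault(s, []).insert(0, "<%s%s>" % (name, attr_text))  # outermost first
        closes.setdefault(e, []).append("</%s>" % name)                  # innermost first
    out = []
    for pos in range(start, end):
        out.extend(closes.get(pos, []))  # close tags that end here...
        out.extend(opens.get(pos, []))   # ...before opening new ones
        out.append(escape(text[pos]))
    out.extend(closes.get(end, []))
    return "".join(out)


print(slice_markup('<b>hello <i>world</i></b>', 6))
print(slice_markup('<b>hello <i>world</i></b>', 0, 4))
```

Indices refer to the unmarked text, as in the recipe, and negative indices are resolved against that text as well.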

Note that len(s) gives the length of the full marked-up string, whereas string indices refer to the unmarked content. To get the length of the unmarked content, use len(s.raw) (you could also add a __len__ method that returns len(s.raw)).
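
The __len__ idea could look roughly like this in Python 3. It is a self-contained sketch, not the recipe's class: PlainLengthMarkup, _TextOnly and the <wrapper> element are hypothetical names, and the plain text is computed in __new__ because that is where a str subclass receives its value.

```python
import xml.sax


class _TextOnly(xml.sax.ContentHandler):
    """Accumulate character data and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.text = ""

    def characters(self, chunk):
        self.text += chunk


class PlainLengthMarkup(str):
    """Hypothetical str subclass whose len() counts only the unmarked text."""

    def __new__(cls, markup):
        self = super().__new__(cls, markup)
        handler = _TextOnly()
        # Hypothetical outer wrapper so the fragment always has a single root.
        xml.sax.parseString(("<wrapper>%s</wrapper>" % markup).encode("utf-8"), handler)
        self.raw = handler.text
        return self

    def __len__(self):
        return len(self.raw)


s = PlainLengthMarkup('<b>hello <i>world</i></b>')
print(len(s))       # length of the unmarked text
print(len(str(s)))  # length of the full marked-up string
```

One caveat with this design: only code that calls len() sees the shorter length. Built-in string operations work on the underlying marked-up data and may not consult __len__, so mixing the two views can be confusing.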

Created by Thomas Hinkle on Mon, 21 Feb 2005 (PSF)