Turn a string representation of a list back into a list, including nested lists (list elements can themselves be lists). Includes functions for quoting and unquoting list elements.

Python, 264 lines
#29-04-04
# v1.0.1
# E-mail fuzzyman AT atlantibots DOT org DOT uk (or michael AT foord DOT me DOT uk )
# Maintained at www.voidspace.org.uk/atlantibots/pythonutils.html
# Used by ConfigObj for storing config files with lists of values.

def listparse(inline, recursive = 1, comment = 1, retain = 0, lpstack = None, **keywargs):
    """Parses a line (a string) as a representation of a list. Can recursively parse nested lists. (List members can themselves be lists).
    List elements are stripped - and are returned as either lists or strings.

    This is useful for storing lists of information as text - for example in config files

    Listparse returns the list and trailing comments or None if the list is badly built.
    
    A valid comment comes after the end of the list (and any whitespace) and starts with '#' or ';'.
    The returned comment includes the initial '#' or ';'.
    
    Commas delimit list elements.
    If the first non whitespace character in a list element is '[' then that element is treated as a list.

    Inside the list '[', ']', '"', "'" or '\' can be escaped with '\'
    (or indeed any other character - a single '\' will always be treated as escaping the character that follows).
    The leading '\' of escaped characters is *not* retained.
    Any unquoted list elements must not have an unescaped ']' in them - except to terminate the current list.
    Escaping can be switched off by passing in a keyword argument 'escapechar' set to None.
    If you want to use literal '\' without escaping them - then you must switch escaping off.
    (If you make sure every element of a list is contained within quotes - using the elem_quote function - this shouldn't be a problem.)

    If retain is set to 1 (default is 0) any quotes around elements will be retained.
    This could be used to specify element types - e.g. if it has quotes it is a string. 
    So the function unquote can be used recursively to check if a list element is validly quoted.
    (and here you could implement other methods for unquoted elements - e.g. check for None or integer values etc...)
    *However* if an element is quoted - it must be correctly quoted, or the element will be invalid.
    The default is for quotes to be removed.

    If recursive is set to 0 (default is 1)
    then list elements will not be recursively parsed - an element containing another list will just
    be returned as a string.
    (meaning an unescaped and unquoted ']' will close the current list... and listparse will say you have a bad list).

    lpstack is used for recursion. When it is set, listparse parses the current list and returns the rest of the line as well.

    If comment is set to 0 (default is 1)
    It causes listparse to return None if there is anything other than whitespace after a valid list.
    (I.e. comments are not allowed). In this case it will only return the list.
    """
    escapechar = keywargs.get('escapechar', True)         # truthy: escaping on; None: escaping off
    outlist = []
    inline = inline.strip()
    if not inline or inline[0] != '[':          # an empty line is not a list (and inline[0] would raise IndexError)
        return None
    inline = inline[1:].lstrip()
    found_end = 0
    thiselement = None
    escape = 0
    while inline:
        if thiselement is None:         # start of the element
            output = unquote(inline, 0, retain, escapechar=escapechar)          # partquote mode; quotes are kept only if retain is set
            if output is None:
                return None
            if output != -1:            # element is quoted
                thiselement, inline = output
                inline = inline.lstrip()
                if not inline:
                    return None
                if inline[0] not in [',', ']']:     # only two valid ways to terminate an element
                    return None
                continue
                
        thischar = inline[0]
        inline = inline[1:]
        if escape:                      # the current character is escaped... whatever it may be
            thiselement = __add(thiselement, thischar)
            escape = 0
            continue
        elif thischar == '\\' and escapechar:
            escape = 1
#            thiselement = __add(thiselement, thischar)             # commenting this out means we no longer retain the initial '\' if quoting is on
            continue
        if recursive and not thiselement and thischar == '[':
            output = listparse('[' + inline, True, comment, retain, True, escapechar=escapechar)            # we have found a list element, herewith lies recursion...
            if not output:
                return None         # which is badly formed
            thiselement, inline = output
            inline = inline.lstrip()
            if not inline:
                return None
            if inline[0] not in [',', ']']:     # only two valid ways to terminate an element
                return None
            continue
        if thischar == ',':         # element terminated
            outlist.append(thiselement)
            thiselement = None
            inline = inline.lstrip()
            continue
        if thischar == ']':
            if thiselement is not None:                 # trap empty lists
                outlist.append(thiselement)
            found_end = 1
            if lpstack:
                return outlist, inline
            break
        thiselement = __add(thiselement, thischar)
    if not found_end:
        return None
    inline = inline.strip()
    if inline and not comment:
        return None
    elif not comment:
        return outlist
    if inline and inline[0] not in ['#',';']:
        return None
    return outlist, inline
            
def __add(thiselement, char):
    """Shorthand for adding a character...."""
    if thiselement is None:
        return char
    return thiselement + char

def unquote(inline, fullquote = 1, retain = 0, **keywargs):
    """Given a line - if it's correctly quoted - it returns the 'unquoted' value.
    If not quoted at all, it returns -1.
    If badly quoted, it returns None.
    
    inline is stripped before starting.

    Any instances of '&mjf-quot;' found (from elem_quote) are turned back into '"'
    Any instances of '&mjf-lf;' found (from elem_quote) are turned back into '\n'
    
    Quotes can be escaped with a '\'.
    '\' (or any other character) can also be escaped with a '\'.
    No triple quotes though :-)
    (Escaping can be switched off by passing in the keyword argument 'escapechar' set to None.
    If you want to use literal '\' without escaping them then you must turn escaping off.)

    If fullquote is set to 0 (default is 1)
    then unquote will return the first correctly quoted part of the line *and* the rest of the line.
    If retain is set to 1 (default is 0)
    then unquote will retain the quote characters in the returned value."""
    escapechar = keywargs.get('escapechar', True)
    outline = ''
    quotes = ["'",'"']
    escape = 0
    index = 0
    quotechar = None
    inline = inline.strip()
    while index < len(inline):
        thischar = inline[index]
        index += 1
        if not quotechar and thischar not in quotes:
            return -1
        elif not quotechar:
            quotechar = thischar
            if retain:
                outline += thischar
            continue
        if escape:
            outline += thischar
            escape = 0
            continue
        if thischar in quotes:
            if thischar == quotechar:
                if retain:
                    outline += thischar
                if not fullquote:
                    return outline.replace('&mjf-quot;','\"').replace('&mjf-lf;','\n'), inline[index:]
                elif index == len(inline):
                    return outline.replace('&mjf-quot;','\"').replace('&mjf-lf;','\n')
                else:
                    return None
            else:
                outline += thischar
                continue
        if thischar == '\\' and escapechar:         # a continue here to *not* retain the escape character 
            escape = 1
            continue
        outline += thischar
    return None


def list_stringify(inlist):
    """Recursively rebuilds a list - making all the members strings...
    Useful before writing out lists.
    Used by makelist."""
    outlist = []
    for item in inlist:
        if not isinstance(item, list):
            if not isinstance(item, str):
                thisitem = str(item)
            else:
                thisitem = item
        else:
            thisitem = list_stringify(item)
        outlist.append(thisitem)
    return outlist


def makelist(inlist):
    """Given a list - will turn it into a string... suitable for writing out.
    (and then reparsing with listparse.)

    Uses list_stringify to make sure all elements are strings and
    elem_quote to decide the most appropriate quoting.

    (This means it adds quoting to every element and, where necessary, escapes
    '"' as '&mjf-quot;' and '\n' as '&mjf-lf;'........)."""
    inlist = list_stringify(inlist)
    outline = '['
    if not inlist:         # an empty list
        outline += ']'
    else:
        for item in inlist:
            if not isinstance(item, list):
                outline += elem_quote(item)
                outline += ', '
            else:
                outline += makelist(item)
                outline += ', '
        if outline[-2:] == ', ':
            outline = outline[:-2]
        outline += ']'
    return outline

def elem_quote(member):
    """Simple method to add the most appropriate quote to an element.
    Element is first converted to a string.
    If the element contains both \' and \" then \" is escaped as '&mjf-quot;'
    If the element contains \n it is escaped as '&mjf-lf;'
    Both are restored transparently by unquote.

    If you only have literal strings at this stage and will be parsing with escaping on -
    you might want to do a replace('\\', '\\\\') on the member too...
    """
#        member = str(member)                                            # since we now stringify everything - this is probably a redundant command
    if member.find("'") == -1:
        outline = "'" + member + "'"
    elif member.find('"') == -1:
        outline = '"' + member + '"'
    else:
        outline = '"' + member.replace('"','&mjf-quot;')+'"'
    return outline.replace('\n','&mjf-lf;')


# brief test stuff
if __name__ == '__main__':
    test ='["hello", \'hello2\']'
    test1 = """['hello',"hello again", and again,['hello',"hello again", and again,], and last of all]"""
    print(listparse('[]'))
    print(test1)
    print(unquote('"hello baby", hello again', 0, 1))
    print(listparse(test1))
    print(listparse(test1,1,1,1))
    print(listparse(test))
    test1 = test1 +'   # hello'
    print(listparse(test1))
    print(listparse(test1, 0))       # no recursion      - without recursion the list is very badly formed, so returns None
    print(listparse(test1, 1, 0))    # the comment at the end causes listparse to return None here

When listparse finds a list element that is itself a list (one that starts with '['), it calls itself - a recursive list parser.
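The recursion can be seen in isolation in a much smaller parser. The following is only a sketch, not the recipe's listparse: it assumes no quoting, no escaping and no trailing comments, and the names parse_nested and _finish are made up for illustration. The nested flag plays the role that lpstack plays in the recipe: an inner call hands back the list it parsed together with the unparsed rest of the line.

```python
def _finish(element):
    """Tidy one finished element: strip strings, pass nested lists through."""
    if element is None:
        return ''                       # empty slot, e.g. the middle of 'a,,b'
    if isinstance(element, str):
        return element.strip()
    return element


def parse_nested(text, nested=False):
    """Parse '[a, [b, c], d]' into ['a', ['b', 'c'], 'd'].

    Returns None for a badly built list. When nested is true it returns
    (list, rest_of_text) so the caller can carry on after the inner list.
    """
    text = text.strip()
    if not text.startswith('['):
        return None
    text = text[1:].lstrip()
    items = []
    current = None                      # no element started yet
    while text:
        char, text = text[0], text[1:]
        if char == '[' and current is None:
            # The element starts with '[': parse it as a list in its own right.
            result = parse_nested('[' + text, nested=True)
            if result is None:
                return None             # the inner list is badly built
            current, text = result
            text = text.lstrip()
            if not text or text[0] not in ',]':
                return None             # only ',' or ']' may follow a list
        elif char == ',':
            items.append(_finish(current))
            current = None
            text = text.lstrip()
        elif char == ']':
            if current is not None:     # nothing to add for '[]' or 'a,]'
                items.append(_finish(current))
            if nested:
                return items, text
            return items if not text.strip() else None
        else:
            current = char if current is None else current + char
    return None                         # input ended before the closing ']'


print(parse_nested('[one, [two, [three]]]'))
```

Each '[' met at the start of an element starts a fresh call, so the nesting depth of the text becomes the depth of the call stack.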

You can use '\' to escape quotes or '[' in the elements - but this effectively prevents you from using literal '\' in your text lists (as in Python strings). It's better to ensure each element in the list is properly 'quoted'. Use the makelist and/or elem_quote functions to do this.
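To make the cost of escaping concrete, here is a minimal sketch of the rule the recipe applies (a single '\' is dropped and the character after it is kept). The function strip_escapes and its escaping flag are hypothetical names for this illustration; they are not part of the recipe.

```python
def strip_escapes(text, escaping=True):
    """Drop each backslash and keep the character that follows it."""
    out = []
    escaped = False
    for char in text:
        if escaped:
            out.append(char)            # the escaped character is kept as-is
            escaped = False
        elif escaping and char == '\\':
            escaped = True              # the backslash itself is not retained
        else:
            out.append(char)
    return ''.join(out)


path = 'C:\\temp\\new'                                # the text C:\temp\new
print(strip_escapes(path))                            # backslashes are lost
print(strip_escapes(path.replace('\\', '\\\\')))      # doubled first: preserved
print(strip_escapes(path, escaping=False))            # escaping off: preserved
```

Doubling every backslash before writing the text out, as the elem_quote docstring suggests, or switching escaping off are the two ways to keep literal backslashes intact.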

Useful for representing list information as text and then retrieving it. The information is easily human readable and editable.
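The round trip from value to text and back can be sketched on its own. This is a simplified stand-in for elem_quote and unquote, under the assumption that backslash escaping is switched off and that the input to unquote_element was produced by quote_element; both function names are invented for this example. The two placeholder strings are the ones the recipe uses.

```python
QUOT = '&mjf-quot;'     # placeholder the recipe uses for a double quote
LF = '&mjf-lf;'         # placeholder the recipe uses for a newline


def quote_element(text):
    """Wrap text in whichever quote character it does not contain."""
    if "'" not in text:
        quoted = "'" + text + "'"
    elif '"' not in text:
        quoted = '"' + text + '"'
    else:
        # Both kinds of quote are present: hide the double quotes.
        quoted = '"' + text.replace('"', QUOT) + '"'
    return quoted.replace('\n', LF)


def unquote_element(quoted):
    """Undo quote_element: drop the outer quotes, restore the placeholders."""
    inner = quoted[1:-1]
    return inner.replace(QUOT, '"').replace(LF, '\n')


values = ['plain', "it's", 'say "hi"', 'both \' and "', 'two\nlines']
line = '[' + ', '.join(quote_element(v) for v in values) + ']'
print(line)
print([unquote_element(quote_element(v)) for v in values] == values)
```

One limitation of this scheme is that a value which already contains the literal text '&mjf-quot;' or '&mjf-lf;' will not survive the round trip unchanged.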

ISSUES: I'm still not convinced I've got this 'escaping' business right. I never use it, which is probably part of the problem - I always quote elements instead, so I never need to escape characters. Maybe I should just get rid of it.

I use str() to make sure elements are always strings. This eliminates Unicode problems! But it also makes the recipe impossible to use if you need Unicode, I guess.
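For comparison, here is a sketch of the same stringify step under the assumption that it runs on Python 3, where str is already a Unicode text type, so the trade-off described above does not arise. The function name stringify is made up for this example; the recipe's own function is list_stringify.

```python
def stringify(items):
    """Return a copy of a nested list with every non-list member as a str."""
    return [stringify(item) if isinstance(item, list) else str(item)
            for item in items]


# Non-ASCII text passes through untouched; other types become their str() form.
print(stringify([1, 2.5, None, ['caf\u00e9', [True]]]))
```

Note that the type information is lost either way: None comes back as the string 'None' and 1 as '1'.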

4 comments

Raymond Hettinger 17 years, 7 months ago

Whew! That's a lot of work to avoid eval(s).

Michael Foord (author) 17 years, 7 months ago

Down With Eval. I can't stand eval.

Psyco doesn't handle eval very well.

Eval only works if the elements are all correctly quoted - as my main use is for reading in list values for simple config files I didn't want to enforce quoting.

Eval will also attempt to evaluate expressions in the list - as well as other possible bizarre things I didn't want to have to think about (including security risks, I guess). I wanted to guarantee a straightforward string version of the text - but with the list elements properly handled... which this does quite nicely.

And it was a nice little parser to write - the quote, unquote and makelist functions I needed anyway...
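The two objections to eval in this comment are easy to reproduce. The snippet below is a small illustration only; try_eval is a hypothetical helper written for this example, and it passes eval an empty globals dictionary so that no names from the surrounding program can leak in.

```python
def try_eval(text):
    """Return eval's result, or the name of the exception it raised."""
    try:
        return eval(text, {})
    except Exception as exc:
        return type(exc).__name__


print(try_eval("['hello', 'hello2']"))      # fully quoted: evaluates to a list
print(try_eval("[hello, hello2]"))          # unquoted names: NameError
print(try_eval("['hello', and again]"))     # free text: SyntaxError
print(try_eval("['a', 1 + 1]"))             # expressions get evaluated
```

So eval accepts only fully quoted lists and silently computes anything that looks like an expression, which is not what a config file reader usually wants.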

Raymond Hettinger 17 years, 7 months ago

Alternative to Eval. Having a little parser to replace eval() sure helps with the security risks. I was surprised at how much code it took though.

Michael Foord (author) 17 years, 7 months ago

It's only little...

Hmmm... the actual list parse function is about 70 lines - and could be reduced by putting some conditionals on single lines.

I guess if I were a better coder it would be even smaller...
Having said that - it works fine and does a nice job - it's useful to be able to have config files with keywords having a list of values.
Created by Michael Foord on Fri, 30 Apr 2004 (PSF)

Required Modules

  • (none specified)
