
A serialization library for some of the more basic Python types. It does not suffer from the security flaws that cPickle and pickle do.

Python, 126 lines
#!/usr/bin/env python
# -*- coding: utf-8 -*-

__author__ = "James Eric Pruitt"
__all__ = [ "serialize", "deserialize" ]
__version__ = "2009.11.04"

import collections

itertable = {}
for t in [set, frozenset, list, tuple, dict]:
    itertable[t] = t.__name__
    itertable[t.__name__] = t

supporteddict = {
    int: (repr, int, "int"),
    long: (repr, long, "long"),
    bool: (repr, lambda s: s == "True", "bool"),
    complex: (repr, lambda s: complex(s[1:-1]), "complex"),
    float: (repr, float, "float"),

    str: (
        lambda s: s.encode("string-escape"),
        lambda s: s.decode("string-escape"), "str"),
    unicode: (
        lambda s: s.encode("unicode-escape"),
        lambda s: s.decode("unicode-escape"), "unicode"),

    type(None): (repr, lambda s: None, "None"), # None is a special case;
}                                               # type(None) != None

# inverted dictionary
supporteddictinv = dict(
    (name,func) for (_,func,name) in supporteddict.itervalues())

def serialize(root):
    """
    Serializes some of the fundamental data types in Python.

    Serialization function designed to not possess the same security flaws
    as the cPickle and pickle modules. At present, the following data types
    are supported:

        set, frozenset, list, tuple, dict, int, long, bool, complex, float,
        None, str, unicode

    To convert the serialized object back into a Python object, pass the text
    through the deserialize function.

    >>> deserialize(serialize((1, 2, 3+4j, ['this', 'is', 'a', 'list'])))
    (1, 2, (3+4j), ['this', 'is', 'a', 'list'])
    """
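    # Each output record is four lines: element id, parent id, type label
    # ('C' for containers) and payload (container type name or leaf value).
    stack = collections.deque([ (0,(root,)) ])
    lintree, eid = collections.deque(), 0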
    while stack:
        uid, focus = stack.pop()
        for element in focus:
            eid += 1
            if hasattr(focus, "keys"): # Support for dictionaries
                lintree.appendleft((eid, uid, 'C', "tuple"))
                stack.append((eid, (element, focus[element])))
            elif hasattr(element, "__iter__"):
                lintree.appendleft((eid, uid, 'C', itertable[type(element)]))
                stack.append((eid, element))
            else:
                elementtype = type(element)
                serializefunc, _, label = supporteddict[elementtype]
                lintree.appendleft((eid, uid, label, serializefunc(element)))

    return '\n'.join(str(element) for entry in lintree for element in entry)

def deserialize(text):
    """
    Deserializes data generated by the serialize function.

    >>> deserialize(serialize((1, 2, 3+4j, ['this', 'is', 'a', 'list'])))
    (1, 2, (3+4j), ['this', 'is', 'a', 'list'])
    """
    nodaldict = { 0: collections.deque() }
    text = text.split('\n')
    lastpid = int(text[1])
    for quartet in xrange(0, len(text) - 1, 4):
        eid, pid = int(text[quartet]), int(text[quartet+1])
        moniker = text[quartet+2]
        if moniker == 'C':
            encapsulator = itertable[text[quartet+3]]
            appendage = encapsulator(nodaldict.get(eid, collections.deque()))
        else:
            deserializer = supporteddictinv[moniker]
            appendage = deserializer(text[quartet+3])
        nodaldict.setdefault(pid, collections.deque()).appendleft(appendage)

    return nodaldict[0].pop()

def test(supressoutput = False):
    testvectors = [
        list(((None, True, False), (1, 12341234123412341234123412341234L,0.5),
              0.12341234123412341234,
              u'This is\nan\tUnicode string\u0A0D\N{HANGUL SYLLABLE BYENH}',
              set(('A','B','D')),
            frozenset(tuple((1, '9', -0.12341234123412341234+1j, 'Y'))))),
        tuple(),
        list(),
        set(),
        frozenset(),
        'Element that is not nested.',
        {'a': (1, "Waka"), ('X', 'K'): u'CD', u'y':{ 1:'O', 2:('T','wo')}}]

    # Recursion not yet properly supported.
    #x = {'a': None, 'z': None}
    #y = {'x': x}
    #x['y'] = y
    #testvectors.extend([x,y])
    for root in testvectors:
        serialized = serialize(root)
        inverse = deserialize(serialized)
        if not supressoutput:
            print "Expected: ", repr(root)
            print "Returned: ", repr(inverse)
        if (inverse != root):
            raise ValueError, "The test failed."

if __name__ == '__main__':
    print "Running test..."
    test()
    print "Test passed."
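
The text that serialize emits is a flat sequence of four-line records: element id, parent id, type label ('C' for a container) and payload, with children listed before their parents. The sketch below splits such text into records. It is written for Python 3 rather than the recipe's Python 2, iter_records is a hypothetical helper that is not part of the recipe, and the sample string is what serialize([1, 'a']) should produce, derived by hand from the listing rather than captured from a run.

```python
def iter_records(text):
    """Hypothetical helper: yield (eid, pid, label, payload) per record."""
    lines = text.split('\n')
    # Same stepping as the recipe's deserialize loop: four lines per record.
    for i in range(0, len(lines) - 1, 4):
        eid, pid = int(lines[i]), int(lines[i + 1])
        yield eid, pid, lines[i + 2], lines[i + 3]


# Assumed output of serialize([1, 'a']) under Python 2 (hand-derived).
sample = "3\n1\nstr\na\n2\n1\nint\n1\n1\n0\nC\nlist"
records = list(iter_records(sample))

for record in records:
    print(record)
```

The last record has parent id 0, which marks it as the root object.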

These functions serialize fundamental Python data types. Both functions are non-recursive and do not possess the same security flaws that cPickle and pickle do. They work well when you do not need to serialize custom classes or objects with additional attributes. The code works on Python versions 2.4 to 2.7 (trunk). Beware that serializing recursive (self-referencing) data does not work and causes an infinite loop.
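
One way to avoid the infinite loop is to reject self-referencing data before calling serialize. The following is a minimal sketch, assuming Python 3 and a hypothetical helper named contains_cycle that is not part of the recipe. Like the recipe, it walks the structure with an explicit stack instead of recursion.

```python
CONTAINERS = (list, tuple, set, frozenset, dict)


def _children(obj):
    # For dictionaries, both keys and values count as children.
    if isinstance(obj, dict):
        return iter(list(obj.keys()) + list(obj.values()))
    return iter(obj)


def contains_cycle(root):
    """Hypothetical helper: True if a container reaches itself again."""
    if not isinstance(root, CONTAINERS):
        return False
    on_path = {id(root)}                  # containers on the current path
    stack = [(root, _children(root))]     # (container, iterator over it)
    while stack:
        obj, it = stack[-1]
        for child in it:
            if isinstance(child, CONTAINERS):
                if id(child) in on_path:
                    return True           # child is its own ancestor
                on_path.add(id(child))
                stack.append((child, _children(child)))
                break                     # descend into the child first
        else:
            # Iterator exhausted: this container is fully explored.
            on_path.discard(id(obj))
            stack.pop()
    return False


# The recursive structure from the recipe's commented-out test case.
x = {'a': None, 'z': None}
y = {'x': x}
x['y'] = y
looped = contains_cycle(x)
```

Only true cycles are reported. The same list appearing twice inside a structure is shared data, not a cycle, and passes the check.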

To add support for a new type, add an entry to supporteddict whose key is the data type. In most cases the key is just the name of the data type, but for others you may need to set the key using type(typename) or type(type_instance), as the listing does with type(None). The entry is a three-element tuple: the first element is the function called to serialize the data type, the second is the function that deserializes the data, and the third is a string naming the type.
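
As an illustration of such an entry, the sketch below registers decimal.Decimal. It is an assumption-laden example, not part of the recipe: it targets Python 3, uses a cut-down stand-in for supporteddict, and the choice of Decimal and its "Decimal" label are made up for the example.

```python
import decimal

# Stand-in for the recipe's table: type -> (serializer, deserializer, label).
supporteddict = {
    int: (repr, int, "int"),
    float: (repr, float, "float"),
}

# Hypothetical new entry: Decimal round-trips through its string form.
supporteddict[decimal.Decimal] = (str, decimal.Decimal, "Decimal")

# The inverted table (label -> deserializer) is built from supporteddict,
# so it has to be built after the new entry is in place.
supporteddictinv = dict(
    (name, func) for (_, func, name) in supporteddict.values())

value = decimal.Decimal("12.50")
serializefunc, _, label = supporteddict[type(value)]
text = serializefunc(value)               # payload line of the record
restored = supporteddictinv[label](text)  # what deserialize would rebuild
```

Because records are separated by newlines, a serializer added this way must never return text containing a newline; the str and unicode entries in the listing escape their payload for that reason.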

Change Log (Minor changes may not be noted):

2009.11.04 Changes by Gabriel Genellina:

  • __builtins__ is no longer used or imported

  • repr instead of str for all conversions

  • added an inverted dictionary typename -> function to ease unserialize (and avoid __builtins__ too).

  • A couple additional test cases (more decimal places, long integers, unicode strings). Note that the test would fail if still using str instead of repr.

2009.11.03 Fixed a security issue with iterables and simplified "supporteddict."

2009.11.02 Added support for dictionaries.
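
The 2009.11.03 fix resolves container names through a white list instead of looking them up among the built-ins. A minimal sketch of that idea follows, assuming Python 3. The lookup_container helper is hypothetical, and it raises ValueError where the recipe itself lets the KeyError propagate.

```python
# White list of container constructors, keyed by type name.
itertable = {}
for t in [set, frozenset, list, tuple, dict]:
    itertable[t.__name__] = t


def lookup_container(name):
    """Hypothetical helper: resolve a container name via the white list."""
    try:
        return itertable[name]
    except KeyError:
        # Names such as 'eval' are never looked up among the built-ins.
        raise ValueError("unsupported container type: %r" % (name,))


constructor = lookup_container("list")
```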

9 comments

Gabriel Genellina 12 years, 1 month ago

You claim those functions "do not suffer from the security flaws that cPickle and pickle do".

But since unserialize uses getattr(__builtin__, some_arbitrary_data) it looks to me as unsafe as pickle, or even worse.

Eric Pruitt (author) 12 years, 1 month ago

I don't see the problem with the way I am using __builtin__. No function calls are made from built in and no objects can be accessed that aren't in the white list.

I think you are referring to line 81:

_, deserial, _ = supportedind[getattr(__builtin__, moniker)]

No calls are being made with the function that is specified. getattr(...) is harmless in this case because I am not executing any code based on user input. It is run through the whitelist dictionary which contains safe deserialization methods.

Eric Pruitt (author) 12 years, 1 month ago

You say it "looks" unsafe but provide no analysis or example of a flaw. On Linux, "rm -rf /" looks unsafe, but the command will have no effect when run by a normal user. If I am missing something, can you please show me an example of this code being exploited?

Gabriel Genellina 12 years, 1 month ago

No, I was thinking of the other place where __builtin__ is used (when moniker is 'C'). What if someone builds a quartet with 'eval' in the last item?

unserialize('0\n0\nC\neval')
  File "d:\temp\serialize.py", line 79, in unserialize
    appendage = encapsulator(nodaldict.get(eid, collections.deque()))
  TypeError: eval() arg 1 must be a string or code object

Although this is not a working example, it shows that eval is attempted. One should prove that no input string could lead to eval being successfully called with arbitrary arguments.

Eric Pruitt (author) 12 years, 1 month ago

Thank you for pointing that out. After testing it further, it seems eval cannot be successfully executed because the data that is passed to the "encapsulator" function is always a deque() type.

>>> deserialize('2\n1\nstr\nprint 666\n1\n0\nC\neval')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "simpleserialize.py", line 89, in deserialize
    appendage = encapsulator(nodaldict.get(eid, collections.deque()))
TypeError: eval() arg 1 must be a string or code object

For good measure, I have now implemented white list checking of anything that is allegedly an iterator:

>>> deserialize('0\n0\nC\neval')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "../simpleserialize.py", line 89, in deserialize
    encapsulator = itertable[text[quartet+3]]
KeyError: 'eval'

Again, thanks for the input.

Eric

Gabriel Genellina 12 years, 1 month ago

My previous comments aren't applicable to the current version of this recipe, and I'd remove them if I could. I don't consider this "unsafe" anymore (at least, no more "unsafe" than standard Python code).

Thomas 12 years ago

I would like to use this recipe in a project of mine. I just registered with ActiveState, so I'm not aware of the conventions around here. Is the code licensed under an open-source license?

Thanks for the recipe.

Cheers, Thomas

Gabriel Genellina 11 years, 11 months ago

@Thomas: You may try to contact the author. The MIT license applies:

http://code.activestate.com/help/terms/

Eric Pruitt (author) 11 years, 11 months ago

Thomas, you are more than welcome to use my code and as Gabriel stated, the MIT license is site wide. If you need my e-mail address, Google me on the Python dev-list.

Created by Eric Pruitt on Mon, 2 Nov 2009 (MIT)