ActiveState Code

Recipe 576943: Serialize and Deserialize Securely


A small serialization library for some of the more basic Python types. It does not suffer from the security flaws that cPickle and pickle do.

Python
#!/usr/bin/env python
# -*- coding: utf-8 -*-

__author__ = "James Eric Pruitt"
__all__ = [ "serialize", "deserialize" ]
__version__ = "2009.11.04"

import collections

itertable = {}
for t in [set, frozenset, list, tuple, dict]:
    itertable[t] = t.__name__
    itertable[t.__name__] = t

supporteddict = {
    int: (repr, int, "int"),
    long: (repr, long, "long"),
    bool: (repr, lambda s: s == "True", "bool"),
    # repr() parenthesizes 3+4j but not 1j, so strip parentheses if present
    complex: (repr, lambda s: complex(s.strip("()")), "complex"),
    float: (repr, float, "float"),

    str: (
        lambda s: s.encode("string-escape"),
        lambda s: s.decode("string-escape"), "str"),
    unicode: (
        lambda s: s.encode("unicode-escape"),
        lambda s: s.decode("unicode-escape"), "unicode"),

    type(None): (repr, lambda s: None, "None"), # None is a special case;
}                                               # type(None) != None

# inverted dictionary
supporteddictinv = dict(
    (name,func) for (_,func,name) in supporteddict.itervalues())

def serialize(root):
    """
    Serializes some of the fundamental data types in Python.

    Serialization function designed to not possess the same security flaws
    as the cPickle and pickle modules. At present, the following data types
    are supported:

        set, frozenset, list, tuple, dict, int, long, bool, complex, float,
        None, str, unicode

    To convert the serialized object back into a Python object, pass the text
    through the deserialize function.

    >>> deserialize(serialize((1, 2, 3+4j, ['this', 'is', 'a', 'list'])))
    (1, 2, (3+4j), ['this', 'is', 'a', 'list'])
    """
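    # Explicit stack of (parent id, iterable of children); avoids recursion.
    stack = collections.deque([ (0,(root,)) ])
    # lintree collects (element id, parent id, label, payload) records.
    lintree, eid = collections.deque(), 0
    while stack:
        uid, focus = stack.pop()
        for element in focus:
            eid += 1
            if hasattr(focus, "keys"): # Support for dictionaries
                # Each dictionary item is stored as a (key, value) tuple.
                lintree.appendleft((eid, uid, 'C', "tuple"))
                stack.append((eid, (element, focus[element])))
            elif hasattr(element, "__iter__"):
                # Container: record its type name and visit its items later.
                # (In Python 2, str and unicode have no __iter__ attribute.)
                lintree.appendleft((eid, uid, 'C', itertable[type(element)]))
                stack.append((eid, element))
            else:
                # Scalar: look up its serializer and label in the whitelist.
                elementtype = type(element)
                serializefunc, _, label = supporteddict[elementtype]
                lintree.appendleft((eid, uid, label, serializefunc(element)))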

    return '\n'.join(str(element) for entry in lintree for element in entry)

def deserialize(text):
    """
    Deserializes data generated by the serialize function.

    >>> deserialize(serialize((1, 2, 3+4j, ['this', 'is', 'a', 'list'])))
    (1, 2, (3+4j), ['this', 'is', 'a', 'list'])
    """
    nodaldict = { 0: collections.deque() }
    text = text.split('\n')
    for quartet in xrange(0, len(text) - 1, 4):
        eid, pid = int(text[quartet]), int(text[quartet+1])
        moniker = text[quartet+2]
        if moniker == 'C':
            encapsulator = itertable[text[quartet+3]]
            appendage = encapsulator(nodaldict.get(eid, collections.deque()))
        else:
            deserializer = supporteddictinv[moniker]
            appendage = deserializer(text[quartet+3])
        nodaldict.setdefault(pid, collections.deque()).appendleft(appendage)

    return nodaldict[0].pop()

def test(supressoutput = False):
    testvectors = [
        list(((None, True, False), (1, 12341234123412341234123412341234L,0.5),
              0.12341234123412341234,
              u'This is\nan\tUnicode string\u0A0D\N{HANGUL SYLLABLE BYENH}',
              set(('A','B','D')),
            frozenset(tuple((1, '9', -0.12341234123412341234+1j, 'Y'))))),
        tuple(),
        list(),
        set(),
        frozenset(),
        'Element that is not nested.',
        {'a': (1, "Waka"), ('X', 'K'): u'CD', u'y':{ 1:'O', 2:('T','wo')}}]

    # Recursion not yet properly supported.
    #x = {'a': None, 'z': None}
    #y = {'x': x}
    #x['y'] = y
    #testvectors.extend([x,y])
    for root in testvectors:
        serialized = serialize(root)
        inverse = deserialize(serialized)
        if not supressoutput:
            print "Expected: ", repr(root)
            print "Returned: ", repr(inverse)
        if inverse != root:
            raise ValueError("The test failed.")

if __name__ == '__main__':
    print "Running test..."
    test()
    print "Test passed."
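
The text format is not described explicitly, but it can be read off the listing: every value becomes four consecutive lines holding an element id, the parent's id, a type label ('C' for containers) and a payload. The following is a minimal sketch of that layout, not part of the recipe. parse_records is a hypothetical helper, and the sample text was worked out by hand from the code above rather than captured from a run.

```python
def parse_records(text):
    """Split serialized text into (eid, pid, label, payload) tuples."""
    lines = text.split("\n")
    records = []
    # Step through the lines four at a time, mirroring deserialize's loop.
    for i in range(0, len(lines) - 1, 4):
        eid, pid = int(lines[i]), int(lines[i + 1])
        records.append((eid, pid, lines[i + 2], lines[i + 3]))
    return records


# Text that serialize((1, 'a')) should produce under Python 2 (assumption:
# derived by tracing the listing by hand, not from an actual run).
sample = "3\n1\nstr\na\n2\n1\nint\n1\n1\n0\nC\ntuple"
records = parse_records(sample)
for record in records:
    print(record)
```

Because serialize prepends each record with appendleft, children appear before the container that holds them, which is what lets deserialize build every container from items it has already collected.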

Discussion

These functions serialize fundamental Python data types. Both are implemented iteratively, without recursive calls, and neither has the security flaws that cPickle and pickle do. They work well when you do not need to serialize custom classes or objects with additional attributes. The code works on Python versions 2.4 to 2.7 (trunk). Beware that recursive (self-referential) data is not supported: serializing it causes an infinite loop.
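
Since self-referential input makes serialize loop forever, a caller may want to reject it up front. Below is a minimal sketch of such a guard, assuming only the container types the recipe supports. has_cycle is a hypothetical helper, not part of the recipe, and it uses an explicit stack in keeping with the recipe's non-recursive style.

```python
CONTAINERS = (list, tuple, set, frozenset, dict)


def has_cycle(root):
    """Return True if some container reachable from root contains itself."""
    path = set()             # ids of the containers on the current path
    stack = [(root, False)]  # (object, True when leaving the object)
    while stack:
        obj, leaving = stack.pop()
        if leaving:
            path.discard(id(obj))
            continue
        if not isinstance(obj, CONTAINERS):
            continue
        if id(obj) in path:
            return True      # obj is one of its own ancestors
        path.add(id(obj))
        stack.append((obj, True))  # marker popped after all descendants
        if isinstance(obj, dict):
            children = list(obj.keys()) + list(obj.values())
        else:
            children = list(obj)
        for child in children:
            stack.append((child, False))
    return False


# The recursive structure from the recipe's commented-out test case.
x = {'a': None, 'z': None}
y = {'x': x}
x['y'] = y
print(has_cycle(x))

# A shared but non-cyclic reference is accepted.
shared = [1]
print(has_cycle([shared, shared]))
```

Only containers on the current path are tracked, so an object referenced twice in different places is not mistaken for a cycle.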

To add support for a new type, add an entry to supporteddict whose key is the data type. In most cases the key is simply the name of the type (for example, float), but for some types you may need to obtain it with type(type_instance), as the recipe does with type(None). Each entry is a three-element tuple: the first element is the function that serializes the value, the second is the function that deserializes it, and the third is a string giving the type's name. Add the entry where supporteddict is defined, because supporteddictinv is built from it once when the module is loaded.
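
As an illustration of the three-element entry, here is a minimal sketch that registers decimal.Decimal. The two dictionaries below are cut-down stand-ins for the recipe's supporteddict and supporteddictinv, and the choice of str and Decimal as the conversion pair is an assumption, not something the recipe provides.

```python
from decimal import Decimal

# Stand-in for the recipe's table: type -> (serializer, deserializer, name).
supporteddict = {
    int: (repr, int, "int"),
}

# Hypothetical new entry: str() gives a one-line payload that Decimal()
# parses back exactly.
supporteddict[Decimal] = (str, Decimal, "Decimal")

# Build the inverted table after all entries are in place.
# (.values() is used here so the sketch also runs outside Python 2.)
supporteddictinv = dict(
    (name, func) for (_, func, name) in supporteddict.values())

value = Decimal("12.50")
serializefunc, _, label = supporteddict[type(value)]
payload = serializefunc(value)
restored = supporteddictinv[label](payload)
print(label, payload, restored == value)
```

Two constraints follow from the listing: the serialized payload must not contain a newline, since records are split on line breaks, and the type must not define __iter__, or serialize will treat it as a container.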

Change Log (Minor changes may not be noted):

2009.11.04 Changes by Gabriel Genellina:

  • __builtins__ is no longer used or imported

  • repr instead of str for all conversions

  • added an inverted dictionary (typename -> function) to ease deserialization (and avoid __builtins__ too).

  • A couple of additional test cases (more decimal places, long integers, unicode strings). Note that the test would fail if still using str instead of repr.

2009.11.03 Fixed a security issue with iterables and simplified "supporteddict."

2009.11.02 Added support for dictionaries.
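
The 2009.11.03 fix replaced a lookup of container names among the built-ins with a lookup in a fixed table. The sketch below contrasts the two approaches. It is a reconstruction based on the comments that follow, not the code of the earlier revision, and lookup_unsafe, lookup_whitelisted and ALLOWED are hypothetical names.

```python
try:
    import builtins                  # Python 3
except ImportError:
    import __builtin__ as builtins   # Python 2

# Fixed table of the only container constructors the input may name.
ALLOWED = dict((t.__name__, t) for t in (set, frozenset, list, tuple, dict))


def lookup_unsafe(name):
    # Resolves any built-in the input names, including eval.
    return getattr(builtins, name)


def lookup_whitelisted(name):
    # Raises KeyError for anything outside the table.
    return ALLOWED[name]


print(lookup_unsafe("eval"))
try:
    lookup_whitelisted("eval")
except KeyError as exc:
    print("rejected:", exc)
```

With the table, an unexpected name fails at lookup time, before anything named by the input is called.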

Comments

  1. At 2:35 a.m. on 3 Nov 2009, Gabriel Genellina said:

    You claim those functions "do not suffer from the security flaws that cPickle and pickle do".

    But since unserialize uses getattr(__builtin__, some_arbitrary_data) it looks to me as unsafe as pickle, or even worse.

  2. At 9:04 a.m. on 3 Nov 2009, Eric Pruitt (the author) said:

    I don't see the problem with the way I am using __builtin__. No function calls are made from built in and no objects can be accessed that aren't in the white list.

    I think you are referring to line 81:

    _, deserial, _ = supportedind[getattr(__builtin__, moniker)]
    

    No calls are being made with the function that is specified. getattr(...) is harmless in this case because I am not executing any code based on user input. It is run through the whitelist dictionary which contains safe deserialization methods.

  3. At 9:08 a.m. on 3 Nov 2009, Eric Pruitt (the author) said:

    You say it "looks" unsafe but provide no analysis or example of a flaw. On Linux, "rm -rf /" looks unsafe, but the command will have no effect when run by a normal user. If I am missing something, can you please show me an example of this code being exploited?

  4. At 1 a.m. on 4 Nov 2009, Gabriel Genellina said:

    No, I was thinking of the other place where __builtin__ is used (when moniker is 'C'). What if someone builds a quartet with 'eval' in the last item?

    unserialize('0\n0\nC\neval')
      File "d:\temp\serialize.py", line 79, in unserialize
        appendage = encapsulator(nodaldict.get(eid, collections.deque()))
      TypeError: eval() arg 1 must be a string or code object
    

    Although this is not a working example, it shows that eval is attempted. One should prove that no input string could lead to eval being successfully called with arbitrary arguments.

  5. At 2:54 a.m. on 4 Nov 2009, Eric Pruitt (the author) said:

    Thank you for pointing that out. After testing it further, it seems eval cannot be successfully executed because the data that is passed to the "encapsulator" function is always a deque() type.

    >>> deserialize('2\n1\nstr\nprint 666\n1\n0\nC\neval')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "simpleserialize.py", line 89, in deserialize
        appendage = encapsulator(nodaldict.get(eid, collections.deque()))
    TypeError: eval() arg 1 must be a string or code object
    

    For good measure, I have now implemented white list checking of anything that is allegedly an iterator:

    >>> deserialize('0\n0\nC\neval')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "../simpleserialize.py", line 89, in deserialize
        encapsulator = itertable[text[quartet+3]]
    KeyError: 'eval'
    

    Again, thanks for the input.

    Eric

  6. At 1:01 a.m. on 5 Nov 2009, Gabriel Genellina said:

    My previous comments aren't applicable to the current version of this recipe, and I'd remove them if I could. I don't consider this "unsafe" anymore (at least, no more "unsafe" than standard Python code)

  7. At 5:13 a.m. on 20 Nov 2009, Thomas said:

    I would like to use this recipe in a project of mine. I have only just registered with ActiveState, so I'm not familiar with the conventions around here. Is this code licensed under an open-source license?

    Thanks for the recipe.

    Cheers, Thomas
