Unpickling from an untrusted source, such as a network connection, can allow maliciously formed pickles to run arbitrary code.
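
To make the risk concrete, here is a minimal Python 3 sketch (the recipe itself targets Python 2). The class name `Payload` is hypothetical, and `os.getcwd` is a harmless stand-in for whatever an attacker would really want to call, such as `os.system`.

```python
import os
import pickle

class Payload:
    """Hypothetical stand-in for an object crafted by an attacker."""
    def __reduce__(self):
        # pickle stores "call this callable with these arguments" and
        # replays that call when the data is loaded.
        return (os.getcwd, ())

blob = pickle.dumps(Payload())
result = pickle.loads(blob)   # os.getcwd() runs here, chosen by the sender
print(result)
```

The receiver only called `pickle.loads`, yet a function named inside the data was executed. The recipe avoids this by never letting the data name anything callable.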

This recipe presents a simple way to serialize and deserialize basic Python types. Because only these basic types can be serialized, decoding data from an untrusted source is safer than with the pickle module.

NB: I've changed this recipe drastically. It used to use a rather slow string slicing technique, which was a very bad example of how to use strings in Python! The cStringIO module provides a faster, simpler replacement. This recipe now serializes faster than the pure-Python pickle module (but not cPickle).

NB: A Python 2.4 version of this recipe is available here: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/415791

Python, 130 lines
from types import IntType, TupleType, StringType, FloatType, LongType, ListType, DictType, NoneType
from struct import pack, unpack
from cStringIO import StringIO

class EncodeError(Exception):
    pass
class DecodeError(Exception):
    pass

# Dictionary of encoding functions, keyed by the type each one handles.
encoder = {}

def enc_dict_type(obj):
    data = "".join([encoder[type(i)](i) for i in obj.items()])
    return "%s%s%s" % ("D", pack("!L", len(data)), data)
encoder[DictType] = enc_dict_type

def enc_list_type(obj):
    data = "".join([encoder[type(i)](i) for i in obj])
    return "%s%s%s" % ("L", pack("!L", len(data)), data)
encoder[ListType] = enc_list_type

def enc_tuple_type(obj):
    data = "".join([encoder[type(i)](i) for i in obj])
    return "%s%s%s" % ("T", pack("!L", len(data)), data)
encoder[TupleType] = enc_tuple_type

def enc_int_type(obj):
    return "%s%s" % ("I", pack("!i", obj))
encoder[IntType] = enc_int_type

def enc_float_type(obj):
    return "%s%s" % ("F", pack("!f", obj))
encoder[FloatType] = enc_float_type

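def enc_long_type(obj):
    # "%x" keeps the sign and emits no "0x" prefix or "L" suffix,
    # so negative longs round-trip through dec_long_type.
    obj = "%x" % obj
    return "%s%s%s" % ("B", pack("!L", len(obj)), obj)
encoder[LongType] = enc_long_type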

def enc_string_type(obj):
    return "%s%s%s" % ("S", pack("!L", len(obj)), obj)
encoder[StringType] = enc_string_type

def enc_none_type(obj):
    return "N"
encoder[NoneType] = enc_none_type

def encode(obj):
    """Encode simple Python types into a binary string."""
    try:
        return encoder[type(obj)](obj)
    except KeyError, e:
        raise EncodeError, "Type not supported. (%s)" % e

# Dictionary of decoding functions, keyed by the one-character type prefix.
decoder = {}

def build_sequence(data, cast=list):
    size = unpack('!L', data.read(4))[0]
    items = []
    data_tell = data.tell
    data_read = data.read
    items_append = items.append
    start_position = data.tell()
    while (data_tell() - start_position) < size:
        T = data_read(1)
        value = decoder[T](data)
        items_append(value)
    return cast(items)

def dec_tuple_type(data):
    return build_sequence(data, cast=tuple)
decoder["T"] = dec_tuple_type

def dec_list_type(data):
    return build_sequence(data, cast=list)
decoder["L"] = dec_list_type

def dec_dict_type(data):
    return build_sequence(data, cast=dict)
decoder["D"] = dec_dict_type

def dec_long_type(data):
    size = unpack('!L', data.read(4))[0]
    value = long(data.read(size),16)
    return value
decoder["B"] = dec_long_type

def dec_string_type(data):
    size = unpack('!L', data.read(4))[0]
    value = str(data.read(size))
    return value
decoder["S"] = dec_string_type

def dec_float_type(data):
    value = unpack('!f', data.read(4))[0]
    return value
decoder["F"] = dec_float_type

def dec_int_type(data):
    value = unpack('!i', data.read(4))[0]
    return value
decoder['I'] = dec_int_type

def dec_none_type(data):
    return None
decoder['N'] = dec_none_type

def decode(data):
    """
    Decode a binary string into the original Python types.
    """
    buffer = StringIO(data)
    try:
        value = decoder[buffer.read(1)](buffer)
    except KeyError, e:
        raise DecodeError, "Type prefix not supported. (%s)" % e
    return value


if __name__ == "__main__":
    value = [None,["simon","wittber"],(1,2),{1:2.1,3:4.3},999999999999999999999999999999999999999]
    data = encode(value)
    print data
    x = decode(data)
    for item in zip(value,x):
        print item[0],"---",item[1]
    print "-" * 10
    print x
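
The listing above is Python 2 only. As a rough illustration of the same framing (a one-byte type tag, then a 4-byte big-endian length for variable-length values, then the payload), here is a minimal Python 3 sketch. It covers only None, 32-bit ints, str and list. The names `sketch_encode` and `sketch_decode` are hypothetical, and encoding strings as UTF-8 is an assumption that is not part of the recipe.

```python
import io
from struct import pack, unpack

def sketch_encode(obj):
    """Encode None, 32-bit ints, str and lists as tag + length + payload."""
    if obj is None:
        return b"N"
    if isinstance(obj, bool):
        raise TypeError("bool is not part of this sketch")
    if isinstance(obj, int):
        return b"I" + pack("!i", obj)        # struct.error if outside 32 bits
    if isinstance(obj, str):
        raw = obj.encode("utf-8")            # assumption: text is sent as UTF-8
        return b"S" + pack("!L", len(raw)) + raw
    if isinstance(obj, list):
        body = b"".join(sketch_encode(item) for item in obj)
        return b"L" + pack("!L", len(body)) + body
    raise TypeError("type not supported: %r" % type(obj))

def _read_exact(stream, n):
    """Read exactly n bytes or fail, so truncated input is rejected."""
    chunk = stream.read(n)
    if len(chunk) != n:
        raise ValueError("truncated input")
    return chunk

def _decode_from(stream):
    tag = _read_exact(stream, 1)
    if tag == b"N":
        return None
    if tag == b"I":
        return unpack("!i", _read_exact(stream, 4))[0]
    if tag == b"S":
        size = unpack("!L", _read_exact(stream, 4))[0]
        return _read_exact(stream, size).decode("utf-8")
    if tag == b"L":
        size = unpack("!L", _read_exact(stream, 4))[0]
        end = stream.tell() + size
        items = []
        while stream.tell() < end:
            items.append(_decode_from(stream))
        return items
    raise ValueError("unknown type tag: %r" % tag)

def sketch_decode(data):
    """Decode bytes produced by sketch_encode."""
    return _decode_from(io.BytesIO(data))

data = sketch_encode([None, ["simon", "wittber"], 1])
print(data)
print(sketch_decode(data))
```

The decoder only ever dispatches on a fixed set of tags, which is the property that makes this style of format safer than pickle: the data cannot name code to run.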

Warning: This recipe can only serialize None, integers, longs, floats, strings, tuples, lists and dictionaries. Floats and ints are each packed into 4 bytes, so floats are stored in single precision and ints must fit in a signed 32-bit field.
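
The practical effect of the 4-byte fields can be seen with the struct module directly. This is a small Python 3 sketch; the `"!d"` format is shown only for comparison and is not what the recipe uses.

```python
from struct import pack, unpack, error as StructError

value = 2.1
single = unpack("!f", pack("!f", value))[0]   # 4-byte float, as in the recipe
double = unpack("!d", pack("!d", value))[0]   # 8-byte float, for comparison

print(single == value)   # False: single precision cannot hold 2.1 exactly
print(double == value)   # True

try:
    pack("!i", 2**31)    # one past the largest signed 32-bit integer
except StructError as exc:
    print("cannot pack:", exc)
```

Values such as 2.1 therefore come back slightly changed after a round trip, and an int outside the signed 32-bit range cannot be packed at all.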

8 comments

Nicolas Lehuen 18 years, 10 months ago

What about the marshal module? Why not use the standard marshal module?

http://www.python.org/doc/2.4.1/lib/module-marshal.html

S W (author) 18 years, 10 months ago

marshal has the same vulnerabilities as pickle. From http://python.org/doc/2.4/lib/module-marshal.html:

Warning: The marshal module is not intended to be secure against erroneous or maliciously constructed data. Never unmarshal data received from an untrusted or unauthenticated source.

That's why! :-)

Andrew Bennetts 18 years, 10 months ago

Reminds me of Moshe Zadka's unrepr. http://twistedmatrix.com/~moshez/unrepr.py

I wonder if using the same getattr trick could make this recipe much shorter...

Michael 18 years, 10 months ago

this might be really naive... but for simple types... what's wrong w/ repr() and eval()? they're portable AND readable, right?

S W (author) 18 years, 10 months ago

repr and eval. Portable and Readable, yes. Secure, no.

For example, what does eval("__import__('os').rmdir('/')") return?

This recipe is designed to decode data coming from untrusted, possibly malicious network connections. It does this without running possibly dangerous eval statements.
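
For readers who like the readability of repr, later Python versions (2.6 onward) added `ast.literal_eval` to the standard library. It was not available when this discussion took place and is not part of the recipe; the sketch below (Python 3) only shows how it differs from `eval`.

```python
import ast

original = [None, ["simon", "wittber"], (1, 2), {1: 2.1, 3: 4.3}]
text = repr(original)                # readable, portable text form
restored = ast.literal_eval(text)    # accepts literal syntax only
print(restored == original)

try:
    # A function call is not a literal, so this is refused, not executed.
    ast.literal_eval("__import__('os').getcwd()")
except ValueError as exc:
    print("rejected:", exc)
```

The Python documentation notes that very large or deeply nested input can still exhaust memory or the stack, so `ast.literal_eval` is not a complete defence against hostile data either.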

Oren Tirosh 18 years, 10 months ago

bool, JSON. One builtin type missing from this recipe is bool. We're not in Python 2.2 any more!

With bool added, the data model described by this serialization format is nearly equivalent to that described by JSON (http://www.json.org).
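
The "nearly" matters in practice. A short Python 3 sketch using the standard `json` module (not part of the recipe) shows where the two data models differ: JSON has no tuple type, and object keys are always strings.

```python
import json

value = [None, ["simon", "wittber"], (1, 2), {1: 2.1, 3: 4.3}, True]
restored = json.loads(json.dumps(value))
print(restored)
```

After the round trip the tuple has become a list and the integer dictionary keys have become the strings "1" and "3", whereas the recipe's format preserves both.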

Connelly Barnes 18 years, 3 months ago

Alternative. I created a module named rencode, which is based on bencode. This handles floats, dicts with any serializable keys, bools, None, and can safely decode from untrusted data sources. For complex, heterogeneous data structures with many small elements, the serialized strings are significantly smaller than those generated by bencode, gherkin, and this Recipe. It is also faster than this Recipe.

Available: http://barnesc.blogspot.com/2006/01/rencode-reduced-length-encodings.html

Connelly Barnes 17 years, 8 months ago

Bug. As of version 1.6, this recipe cannot encode and then decode -2**31.
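
A failure like this is what one would expect on a platform where -2**31 is a long and its hex text is produced by slicing the output of `hex()`: the slice assumes the string starts with "0x", but a negative number starts with "-0x". The Python 3 sketch below illustrates the effect. Python 3 has no trailing "L", so the slice here is `[2:]` rather than `[2:-1]`.

```python
n = -2**31

sliced = hex(n)[2:]        # hex(n) is "-0x80000000", so this leaves "x80000000"
try:
    int(sliced, 16)
except ValueError:
    print("cannot parse", repr(sliced))

signed = format(n, "x")    # "-80000000": sign kept, no prefix to strip
print(int(signed, 16) == n)
```

Formatting the number with a sign-preserving hex format, instead of slicing, lets negative values survive the round trip.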