Unpickling from an untrusted source, such as a network connection, can allow maliciously formed pickles to run arbitrary code.
This recipe presents a simple way to serialize and deserialize simple Python types. Because only simple types can be serialized, decoding never instantiates arbitrary objects, which makes this approach safer than the pickle module for untrusted data.
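To make the pickle risk concrete, here is a minimal sketch written for Python 3 (the recipe itself targets Python 2). It builds a pickle that calls a function of the sender's choosing when it is loaded. The `Payload` class is a hypothetical name used only for illustration, and the payload is deliberately harmless.

```python
import pickle


class Payload:
    """Hypothetical attacker-controlled object (illustration only)."""

    def __reduce__(self):
        # pickle records "call eval with this argument"; the receiver
        # performs that call inside loads(). A hostile sender could name
        # any importable callable here instead.
        return (eval, ("1 + 1",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # runs eval("1 + 1") on the receiving side
print(result)  # 2
```

The receiver never needs the `Payload` class; the bytes alone are enough to trigger the call.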
NB: I've changed this recipe drastically. It used to use a rather slow string slicing technique, which was a very bad example of how to use strings in Python! The cStringIO module provides a faster, simpler replacement. This recipe now serializes faster than the pickle module (not cPickle).
NB: A Python 2.4 version of this recipe is available here: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/415791
from types import IntType, TupleType, StringType, FloatType, LongType, ListType, DictType, NoneType
from struct import pack, unpack
from cStringIO import StringIO
class EncodeError(Exception):
    pass
class DecodeError(Exception):
    pass
#contains dictionary of coding functions, where the dictionary key is the type.
encoder = {}
def enc_dict_type(obj):
    data = "".join([encoder[type(i)](i) for i in obj.items()])
    return "%s%s%s" % ("D", pack("!L", len(data)), data)
encoder[DictType] = enc_dict_type
def enc_list_type(obj):
    data = "".join([encoder[type(i)](i) for i in obj])
    return "%s%s%s" % ("L", pack("!L", len(data)), data)
encoder[ListType] = enc_list_type
def enc_tuple_type(obj):
    data = "".join([encoder[type(i)](i) for i in obj])
    return "%s%s%s" % ("T", pack("!L", len(data)), data)
encoder[TupleType] = enc_tuple_type
def enc_int_type(obj):
    return "%s%s" % ("I", pack("!i", obj))
encoder[IntType] = enc_int_type
def enc_float_type(obj):
    return "%s%s" % ("F", pack("!f", obj))
encoder[FloatType] = enc_float_type
def enc_long_type(obj):
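    # "%x" keeps the sign and adds no "0x" or "L" decoration, so negative
    # longs round-trip; slicing hex(obj)[2:-1] mangled them.
    obj = "%x" % obj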
    return "%s%s%s" % ("B", pack("!L", len(obj)), obj)
encoder[LongType] = enc_long_type
def enc_string_type(obj):
    return "%s%s%s" % ("S", pack("!L", len(obj)), obj)
encoder[StringType] = enc_string_type
def enc_none_type(obj):
    return "N"
encoder[NoneType] = enc_none_type
def encode(obj):
    """Encode simple Python types into a binary string."""
    try:
        return encoder[type(obj)](obj)
    except KeyError, e:
        raise EncodeError, "Type not supported. (%s)" % e
#contains dictionary of decoding functions, where the dictionary key is the type prefix used.
decoder = {}
def build_sequence(data, cast=list):
    size = unpack('!L', data.read(4))[0]
    items = []
    data_tell = data.tell
    data_read = data.read
    items_append = items.append
    start_position = data.tell()
    while (data_tell() - start_position) < size:
        T = data_read(1)
        value = decoder[T](data)
        items_append(value)
    return cast(items)
def dec_tuple_type(data):
    return build_sequence(data, cast=tuple)
decoder["T"] = dec_tuple_type
def dec_list_type(data):
    return build_sequence(data, cast=list)
decoder["L"] = dec_list_type
def dec_dict_type(data):
    return build_sequence(data, cast=dict)
decoder["D"] = dec_dict_type
def dec_long_type(data):
    size = unpack('!L', data.read(4))[0]
    value = long(data.read(size),16)
    return value
decoder["B"] = dec_long_type
def dec_string_type(data):
    size = unpack('!L', data.read(4))[0]
    value = str(data.read(size))
    return value
decoder["S"] = dec_string_type
def dec_float_type(data):
    value = unpack('!f', data.read(4))[0]
    return value
decoder["F"] = dec_float_type
def dec_int_type(data):
    value = unpack('!i', data.read(4))[0]
    return value
decoder['I'] = dec_int_type
def dec_none_type(data):
    return None
decoder['N'] = dec_none_type
def decode(data):
    """
    Decode a binary string into the original Python types.
    """
    buffer = StringIO(data)
    try:
        value = decoder[buffer.read(1)](buffer)
    except KeyError, e:
        raise DecodeError, "Type prefix not supported. (%s)" % e
    return value
if __name__ == "__main__":
    value = [None,["simon","wittber"],(1,2),{1:2.1,3:4.3},999999999999999999999999999999999999999]
    data = encode(value)
    print data
    x = decode(data)
    for item in zip(value,x):
        print item[0],"---",item[1]
    print "-" * 10
    print x
Warning: This recipe can only serialize None, integers, longs, floats, strings, tuples, lists and dictionaries. It also packs floats and ints into 4 bytes each, so floats are rounded to single precision and ints must fit in a signed 32-bit value.
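The 4-byte limits can be checked directly with the struct module. The sketch below is written for Python 3 and only illustrates the `"!f"` and `"!i"` formats the recipe uses; the `"!d"` comparison is an assumption about one possible alternative, not part of the recipe's wire format.

```python
from struct import pack, unpack, error

# "!f" is a 4-byte IEEE 754 single, so a Python float (a double) is rounded.
single = unpack("!f", pack("!f", 2.1))[0]

# "!d" is an 8-byte double and round-trips exactly.
double = unpack("!d", pack("!d", 2.1))[0]

# "!i" is a signed 32-bit integer; values outside that range are rejected.
try:
    pack("!i", 2**31)
    int_overflow = False
except error:
    int_overflow = True

print(single == 2.1, double == 2.1, int_overflow)  # False True True
```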

What about the marshal module? Why not use the standard marshal module?
http://www.python.org/doc/2.4.1/lib/module-marshal.html
marshal has the same vulnerabilities as pickle. From http://python.org/doc/2.4/lib/module-marshal.html:
Warning: The marshal module is not intended to be secure against erroneous or maliciously constructed data. Never unmarshal data received from an untrusted or unauthenticated source.
That's why! :-)
Reminds me of Moshe Zadka's unrepr. http://twistedmatrix.com/~moshez/unrepr.py
I wonder if using the same getattr trick could make this recipe much shorter...
this might be really naive... but for simple types... what's wrong w/ repr() and eval()? they're portable AND readable, right?
repr and eval. Portable and Readable, yes. Secure, no.
For example, what does eval("import os; os.rmdir('/')") return?
This recipe is designed to decode data coming from untrusted, possibly malicious network connections. It does this without running possibly dangerous eval statements.
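A small Python 3 sketch of the difference. `eval` accepts only expressions, so a statement such as `import os` would raise a SyntaxError, but an expression built on `__import__` makes the same point. The string used here is a harmless stand-in for a hostile one. `ast.literal_eval` is shown as a literal-only parser from the standard library; it is not part of this recipe, and it is not hardened against very large or deeply nested input.

```python
import ast

# Harmless stand-in for a hostile string received from the network.
text = "__import__('os').getcwd()"

# eval() evaluates any expression, including calls into the os module.
cwd = eval(text)

# ast.literal_eval() accepts only literals (numbers, strings, tuples,
# lists, dicts, sets, booleans, None) and refuses function calls.
try:
    ast.literal_eval(text)
    rejected = False
except ValueError:
    rejected = True

# Plain data in repr() form still parses.
safe_value = ast.literal_eval("[None, ('simon', 'wittber'), {1: 2.1}]")
print(rejected, safe_value)
```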
bool, JSON. One builtin type missing from this recipe is bool. We're not in Python 2.2 any more!
With bool added, the data model described by this serialization format is nearly equivalent to that described by JSON (http://www.json.org).
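As a rough illustration of that point, here is a minimal Python 3 sketch of the same prefix/length/payload idea with bool added. It is not the recipe's code: the `"?"` tag, the UTF-8 string encoding, and the flattened key/value layout for dicts (the recipe wraps each pair in a tuple) are all assumptions made for brevity.

```python
from io import BytesIO
from struct import pack, unpack


def encode(obj):
    """Encode None, bool, 32-bit int, str, list and dict as bytes."""
    kind = type(obj)
    if obj is None:
        return b"N"
    if kind is bool:  # exact type check, so True is not treated as the int 1
        return b"?" + (b"\x01" if obj else b"\x00")
    if kind is int:
        return b"I" + pack("!i", obj)
    if kind is str:
        raw = obj.encode("utf-8")
        return b"S" + pack("!L", len(raw)) + raw
    if kind is list:
        body = b"".join(encode(item) for item in obj)
        return b"L" + pack("!L", len(body)) + body
    if kind is dict:
        body = b"".join(encode(k) + encode(v) for k, v in obj.items())
        return b"D" + pack("!L", len(body)) + body
    raise TypeError("Type not supported: %r" % kind)


def _decode(buf):
    tag = buf.read(1)
    if tag == b"N":
        return None
    if tag == b"?":
        return buf.read(1) == b"\x01"
    if tag == b"I":
        return unpack("!i", buf.read(4))[0]
    if tag == b"S":
        size = unpack("!L", buf.read(4))[0]
        return buf.read(size).decode("utf-8")
    if tag in (b"L", b"D"):
        size = unpack("!L", buf.read(4))[0]
        end = buf.tell() + size
        items = []
        while buf.tell() < end:
            items.append(_decode(buf))
        if tag == b"L":
            return items
        # Dict payload is key, value, key, value, ...
        return dict(zip(items[0::2], items[1::2]))
    raise ValueError("Type prefix not supported: %r" % tag)


def decode(data):
    """Decode bytes produced by encode(); only data is ever constructed."""
    return _decode(BytesIO(data))


value = [None, True, False, 1, "simon", {"a": 1, "b": [2, 3]}]
assert decode(encode(value)) == value
```

Dispatching on the exact type matters because `bool` is a subclass of `int`; an `isinstance(obj, int)` check placed first would silently turn `True` into `1`.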
Alternative. I created a module named rencode, which is based on bencode. This handles floats, dicts with any serializable keys, bools, None, and can safely decode from untrusted data sources. For complex, heterogeneous data structures with many small elements, the serialized strings are significantly smaller than those generated by bencode, gherkin, and this Recipe. It is also faster than this Recipe.
Available: http://barnesc.blogspot.com/2006/01/rencode-reduced-length-encodings.html
Bug. As of version 1.6, this recipe cannot encode and then decode -2**31.
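A likely cause, assuming a 32-bit build where `-2**31` evaluates to a long: slicing the output of `hex()` to drop the `0x` prefix and the `L` suffix removes the wrong characters once the string starts with a minus sign. The sketch below reproduces the slicing problem in Python 3 (where `hex()` has no trailing `L`) and shows one sign-preserving alternative, `format(n, "x")`. Treat it as an illustration, not as the author's fix.

```python
n = -2**31

# hex() puts the sign before the "0x" prefix, so cutting two characters
# off the front leaves a stray "x". (Python 2 longs also carried a trailing
# "L", which a slice such as [2:-1] was meant to remove.)
sliced = hex(n)[2:]

# format(n, "x") writes the sign and hex digits only, with nothing to strip.
signed = format(n, "x")

try:
    int(sliced, 16)
    sliced_ok = True
except ValueError:
    sliced_ok = False

print(sliced, signed, sliced_ok, int(signed, 16) == n)
# x80000000 -80000000 False True
```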