A serialization library for some of Python's more basic types. It does not suffer from the security flaws that cPickle and pickle do.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = "James Eric Pruitt"
__all__ = [ "serialize", "deserialize" ]
__version__ = "2009.11.04"
import collections

# Map each supported container type to its name, and each name back to its type.
itertable = {}
for t in [set, frozenset, list, tuple, dict]:
    itertable[t] = t.__name__
    itertable[t.__name__] = t

# type -> (serialize function, deserialize function, type label)
supporteddict = {
    int: (repr, int, "int"),
    long: (repr, long, "long"),
    bool: (repr, lambda s: s == "True", "bool"),
    # repr() wraps most complex values in parentheses, but not pure
    # imaginary ones such as 1j, so strip parentheses instead of slicing.
    complex: (repr, lambda s: complex(s.strip("()")), "complex"),
    float: (repr, float, "float"),
    str: (
        lambda s: s.encode("string-escape"),
        lambda s: s.decode("string-escape"), "str"),
    unicode: (
        lambda s: s.encode("unicode-escape"),
        lambda s: s.decode("unicode-escape"), "unicode"),
    type(None): (repr, lambda s: None, "None"),  # None is a special case;
}                                                # type(None) != None

# inverted dictionary
supporteddictinv = dict(
    (name,func) for (_,func,name) in supporteddict.itervalues())
def serialize(root):
    """
    Serializes some of the fundamental data types in Python.

    Serialization function designed to not possess the same security flaws
    as the cPickle and pickle modules. At present, the following data types
    are supported:

        set, frozenset, list, tuple, dict, int, long, bool, complex, float,
        None, str, unicode

    To convert the serialized object back into a Python object, pass the text
    through the deserialize function.

    >>> deserialize(serialize((1, 2, 3+4j, ['this', 'is', 'a', 'list'])))
    (1, 2, (3+4j), ['this', 'is', 'a', 'list'])
    """
    stack = collections.deque([ (0,(root,)) ])
    lintree, eid = collections.deque(), 0
    while stack:
        uid, focus = stack.pop()
        for element in focus:
            eid += 1
            if hasattr(focus, "keys"): # Support for dictionaries
                lintree.appendleft((eid, uid, 'C', "tuple"))
                stack.append((eid, (element, focus[element])))
            elif hasattr(element, "__iter__"):
                lintree.appendleft((eid, uid, 'C', itertable[type(element)]))
                stack.append((eid, element))
            else:
                elementtype = type(element)
                serializefunc, _, label = supporteddict[elementtype]
                lintree.appendleft((eid, uid, label, serializefunc(element)))
    return '\n'.join(str(element) for entry in lintree for element in entry)
def deserialize(text):
    """
    Deserializes data generated by the serialize function.

    >>> deserialize(serialize((1, 2, 3+4j, ['this', 'is', 'a', 'list'])))
    (1, 2, (3+4j), ['this', 'is', 'a', 'list'])
    """
    nodaldict = { 0: collections.deque() }
    text = text.split('\n')
    lastpid = int(text[1])
    for quartet in xrange(0, len(text) - 1, 4):
        eid, pid = int(text[quartet]), int(text[quartet+1])
        moniker = text[quartet+2]
        if moniker == 'C':
            encapsulator = itertable[text[quartet+3]]
            appendage = encapsulator(nodaldict.get(eid, collections.deque()))
        else:
            deserializer = supporteddictinv[moniker]
            appendage = deserializer(text[quartet+3])
        nodaldict.setdefault(pid, collections.deque()).appendleft(appendage)
    return nodaldict[0].pop()
def test(supressoutput = False):
    testvectors = [
        list(((None, True, False), (1, 12341234123412341234123412341234L,0.5),
              0.12341234123412341234,
              u'This is\nan\tUnicode string\u0A0D\N{HANGUL SYLLABLE BYENH}',
              set(('A','B','D')),
              frozenset(tuple((1, '9', -0.12341234123412341234+1j, 'Y'))))),
        tuple(),
        list(),
        set(),
        frozenset(),
        'Element that is not nested.',
        {'a': (1, "Waka"), ('X', 'K'): u'CD', u'y':{ 1:'O', 2:('T','wo')}}]

    # Recursion not yet properly supported.
    #x = {'a': None, 'z': None}
    #y = {'x': x}
    #x['y'] = y
    #testvectors.extend([x,y])

    for root in testvectors:
        serialized = serialize(root)
        inverse = deserialize(serialized)
        if not supressoutput:
            print "Expected: ", repr(root)
            print "Returned: ", repr(inverse)
        if (inverse != root):
            raise ValueError, "The test failed."

if __name__ == '__main__':
    print "Running test..."
    test()
    print "Test passed."
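Reading the listing above, the serialized text appears to be a flat sequence of four-line records: element id, parent id, type label ('C' for a container), and either the container's type name or the escaped value. The sketch below only splits such text into records. It is written for Python 3, unlike the recipe itself; split_quartets is a hypothetical helper, and the sample string was hand-traced from the listing for the input (1, 'a') under Python 2, so treat it as an illustration rather than captured output.

```python
def split_quartets(text):
    """Group serialized text into (eid, pid, label, payload) records."""
    lines = text.split("\n")
    # Same stepping as the recipe's deserialize loop: one record per 4 lines.
    return [
        (int(lines[i]), int(lines[i + 1]), lines[i + 2], lines[i + 3])
        for i in range(0, len(lines) - 1, 4)
    ]


# Hand-traced (assumed) output of serialize((1, 'a')) under Python 2.
sample = "3\n1\nstr\na\n2\n1\nint\n1\n1\n0\nC\ntuple"
records = split_quartets(sample)
for record in records:
    print(record)
```

Children appear before their containers, which is why deserialize can build each container from elements it has already collected under that container's id.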
These functions serialize fundamental Python data types. Both are non-recursive (they walk the data with an explicit stack instead of recursive calls) and do not have the security flaws that cPickle and pickle do. They work well when you do not need to serialize custom classes or objects with additional attributes. The code works on Python versions 2.4 to 2.7 (trunk). Beware that serializing recursive (self-referencing) data does not work and causes an infinite loop.
To add support for a new type, add an entry to supporteddict whose key is the data type. In most cases the key is just the name of the data type, but for others you may need to obtain it using type(typename) or type(type_instance), as with type(None). The entry is a three-element tuple: the first element is the function to call to serialize the data type, the second is a function to deserialize the data, and the third is a string giving the type's name.
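Because self-referencing data makes serialize loop forever, a caller may want to check for cycles first. The following is a minimal sketch of one such check, not part of the recipe: has_cycle is a hypothetical helper, written for Python 3, that walks the supported container types with an explicit stack and tracks which containers are on the current path.

```python
def has_cycle(root):
    """Return True if some container reachable from root contains itself."""
    containers = (list, tuple, set, frozenset, dict)
    on_path = set()          # ids of containers on the current path
    stack = [(root, False)]  # (object, leaving) pairs
    while stack:
        obj, leaving = stack.pop()
        if not isinstance(obj, containers):
            continue
        if leaving:
            # All children of obj have been visited; it leaves the path.
            on_path.discard(id(obj))
            continue
        if id(obj) in on_path:
            return True
        on_path.add(id(obj))
        stack.append((obj, True))  # marker popped after the children
        if isinstance(obj, dict):
            children = list(obj.keys()) + list(obj.values())
        else:
            children = list(obj)
        stack.extend((child, False) for child in children)
    return False


# The recursive structure from the recipe's commented-out test case.
x = {'a': None, 'z': None}
y = {'x': x}
x['y'] = y
print(has_cycle(x))
print(has_cycle([1, (2, 3), {'k': [4]}]))
```

An object that is merely shared (referenced twice without containing itself) is not reported, since only containers on the current path count.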
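As an illustration of such an entry, the sketch below registers decimal.Decimal in a small stand-in table. It is written for Python 3 and is not the recipe's code: the trimmed supporteddict, the choice of Decimal, and the roundtrip helper are assumptions made for the example. Note that the inverted dictionary has to be built after the new entry exists, and that the serialized text must not contain a newline, since the format is line based.

```python
from decimal import Decimal

# Stand-in for the recipe's table: type -> (serialize, deserialize, label).
supporteddict = {
    int: (repr, int, "int"),
    float: (repr, float, "float"),
}

# The new entry: str() to serialize, Decimal() to deserialize, plus a label.
supporteddict[Decimal] = (str, Decimal, "Decimal")

# Rebuild the label -> deserializer mapping so it includes the new entry.
supporteddictinv = dict(
    (name, func) for (_, func, name) in supporteddict.values())


def roundtrip(value):
    """Serialize one scalar with its table entry, then deserialize it."""
    serializefunc, _, label = supporteddict[type(value)]
    text = serializefunc(value)
    return supporteddictinv[label](text)


print(roundtrip(Decimal("1.10")))
```

If the entry is added directly to the dictionary literal in the recipe's source, the inverted dictionary picks it up automatically when the module is loaded.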
Change Log (minor changes may not be noted):
2009.11.04 Changes by Gabriel Genellina:
  - __builtins__ is no longer used or imported.
  - repr is used instead of str for all conversions.
  - Added an inverted dictionary (type name -> function) to ease deserialization (and avoid __builtins__ too).
  - A couple of additional test cases (more decimal places, long integers, unicode strings). Note that the test would fail if still using str instead of repr.
2009.11.03 Fixed a security issue with iterables and simplified "supporteddict".
2009.11.02 Added support for dictionaries.
You claim those functions "do not suffer from the security flaws that cPickle and pickle do".
But since unserialize uses getattr(__builtin__, some_arbitrary_data) it looks to me as unsafe as pickle, or even worse.
I don't see the problem with the way I am using __builtin__. No functions from __builtin__ are called, and no objects can be accessed that aren't in the whitelist.
I think you are referring to line 81:
No calls are being made with the function that is specified. getattr(...) is harmless in this case because I am not executing any code based on user input. It is run through the whitelist dictionary which contains safe deserialization methods.
You say it "looks" unsafe but provide no analysis or example of a flaw. On Linux, "rm -rf /" looks unsafe, but the command will have no effect when run by a normal user. If I am missing something, can you please show me an example of this code being exploited?
No, I was thinking of the other place where __builtin__ is used (when moniker is 'C'). What if someone builds a quartet with 'eval' in the last item?
Although this is not a working example, it shows that eval is attempted. One should prove that no input string could lead to eval being successfully called with arbitrary arguments.
Thank you for pointing that out. After testing it further, it seems eval cannot be successfully executed because the data that is passed to the "encapsulator" function is always a deque() type.
For good measure, I have now implemented white list checking of anything that is allegedly an iterator:
Again, thanks for the input.
Eric
My previous comments aren't applicable to the current version of this recipe, and I'd remove them if I could. I don't consider this "unsafe" anymore (at least, no more "unsafe" than standard Python code)
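The whitelist idea discussed in this thread can be sketched as follows. This is an illustration modeled on the itertable dictionary in the listing, written for Python 3, and not the author's exact code: build_container is a hypothetical helper, and raising ValueError is a choice of this sketch (a plain dictionary lookup, as in the listing, would raise KeyError instead).

```python
import collections

# Only these container names may appear in serialized input.
itertable = {}
for t in [set, frozenset, list, tuple, dict]:
    itertable[t.__name__] = t


def build_container(name, items):
    """Look the name up in the whitelist; never resolve it via builtins."""
    try:
        encapsulator = itertable[name]
    except KeyError:
        raise ValueError("unsupported container type: %r" % (name,))
    return encapsulator(items)


print(build_container("tuple", collections.deque([1, 2])))
try:
    build_container("eval", collections.deque(["1+1"]))
except ValueError as error:
    print("rejected:", error)
```

Because the name is only ever used as a dictionary key, a record naming 'eval' or any other builtin fails at the lookup and nothing is called.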
I would like to use this recipe in a project of mine. I just registered on ActiveState, so I'm not familiar with the conventions here. Is this code licensed under any open-source license?
Thanks for the recipe.
Cheers, Thomas
@Thomas: You may try to contact the author. The MIT license applies:
http://code.activestate.com/help/terms/
Thomas, you are more than welcome to use my code, and as Gabriel stated, the MIT license is site-wide. If you need my e-mail address, Google me on the Python dev-list.