Smaller, faster pickles! Eliminates unused PUT opcodes.
from pickletools import genops

def optimize(p):
    'Optimize a pickle string by removing unused PUT opcodes'
    gets = set()            # set of args used by a GET opcode
    puts = []               # (arg, startpos, stoppos) for the PUT opcodes
    prevpos = None          # set to pos if previous opcode was a PUT
    for opcode, arg, pos in genops(p):
        if prevpos is not None:
            # The current opcode starts where the previous PUT ended
            puts.append((prevarg, prevpos, pos))
            prevpos = None
        if 'PUT' in opcode.name:
            prevarg, prevpos = arg, pos
        elif 'GET' in opcode.name:
            gets.add(arg)

    # Copy the pickle string except for PUTS without a corresponding GET
    s = []
    i = 0
    for arg, start, stop in puts:
        j = stop if (arg in gets) else start
        s.append(p[i:j])
        i = stop
    s.append(p[i:])
    return b''.join(s)      # b'' is a plain str on Python 2, bytes on Python 3

if __name__ == '__main__':
    from pickle import dumps
    from pickletools import dis

    # Protocol 0 (the Python 2 default) emits a PUT after every memoizable object
    p = dumps(['the', 'quick', 'brown', 'fox'], 0)
    print('Before:')
    dis(p)
    print('\nAfter:')
    dis(optimize(p))
The pickler is designed to conserve memory by writing its output directly to a file as the pickle is generated. When it writes a potentially reusable object, it cannot yet know whether that object will be referenced again later. To allow for that possibility, the pickler has to emit a PUT opcode for each such object. This makes the pickle unnecessarily fat.
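To see how many PUT opcodes go unused, the opcodes of a pickle can be counted with genops. The following is a minimal sketch, not part of the recipe: the helper name count_memo_opcodes and the sample data are hypothetical, and protocol 2 is chosen as an assumption because protocols 4 and later record memo entries with a MEMOIZE opcode instead of PUT.

```python
import pickle
from pickletools import genops

def count_memo_opcodes(data):
    """Return (puts, gets): counts of PUT-family and GET-family opcodes."""
    puts = gets = 0
    for opcode, arg, pos in genops(data):
        if 'PUT' in opcode.name:
            puts += 1
        elif 'GET' in opcode.name:
            gets += 1
    return puts, gets

# Hypothetical sample data: only `shared` is actually referenced twice.
shared = ['fox']
# Protocol 2 is an assumption for illustration; it still uses PUT opcodes.
data = pickle.dumps(['the', 'quick', shared, shared], protocol=2)

puts, gets = count_memo_opcodes(data)
print('PUT opcodes: %d, GET opcodes: %d' % (puts, gets))
```

Every PUT whose argument never shows up in a GET is dead weight that the recipe can strip.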
Fat pickles suck. They consume disk space if you write them to a file. They take extra transmission time if you send them across a network. And worse, fat pickles take more time and memory to unpickle. Unnecessary PUT opcodes cost you at both the sending and receiving ends.
To use the recipe, write "optimize(dumps(obj))" wherever you would have written "dumps(obj)".
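A round trip might look like the sketch below. To keep the snippet self-contained it uses the standard library's pickletools.optimize as a stand-in for the recipe's optimize function; the call has the same shape either way. The sample object and the choice of protocol 2 are assumptions made for illustration.

```python
import pickle
import pickletools

# Hypothetical sample object; nothing in it is referenced more than once.
obj = {'words': ['the', 'quick', 'brown', 'fox']}

# Protocol 2 is an assumption: it is a protocol that emits PUT opcodes.
fat = pickle.dumps(obj, protocol=2)

# Stand-in for the recipe's optimize(); swap in the recipe's function if you
# have it importable.
slim = pickletools.optimize(fat)

# The optimized pickle must load back to an equal object.
restored = pickle.loads(slim)
print('before: %d bytes, after: %d bytes' % (len(fat), len(slim)))
print('round trip ok:', restored == obj)
```

The receiving side needs no changes: the optimized pickle is loaded with the ordinary loads call.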
This recipe pulls the whole pickle into memory, scans all the GET and PUT opcodes, and then eliminates the unused PUT opcodes from the pickle string. This results in much shorter pickles. The effect is enhanced if you compress the pickle prior to transmission or storage.
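Combining optimization with compression could be sketched as follows. This is an illustration under stated assumptions rather than a measured result: it uses zlib for compression, the standard library's pickletools.optimize in place of the recipe's function, protocol 2, and made-up sample records. Actual savings depend on the data.

```python
import pickle
import pickletools
import zlib

# Hypothetical payload: many small strings, each of which gets its own PUT.
records = ['record-%d' % n for n in range(1000)]

raw = pickle.dumps(records, protocol=2)   # protocol 2 is an assumption
slim = pickletools.optimize(raw)          # stand-in for the recipe's optimize()
packed = zlib.compress(slim)              # compress after optimizing

print('raw pickle:       %d bytes' % len(raw))
print('optimized pickle: %d bytes' % len(slim))
print('optimized + zlib: %d bytes' % len(packed))

# Receiving side: decompress, then unpickle as usual.
restored = pickle.loads(zlib.decompress(packed))
print('round trip ok:', restored == records)
```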