Reading Pickled Data in Go
- Sunday September 07 2014
- python golang pickle serialization
Why?
Lately I've been needing to read data written by Python programs in Go. Normally I just use JSON for this task, but I was convinced there was a better solution. There are more efficient formats like Google Protocol Buffers, but I don't like the idea of having to explicitly define a data structure in advance. In addition, you have to make sure both the reader and writer are using the same version.
In the end I took a look at Python's own pickle module and decided I could implement something in Go to read that without too much trouble. The main appeal of the pickle format is that the newer versions of the protocol can encode a large number of similar objects in a space-efficient manner.
As an example, consider the following program:

import pickle
import json

l = []
with open('/dev/urandom','r') as fin:
    for i in range(1024):
        l.append({"i":i,"apples":ord(fin.read(1)),"banana":ord(fin.read(1)) })

print "JSON: %d bytes" % len(json.dumps(l))
print "Pickle: %d bytes" % len(pickle.dumps(l,2))

The output of this program is:

JSON: 42031 bytes
Pickle: 20521 bytes
Using pickle to encode the same data structure takes only about 49% of the bytes that JSON needs. This is possible because pickle encodes the strings that make up the keys of each dictionary only once and references them throughout the rest of the stream. JSON has to encode the same strings over and over again. Obviously, the JSON in this example is highly compressible. I'm not interested in adding yet another layer of complexity to my software, so I didn't consider using compressed JSON.
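The effect of encoding each key only once can be checked directly. The following is a minimal sketch, assuming Python 3 (the program above is Python 2) and using a deterministic formula in place of /dev/urandom so the result is repeatable. The helper name build_records is made up for this illustration.

```python
import json
import pickle

def build_records(count=1024):
    # Deterministic stand-in for the random bytes used in the article.
    return [{"i": i, "apples": (i * 7) % 256, "banana": (i * 13) % 256}
            for i in range(count)]

records = build_records()
json_text = json.dumps(records)
pickled = pickle.dumps(records, 2)

# The key string is written once in the pickle and once per record in JSON.
print("occurrences of 'apples' in pickle:", pickled.count(b"apples"))
print("occurrences of 'apples' in JSON:  ", json_text.count("apples"))
print("pickle bytes:", len(pickled), "JSON bytes:", len(json_text))
```

The exact byte counts differ from the figures above because Python 3 encodes strings with different opcodes, but the key strings are still stored a single time.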
The format
The pickle format is not so much a serialization format as it is a definition of a machine that can be used to reconstruct objects.
The result of pickling any object in Python is actually a complete "program". The program is composed of a series of opcodes that perform operations on a stack. Each opcode has zero or more operands following it. The result of the program is whatever is left on the stack after all the opcodes are executed. A program always ends with the STOP opcode.
The opcodes defined by the pickle format are much simpler than those used by a microprocessor. There are no branch instructions whatsoever. The opcodes operate on a stack that is conceptually the same as the stack of a microprocessor. There is another data structure that the opcodes can manipulate, known as the memo. This structure is similar to the registers of a microprocessor. Unlike registers, the memo stores objects and has unbounded capacity. Objects are stored in and retrieved from the memo by index. The memo is an optimization that allows frequently used strings to be encoded only once in the data stream. This is one of the advantages of pickle over a format like JSON.
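To make the stack and memo concrete, here is a minimal sketch of such a machine, assuming Python 3. It handles only a hand-picked subset of opcodes and is not how any real unpickler is structured; the function name run and the hand-assembled PROGRAM are illustrative.

```python
import pickle

def run(program):
    """Interpret a tiny, hand-picked subset of pickle opcodes."""
    stack, memo, pos = [], {}, 0
    while True:
        op = program[pos:pos + 1]
        pos += 1
        if op == b"\x80":      # PROTO: one-byte version operand, ignored here
            pos += 1
        elif op == b"K":       # BININT1: push a one-byte unsigned integer
            stack.append(program[pos])
            pos += 1
        elif op == b"U":       # SHORT_BINSTRING: one length byte, then the bytes
            size = program[pos]
            stack.append(program[pos + 1:pos + 1 + size].decode("ascii"))
            pos += 1 + size
        elif op == b"q":       # BINPUT: store the top of the stack in memo slot n
            memo[program[pos]] = stack[-1]
            pos += 1
        elif op == b"h":       # BINGET: push the object held in memo slot n
            stack.append(memo[program[pos]])
            pos += 1
        elif op == b"]":       # EMPTY_LIST: push a new empty list
            stack.append([])
        elif op == b"a":       # APPEND: pop a value, append it to the list below
            value = stack.pop()
            stack[-1].append(value)
        elif op == b".":       # STOP: the top of the stack is the result
            return stack.pop()
        else:
            raise ValueError("unsupported opcode %r" % op)

# Hand-assembled program for ['ab', 'ab', 7]. The string is encoded once
# and fetched from the memo the second time.
PROGRAM = b"\x80\x02]U\x02abq\x00ah\x00aK\x07a."

print(run(PROGRAM))
print(pickle.loads(PROGRAM))
```

Both lines print the same list, and the two strings in the result are the same object because the second one came out of the memo.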
Understanding the format
In order to understand the format I began reading Lib/pickle.py from the Python source code. It turns out this contains almost no documentation. The file Lib/pickletools.py defines the pickletools module. This is where I found an explanation of the format in great detail.
Each opcode starts with a single byte that uniquely identifies it. Zero or more bytes follow the leading byte. These bytes make up the operands. The total width of each opcode and its operands can be variable. This is the case for opcodes that encode strings, for example.
Each protocol version defines successively higher-valued opcodes. By checking the range of the opcodes, the version of pickled data can be determined. Each pickle protocol version is a superset of all the previous versions. For example, protocol 2 includes all the opcodes from versions 1 and 0.
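One way to determine the version of a pickle is to take the highest protocol number among its opcodes. This sketch assumes Python 3 and leans on the metadata that pickletools attaches to each opcode (pickletools.genops and the proto attribute) instead of comparing byte ranges by hand. The function name highest_protocol is made up here.

```python
import pickle
import pickletools

def highest_protocol(data):
    """Return the highest protocol number among the opcodes in data."""
    return max(opcode.proto for opcode, arg, pos in pickletools.genops(data))

for version in (0, 1, 2):
    data = pickle.dumps("hydrogen18", version)
    print("requested", version, "detected", highest_protocol(data))
```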
The pickletools.opcodes list contains objects that describe each opcode in detail. It includes an indication of what the operands to each opcode are as well. This is quite helpful to read, as the encoding of some values is non-obvious.
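A short sketch of reading that list, assuming Python 3. The attributes used (name, code, arg, proto) are the ones pickletools defines on each entry; the describe helper and the OPCODES dictionary are illustrative.

```python
import pickletools

# Index the opcode descriptions by name for easy lookup.
OPCODES = {info.name: info for info in pickletools.opcodes}

def describe(name):
    """Summarize one opcode: its byte, operand type and protocol version."""
    info = OPCODES[name]
    operand = info.arg.name if info.arg is not None else None
    return (info.code, operand, info.proto)

for name in ("STRING", "SHORT_BINSTRING", "BININT1", "STOP"):
    print(name, describe(name))
```

The operand names, such as stringnl for a newline-terminated string or uint1 for a one-byte unsigned integer, are what tell an implementer how many bytes to consume after the opcode.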
Inspecting the format
The function pickletools.dis can be used to inspect pickled data. This is very useful when debugging an implementation. You can inspect how encodings changed across protocol versions.
I compared the encoding of the string "hydrogen18" in protocol versions 0 and 2 for this example.
>>> pickletools.dis(pickle.dumps("hydrogen18",0))
    0: S    STRING     'hydrogen18'
   14: p    PUT        0
   17: .    STOP
highest protocol among opcodes = 0
>>> pickletools.dis(pickle.dumps("hydrogen18",2))
    0: \x80 PROTO      2
    2: U    SHORT_BINSTRING 'hydrogen18'
   14: q    BINPUT     0
   16: .    STOP
highest protocol among opcodes = 2
The disassembly of each encoding is shown. The encoding of the string in protocol version 0 uses the STRING opcode, which is newline terminated. In protocol version 2 the SHORT_BINSTRING opcode is used, which includes the length of the string as a byte directly before the string.
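The two operand styles can be decoded by hand. This is a minimal sketch, assuming Python 3; the byte strings are written out manually to match the disassembly above, and both reader functions are illustrative. Using ast.literal_eval to undo the repr() quoting is a shortcut for this example, and a real decoder would need its own unescaping.

```python
import ast
import pickle

PROTO0 = b"S'hydrogen18'\np0\n."
PROTO2 = b"\x80\x02U\nhydrogen18q\x00."   # "\n" is the length byte, 10

def read_string(data, pos):
    """STRING: a quoted repr() up to the next newline. Returns (value, next_pos)."""
    end = data.index(b"\n", pos)
    return ast.literal_eval(data[pos:end].decode("ascii")), end + 1

def read_short_binstring(data, pos):
    """SHORT_BINSTRING: one length byte, then that many raw bytes."""
    size = data[pos]
    return data[pos + 1:pos + 1 + size].decode("ascii"), pos + 1 + size

# Operands start right after the opcode byte: offset 1 and offset 3.
print(read_string(PROTO0, 1))
print(read_short_binstring(PROTO2, 3))
```

In both cases the next position is 14, which matches the offset of the PUT and BINPUT opcodes in the disassembly.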
Protocol changes across versions
Each protocol version introduces new opcodes. These opcodes do not replace old ones, but generally make them obsolete. Protocol versions 0 & 1 predate Python 2.3. Protocol version 2 was introduced in Python 2.3.
Protocol version 0 was the original pickle implementation. Prior to Python version 2.3, this was referred to as "text mode". It uses opcodes like all the following versions, but the encoding of objects is based primarily on the result of calling repr(obj). As a result, strings are encoded with a leading and trailing single quote. Integers are encoded as their base-10 ASCII representation. Bizarrely, the Python values True and False are encoded as the strings "01" and "00" respectively.
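The boolean special case amounts to a check on the operand of the integer opcode (INT, written as "I"). A minimal sketch, assuming Python 3, with hand-written pickles and an illustrative function name:

```python
import pickle

TRUE_V0 = b"I01\n."    # INT opcode with operand "01", then STOP
FALSE_V0 = b"I00\n."
ONE_V0 = b"I1\n."      # an ordinary integer for comparison

def decode_int_operand(text):
    """Decode the operand of the protocol 0 INT opcode."""
    if text == b"01":
        return True
    if text == b"00":
        return False
    return int(text)

print(decode_int_operand(b"01"), decode_int_operand(b"00"), decode_int_operand(b"1"))
print(pickle.loads(TRUE_V0), pickle.loads(FALSE_V0), pickle.loads(ONE_V0))
```

The trick works because an integer is never written with a leading zero, so "01" and "00" cannot collide with real integer output.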
Almost all the opcodes in version 0 use newline-terminated operands. When encoding, any newlines in a string must be escaped. When the output is decoded, they must be unescaped. Due to decisions such as this, the performance of unpickling data encoded using version 0 is poor.
Protocol version 1 introduced "binary" opcodes. Prior to Python version 2.3 this was referred to as "binary mode". These opcodes can be used in place of the version 0 opcodes for encoding integers, floats, strings, etc. For values such as integers, the size of the operand is fixed. For strings, the length of the encoded string is the first operand. This removes the need for newline termination and unescaping when the value is decoded.
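As an illustration of fixed-size operands, here is a sketch that decodes the three binary integer opcodes from protocol 1, assuming Python 3. The function name is made up; the opcode bytes and widths are those listed in pickletools.

```python
import pickle
import struct

def decode_binary_int(data):
    """Decode one protocol 1 integer opcode. Returns (value, bytes_used)."""
    op = data[0:1]
    if op == b"K":      # BININT1: 1-byte unsigned
        return data[1], 2
    if op == b"M":      # BININT2: 2-byte unsigned, little-endian
        return struct.unpack("<H", data[1:3])[0], 3
    if op == b"J":      # BININT: 4-byte signed, little-endian
        return struct.unpack("<i", data[1:5])[0], 5
    raise ValueError("not a binary integer opcode: %r" % op)

print(decode_binary_int(b"K\x07"))
print(decode_binary_int(b"M\x00\x01"))
print(decode_binary_int(b"J\xff\xff\xff\xff"))
```

Because the width is known from the opcode alone, the decoder never has to scan for a terminator.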
Protocol version 2 was added in Python 2.3. It introduced opcodes to encode the values True and False directly. It also introduced opcodes that allow faster decoding of small tuples. Most importantly, it introduced encoding Python long types as little-endian base-256 two's complement (a byte string, basically) rather than the base-10 ASCII string representation.
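The byte-string encoding of long values can be sketched as follows, assuming Python 3 (where long and int are one type). The LONG1 opcode is byte 0x8a followed by a one-byte length and then the value bytes. Both function names are illustrative, and encode_long1 only covers values that fit in 255 bytes.

```python
import pickle

def decode_long(data):
    """Decode little-endian two's complement bytes, as used by LONG1."""
    return int.from_bytes(data, "little", signed=True)

def encode_long1(value):
    """Build a complete protocol 2 pickle holding one integer via LONG1."""
    # One extra bit is needed for the sign; zero is encoded as no bytes at all.
    size = (value.bit_length() + 8) // 8 if value else 0
    payload = value.to_bytes(size, "little", signed=True)
    return b"\x80\x02\x8a" + bytes([size]) + payload + b"."

print(decode_long(b"\xff\x00"))   # the second byte keeps the sign bit clear
print(decode_long(b"\xff"))       # a single 0xff byte is negative
print(pickle.loads(encode_long1(2 ** 100)) == 2 ** 100)
```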
When writing data from Python, you should always use the newest version of the protocol possible. This can be done by passing -1 to the functions in the pickle module. The newer protocol versions are much faster to read and write.
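A small sketch of selecting the protocol, assuming Python 3. One caveat worth stating plainly: on Python 3, -1 selects a protocol newer than 2, so when the reader only understands protocols 0 through 2, as the library described in this article does, passing 2 explicitly is the safer choice.

```python
import pickle

data = {"apple": 1, "banana": 2}

# A negative protocol number selects pickle.HIGHEST_PROTOCOL.
newest = pickle.dumps(data, -1)

# Pinning the version keeps the output readable by protocol 2 decoders.
pinned = pickle.dumps(data, 2)

print("highest protocol on this interpreter:", pickle.HIGHEST_PROTOCOL)
print("pinned pickle starts with PROTO 2:", pinned[:2] == b"\x80\x02")
```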
How classes are serialized
One of the peculiar facts about the pickle format is that reading pickled data is completely insecure as implemented in Python. This has to do with the way that objects are pickled. The object is not pickled completely. The data of the object is pickled and a reference to its type is pickled along with that data.
Let's take a look at what happens when a user-defined class instance is serialized. This program produces the following output.
import pickle
import pickletools

class myclass(object):
    def __init__(self,a,b):
        self.a = a
        self.b = b
        self.x = 0

myinst = myclass(1,2)
pickletools.dis(pickle.dumps(myinst,2))

    0: \x80 PROTO      2
    2: c    GLOBAL     '__main__ myclass'
   20: q    BINPUT     0
   22: )    EMPTY_TUPLE
   23: \x81 NEWOBJ
   24: q    BINPUT     1
   26: }    EMPTY_DICT
   27: q    BINPUT     2
   29: (    MARK
   30: U        SHORT_BINSTRING 'a'
   33: q        BINPUT     3
   35: K        BININT1    1
   37: U        SHORT_BINSTRING 'x'
   40: q        BINPUT     4
   42: K        BININT1    0
   44: U        SHORT_BINSTRING 'b'
   47: q        BINPUT     5
   49: K        BININT1    2
   51: u        SETITEMS   (MARK at 29)
   52: b    BUILD
   53: .    STOP
The opcode GLOBAL finds the definition of the class. The NEWOBJ opcode constructs an instance of myclass. However, this calls __new__ on the class, not the __init__ defined in the program. The members of the class are restored using the opcodes between MARK and SETITEMS. The various BINPUT opcodes are just memoization and not important in this example.
This pickle program is loosely equivalent to the following Python code:

import __main__
cls = getattr(__main__,'myclass')
i = cls.__new__(cls,*())
setattr(i,"a",1)
setattr(i,"x",0)
setattr(i,"b",2)
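That __init__ is skipped can be confirmed without involving pickle at all. This is a minimal sketch, assuming Python 3. The rebuild function mimics what NEWOBJ followed by BUILD does for a plain class; the class name, the init_calls counter and the function are illustrative.

```python
class MyClass(object):
    init_calls = 0   # counts how many times __init__ has run

    def __init__(self, a, b):
        MyClass.init_calls += 1
        self.a = a
        self.b = b
        self.x = 0

def rebuild(cls, state):
    """Mimic NEWOBJ followed by BUILD: allocate, then restore attributes."""
    instance = cls.__new__(cls)        # NEWOBJ: __init__ is never called
    instance.__dict__.update(state)    # BUILD: the state dict becomes attributes
    return instance

restored = rebuild(MyClass, {"a": 1, "x": 0, "b": 2})
print(restored.a, restored.b, restored.x, "init calls:", MyClass.init_calls)
```

Any side effects or validation placed in __init__ therefore do not run when an instance is restored this way.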
Obviously, if this is unpickled and there is no definition of myclass, the unpickling fails. This is the same thing that makes unpickling data from an untrusted source so dangerous: pickle programs can cause arbitrary code to be imported. This means that unpickling can result in arbitrary code execution. Since there is no equivalent to Python's runtime evaluated import statement in Go, the same insecurity doesn't exist in my implementation.
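On the Python side, the import step can be closed off by overriding find_class on pickle.Unpickler, which is the approach the Python documentation describes for restricting globals. This sketch assumes Python 3 and refuses every global, which is stricter than most applications need. The class and helper names are illustrative.

```python
import io
import pickle

class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses every GLOBAL lookup, so nothing gets imported."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError("global %s.%s is forbidden" % (module, name))

def safe_loads(data):
    return NoGlobalsUnpickler(io.BytesIO(data)).load()

def is_rejected(data):
    """Return True when safe_loads refuses the pickle."""
    try:
        safe_loads(data)
    except pickle.UnpicklingError:
        return True
    return False

# Hand-written pickle: GLOBAL 'builtins len', then STOP. Harmless, but it
# exercises the same lookup that a malicious pickle would abuse.
IMPORTS_SOMETHING = b"cbuiltins\nlen\n."
PLAIN_DATA = pickle.dumps({"apple": 1}, 2)

print(safe_loads(PLAIN_DATA))
print("rejected:", is_rejected(IMPORTS_SOMETHING))
```

Plain containers, strings and numbers never reach find_class, so they load normally.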
Due to the complexity of trying to unpickle Python object instances into some sort of representation in Go, I skipped that for now.
Implementation
After understanding the format I decided on a couple of goals for my library.
- Conversion to native Go types whenever possible.
- Complete support for Pickle protocols 0, 1, & 2.
- Allow unpickling a Python dict into a Go struct where the keys of the dict become the fields of the struct.
I opted not to try and tackle the problem of unpickling an instance of a Python object for now because I have no real need to do so. If you attempt to unpickle Python object instances using my library you'll get an error indicating an opcode is unsupported.
Python tuples have no good equivalent in Go. In the end I just create an []interface{} whenever a tuple is encountered.
Since the opcodes are always a single byte, I decided to use a jump list (a table of functions indexed by the opcode byte) to call the function that implements each opcode.
I'm not terribly interested in writing a ton of boilerplate, so I ended up writing a Python program that reads pickletools.opcodes to produce the function definitions for each opcode and populates a jump list. The jump list has a static length of 256. By default, each value points at a function that just returns an error indicating the opcode is invalid. The values that represent valid opcodes are populated with the appropriate function. All the generated boilerplate just returns ErrOpcodeNotImplemented. So all I had to do was fill in the blanks. I also had the Python program generate documentation for each opcode's function. This is how I generated the files protocol_0.go, protocol_1.go, protocol_2.go, and populate_jump_list.go.
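A generator along these lines might look like the sketch below, assuming Python 3. It is not the actual generator used for the library: the Go names opcode_X, jumpList and populateJumpList are hypothetical, the stubs are plain functions for brevity, and the output is a fragment that relies on ErrOpcodeNotImplemented and the jump list being declared elsewhere.

```python
import pickletools

def generate_go_stubs(max_protocol=2):
    """Emit Go source text: one stub per opcode plus jump list assignments."""
    stubs, assignments = [], []
    for info in pickletools.opcodes:
        if info.proto > max_protocol:
            continue                        # skip opcodes from newer protocols
        func = "opcode_" + info.name        # hypothetical naming scheme
        # First line of the pickletools description becomes the Go doc comment.
        summary = (info.doc.strip().splitlines() or [info.name])[0]
        stubs.append("// %s\nfunc %s() error {\n\treturn ErrOpcodeNotImplemented\n}\n"
                     % (summary, func))
        assignments.append("\tjumpList[0x%02x] = %s" % (ord(info.code), func))
    body = "\n".join(assignments)
    return "\n".join(stubs) + "\nfunc populateJumpList() {\n" + body + "\n}\n"

source = generate_go_stubs()
print(source[:300])
```

Slots that are never assigned would keep a default handler that reports an invalid opcode, as described above.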
The implementation of each opcode consists of checking any logical preconditions about the stack and then performing the action associated with the opcode. For example, the SETITEM opcode cannot be executed without at least 3 objects on the stack.
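The check-then-act pattern for SETITEM could be sketched like this, assuming Python 3 and a plain list as the stack. The exception class and function name are illustrative and do not come from the library.

```python
class StackUnderflow(Exception):
    """Raised when an opcode needs more objects than the stack holds."""

def op_setitem(stack):
    """SETITEM: pop a value and a key, then store them in the dict below."""
    # Preconditions first, so a bad pickle fails before the stack is modified.
    if len(stack) < 3:
        raise StackUnderflow("SETITEM needs 3 objects, found %d" % len(stack))
    if not isinstance(stack[-3], dict):
        raise TypeError("SETITEM target is not a dict")
    value = stack.pop()
    key = stack.pop()
    stack[-1][key] = value

stack = [{}, "apple", 1]
op_setitem(stack)
print(stack)
```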
The result of calling Unpickle is always the type interface{}. This isn't useful on its own, so I implemented helpers that take the result of Unpickle and attempt to convert it to the requested type. This solution is inspired by the package github.com/garyburd/redigo. You can use them like so:
result, err := stalecucumber.Bool(stalecucumber.Unpickle(reader))
If the result of stalecucumber.Unpickle was not an error and can be converted into a bool, then result has that value after the helper Bool returns. Otherwise an error is returned indicating why the type conversion cannot take place.
To unpack Python dict types into a struct in Go, I had to use the reflect package. This turns out mainly to be an exercise in boilerplate and corner cases. In order for this to work, the Python dict must have keys that are all unicode or string types. The keys are assigned to matching field names in the struct, with the exception that the first letter of the key is always capitalized.
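The capitalization rule is needed because Go's reflect package can only set exported fields, and exported names start with an upper-case letter. The mapping itself is tiny. This Python sketch states the rule as described above and is not taken from the library's source; the function name is illustrative.

```python
def field_name_for(key):
    """Map a dict key to the exported struct field name it would fill."""
    return key[:1].upper() + key[1:]

for key in ("apple", "banana", "cat", "Dog"):
    print(key, "->", field_name_for(key))
```

Only the first letter changes, so a key that is already capitalized maps to itself.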
This example demonstrates the unpacking into structs
pickle.dumps({"apple":1,"banana":2,"cat":"hello","Dog":42.0})

This Python code pickles a Python dict.

var pickledDict io.Reader
mystruct := struct{
	Apple int
	Banana uint
	Cat string
	Dog float32}{}

err := stalecucumber.UnpackInto(&mystruct).From(stalecucumber.Unpickle(pickledDict))
If pickledDict is an io.Reader containing the result of the call to pickle.dumps, then mystruct is populated with the values. The types of the structure must loosely agree with the values in the Python dict.
Getting it
You can get my library at github.com/hydrogen18/stalecucumber or using the following command
go get github.com/hydrogen18/stalecucumber
You can read the full documentation at godoc.org/github.com/hydrogen18/stalecucumber
Future work
In the future, I plan on adding support for writing pickled objects from Go. This should be simpler as there are fewer problems around type conversion. I have some really interesting ideas about how to handle unpickling Python object instances that I'd like to try out. I haven't benchmarked this library against something like encoding/json at all yet, but I suspect its performance is extremely poor. This is certainly the case when reading pickled objects using protocol 0. Realistically I am only concerned about the performance when reading objects pickled using protocol 2.