Reading Pickled Data in Go

Sunday September 07 2014
python golang pickle serialization

Why?

Lately I've been needing to read data written by Python programs in Go. Normally I just use JSON for this task, but was convinced there was a better solution. There are more effecient formats like Google Protocol Buffers, but I don't like the idea of having to explicitly define a data structure in advance. In addition, you have to make sure both the reader and writer are using the same version.

In the end I took a look at Python's own pickle module and decided I could implement something in Go to read that without too much trouble. The main appeal of the pickle format is that the newer versions of the protocol can encode large number of similar objects in a space-efficient manner.

As an example, consider the following program

import pickle
import json
l = []
with open('/dev/urandom','r') as fin:
    for i in range(1024):
        l.append({"i":i,"apples":ord(fin.read(1)),"banana":ord(fin.read(1)) })  

print "JSON: %d bytes" % len(json.dumps(l))
print "Pickle: %d bytes" % len(pickle.dumps(l,2))

The output of this program is

JSON: 42031 bytes
Pickle: 20521 bytes

Using pickle to encode the same data structure uses only 48% of the bytes of JSON. This is possible because pickle only encodes the strings that make up the keys of each dictionary once and references them throughout the program. JSON has to encode the same strings over and over again. Obviously, the JSON in this example is highly compressible. I'm not interested in adding yet another layer of complexity to my software so I didn't consider using compressed JSON.

The format

The pickle format is not so much a serialization format as it is a definition of a machine that can be used to reconstruct objects.

The result of pickling any object in Python is actually a complete "program". The program is composed of a series of opcodes that perform operations on a stack. Each opcode has zero or more operands following it. The result of the program is whatever is left on the stack after all the opcodes are executed. A program always ends with the STOP opcode.

The opcodes defined by the pickle format are much simpler than those used by a microprocessor. There are no branch instructions whatsoever. The opcodes operate on a stack that is the same conceptually as the stack of a microprocessor. There is another data structure that the opcodes can manipulate known as the memo. This structure is similiar to the registers of a microprocessor. Unlike registers, the memo stores objects and has an infinite capacity. Objects are stored and retrieved from the memo positionally. The memo is an optimization that allows frequently used strings to be encoded only once in the data stream. This is one of the advantages of pickle over a format like JSON.

Understanding the format

In order to understood the format I began reading Lib/pickle.py from the python source code. It turns out this contains almost no documentation. The file Lib/pickletools.py defines the pickletools module. This is where I found an explanation of the format in great detail.

Each opcode starts with a single byte that uniquely identifies it. Zero or more bytes follow the leading byte. These bytes make up the operands. The total width of each opcode and its operands can be variable. This is the case of opcodes that encode strings, for example.

Each protocol version defines successfully higher valued opcodes. By checking the range of the opcodes, the version of pickled data can be determined. Each pickle protocol version is a superset of all the previous versions. For example, protocol 2 includes all the opcodes from versions 1 and 0.

The pickletools.opcodes list contains objects that describe each opcode in detail. It includes an indication of what the operands to each opcode are as well. This is quite helpful to read, as the encoding of some values is non-obvious.

Inspecting the format

The function pickletools.dis can be used to inspect pickled data. This is very useful when debugging an implementation. You can inspect how encodings changed across protocol versions.

I compared the encoding of the string "hydrogen18" in protocol versions 0 and 2 for this example.

>>> pickletools.dis(pickle.dumps("hydrogen18",0))
    0: S    STRING     'hydrogen18'
   14: p    PUT        0
   17: .    STOP
highest protocol among opcodes = 0
>>> pickletools.dis(pickle.dumps("hydrogen18",2))
    0: \x80 PROTO      2
    2: U    SHORT_BINSTRING 'hydrogen18'
   14: q    BINPUT     0
   16: .    STOP
highest protocol among opcodes = 2

The disassembly of each encoding is shown. The encoding of the string in protocol version 0 uses the STRING opcode which is newline terminated. In protocol version 2 the SHORT_BINSTRING opcode is used which includes the length of the string as a byte directly before the string.

Protocol changes across versions

Each protocol version introduces new opcodes. These opcodes do not replace old ones, but generally make them obsolete. Protocol versions 0 & 1 predate Python 2.3. Protocol version 2 was introduced in Python 2.3.

Protocol version 0 was the original pickle implementation. Prior to Python version 2.3, this was refered to as "text mode". It uses opcodes like all the following versions, but the encoding of objects is based primarily off the result of calling repr(obj). As a result, strings are encoded with a leading and trailing single quote. Integers are encoded as the base-10 ASCII representation. Bizarrely, the python values True and False are encoded as the strings "01" and "00" respectively.

Almost all the opcodes in version 0 use newline terminated operands on the opcodes. When encoding, the input of strings must have any newlines escaped. When output is decoded this must be unescaped. Due to decisions such as this the performance of unpickling data encoded using version 0 is poor.

Protocol version 1 introduced "binary" opcodes. Prior to Python version 2.3 this was refered to as "binary mode". These opcodes can be used in place of the version 0 opcodes for encoding integers, floats, strings, etc. For values such as integers, the size of the operand is fixed. For strings, the length of the encoded string is the first operand. This removes the need for newline termination and un-escaping when the value is decoded.

Protocol version 2 was added in Python 2.3. It introduced opcodes to encode the values True and False directly. It also introduced opcodes that allow faster decoding of small tuples. Most importantly it introduced encoding python long types as their base-256 complement (a byte string, basically) rather than the base-10 ASCII string representation.

When writing data from Python, you should always use the newest version of the protocol possible. This can be done by passing -1 to the functions in the pickle module. The newer protocol versions are much faster to read and write.

How classes are serialized

One of the pecuilar facts about the pickle format is that reading pickled data is completely insecure as implemented in Python. This has to do with the way that objects are pickled. The object is not pickled completely. The data of the object is pickled and a reference to its type is pickled along with that data.

Let's take a look at what happens when a user-defined class instance is serialized. This program produces the following output.

import pickle
import pickletools
class myclass(object):
    def __init__(self,a,b):
        self.a = a
        self.b = b
        self.x = 0

myinst = myclass(1,2)
pickletools.dis(pickle.dumps(myinst,2))

   0: \x80 PROTO      2
    2: c    GLOBAL     '__main__ myclass'
   20: q    BINPUT     0
   22: )    EMPTY_TUPLE
   23: \x81 NEWOBJ
   24: q    BINPUT     1
   26: }    EMPTY_DICT
   27: q    BINPUT     2
   29: (    MARK
   30: U        SHORT_BINSTRING 'a'
   33: q        BINPUT     3
   35: K        BININT1    1
   37: U        SHORT_BINSTRING 'x'
   40: q        BINPUT     4
   42: K        BININT1    0
   44: U        SHORT_BINSTRING 'b'
   47: q        BINPUT     5
   49: K        BININT1    2
   51: u        SETITEMS   (MARK at 29)
   52: b    BUILD
   53: .    STOP

The opcode GLOBAL finds the definition of the class. The NEWOBJ opcode constructs an instance of myclass. However, this calls __new__ on the class, not the __init__ defined in the program. The members of the class are restored using the opcodes between MARK and SETITEMS. The various BINPUT opcodes are just memoization and not important in this example.

This pickle program is loosely equivalent to the following Python code

import __main__
cls = getattr(__main__,'myclass')
i = cls.__new__(cls,*())
setattr(i,"a",1)
setattr(i,"x",0)
setattr(i,"b",2)

Obviously, if this is unpickled and there is no definition of myclass the unpickling fails. This is the same thing that makes unpickling data from an untrusted source so dangerous: pickle programs can cause arbitrary code to be imported. This means that unpickling can result in arbitrary code execution. Since there is no equivalent to Python's runtime evaluated import statement in Go, the same insecurity doesn't exist in my implementation.

Due to the complexity of trying to unpickle Python object instances into some sort of representation in Go, I skipped that for now.

Implementation

After understanding the format I decided on a couple of goals for my library.

Conversion to native Go types whenever possible.
Complete support for Pickle protocols 0,1, & 2.
Allow unpickling a python dict into a Go struct where the keys of the dict become the fields of the struct.

I opted not to try and tackle the problem of unpickling an instance of a Python object for now because I have no real need to do so. If you attempt to unpickle Python object instances using my library you'll get an error indicating an opcode is unsupported.

Python tuples have no good equivalent in Go. In the end I just create an []interface{} whenever a tuple is encountered.

Since the opcodes are always a single byte, I decided to use a jump list to call the function that implements each method.

I'm not terribly interested in writing a ton of boilerplate, so I ended up writing a Python program that reads pickletools.opcodes to produce the function definitions for each opcode and populates a jump list. The jump list has a static length of 256. By default, each value points at a function that just returns an error indicating the opcode is invalid. The values that represent valid opcodes are populated with the appropriate function. All the generated boilerplate just returns ErrOpcodeNotImplemented. So all I had to do was fill in the blanks. I also had the Python program generate documentation for each opcode's function. This is how I generated the files protocol_0.go,protocol_1.go, protocol_2.go, populate_jump_list.go.

The implementation of each opcode consists of checking any logical preconditions about the stack then performing the action associated with opcode. For example the SETITEM opcode without at least 3 objects on the stack is not computable.

The result of calling Unpickle is always the type interface{}. This isn't useful, so I implemented helpers that take the result of Unpickle and attempt to convert it to the requested type. This solution is inspired by the package github.com/garyburg/redigo. You can use them like so:

result, err := stalecucumber.Bool(stalecucumber.Unpickle(reader))

If the result of stalecucumber.Unpickle was not an error and can be converted into a bool, then result has that value after the helper Bool returns. Otherwise an error is returned indicating why the type conversion cannot take place.

To unpack python dict types into a struct in Go, I had to use the reflect package. This turns out mainly to be an exercise in boilerplate and corner cases. In order for this to work, the python dict must have keys that are all unicode or string types. The keys are assigned to matching field names in the struct with the exception that the first letter of the key is always capitalized.

This example demonstrates the unpacking into structs

pickle.dumps({"apple":1,"banana":2,"cat":"hello","Dog":42.0})

This Python code pickles a python dict

var pickledDict io.Reader
mystruct := struct{
    Apple int
    Banana uint
    Cat string
    Dog float32}{}

err := stalecucumber.UnpackInto(&mystruct).From(stalecucumber.Unpickle(pickledDict))

If pickledDict is an io.Reader containing the result of the call to pickle.dumps then mystruct is populated with the values. The types of the structure must loosely agree with the values in the python dict.

Getting it

You can get my library at github.com/hydrogen18/stalecucumber or using the following command

go get github.com/hydrogen18/stalecucumber

You can read the full documentation at godoc.org/github.com/hydrogen18/stalecucumber

Future work

In the future, I plan on adding to the library the writing of pickled objects from Go. This should be simpler as are less problems around type conversion. I have some really interesting ideas about how to handle unpickling Python object instances that I'd like to try out. I haven't benchmarked this library against something like encoding/json at all yet, but I suspect it is extremely poor. This is certainly the case when reading pickled objects using protocol 0. Realistically I am only concerned about the performance when reading objects pickled using protocol 2.