Golang, ORMs, and why I am still not using one.

In the past year, more than 90% of my output has been software written in Golang. In that time frame, all of the employment-related projects I've worked on have involved a relational database.

So far I am content using the built-in database/sql package with the github.com/lib/pq driver package for PostgreSQL. I've completely resisted the urge to move to an ORM or similar package.

The low entry cost of the sql package

The first reason I am not using an ORM is that the built-in database/sql package gives you a great deal of functionality without adding significant complexity to your program. Once you have it working, you immediately get connection pooling and goroutine-safe sharing of connections. Executing a query against the database is as simple as calling the function appropriate for your query. Queries break down roughly into the following groups:

  1. A query that returns no rows.
  2. A query that returns exactly one row.
  3. A query that returns zero or more rows.

After obtaining the query result, the Scan method is called on the sql.Row type to read the result into native Go types. If you pass in a pointer to an incompatible type, you get a runtime error.

All of your SQL is parameterized, so there is no reason to use string interpolation to generate queries.

You also get transactions, which superficially appear the same as using the database connection object. Prepared statement support also exists, which can make your life easier.

Downsides to the sql package

There are definitely places where the database/sql package falls short of my wants.

Since query parameters are all passed around as interface{}, you can easily write a query that will always fail at run time. The same goes for scanning results into types. Interestingly, if you query a column that could be null, you have to craft your query to return a default value or scan into a nullable type such as a pointer to a pointer. Otherwise, when you select rows with non-null entries the query executes just fine, but it fails whenever a null is encountered.

--Table with null columns
CREATE TABLE expense_reports (
    id int,
    trip_id int,
    title varchar not null
);

--This row doesn't cause problems for the code below
INSERT INTO expense_reports (id,trip_id,title) VALUES(1,591628,'First expense report');
--This row causes the code below to fail
INSERT INTO expense_reports (id,trip_id,title) VALUES(2,null,'Second expense report');
package main

import (
    "database/sql"
    "fmt"
    "os"

    _ "github.com/lib/pq"
)

type ExpenseReport struct {
    Id     int
    TripId int
    Title  string
}

func GetExpenseReportById(db *sql.DB, id int) (*ExpenseReport, error) {
    const query = `SELECT trip_id, title FROM expense_reports WHERE id = $1`
    var retval ExpenseReport
    err := db.QueryRow(query, id).Scan(&retval.TripId, &retval.Title)
    retval.Id = id
    return &retval, err
}

func main() {
    db, err := sql.Open("postgres", "dbname=example sslmode=disable")
    if err != nil {
        panic(err)
    }
    defer db.Close()
    for i := 1; i != 3; i++ {
        expenseReport, err := GetExpenseReportById(db, i)
        if err != nil {
            fmt.Fprintf(os.Stdout, "Error:%s\n", err)
        } else {
            fmt.Fprintf(os.Stdout, "Expense Report:%v\n", expenseReport)
        }
    }
}

When the function GetExpenseReportById is used to retrieve the expense report whose id column is 2, this code fails. The output of the example is shown here:

ericu@luit:~$ go run ./sql_null_column_example.go 
Expense Report:&{1 591628 First expense report}
Error:sql: Scan error on column index 0: converting string "<nil>" to a int: strconv.ParseInt: parsing "<nil>": invalid syntax

As you can see above, SQL statements are just strings shipped around to different pieces of software. There is no way to verify that your SQL is valid, much less that it returns the correct result, without using an actual database.

All this means that you've given up one of the most awesome things about Golang: the compile-time verification of types. I can't say I've ever been excited to find a stack trace involving Python's AttributeError because I forgot to implement some intermittently called method on an object. Golang's compile-time checking of types is sometimes dismissed as being of little value, since it can only verify the syntactic validity of code. But it is nigh-impossible to produce a program with invalid syntax that has the correct semantics, so compile-time type checking provides a first layer of defense against such errors.

My solution to this is to make sure each function that contains SQL is fully tested against an actual database. There seems to be some sentiment against requiring a database to run unit tests; some might even argue that such tests aren't unit tests at all. I have never understood this position, as SQL is just more source code as far as I'm concerned, and it is rarely a mistake to test code. I don't plan on mocking out a complete RDBMS just so I can claim I am doing "real" unit testing.

The second problem with the database/sql package is that if you have some object that could logically be returned from many different queries you wind up duplicating the logic to Scan the result rows into the object all over the place.

I mitigated this by defining package-visible functions such as scanCustomer that are called as helpers from other functions. This works, but I ended up defining my own interface called scanner to capture the signature of the (*sql.Row).Scan method. This approach requires that the caller use the same set and ordering of columns when querying the row from the database.

I of course do not intend to leave out that using the database/sql package tends to involve a fair amount of boilerplate. You wind up checking for things like sql.ErrNoRows in lots of places. There are good arguments against boilerplate, but I do not find it much of an impediment to writing software. The amount of time I spend actually typing code tends to be very small, so spending 20 seconds getting the boilerplate in place for a function isn't an issue for me.

Boilerplate does become a huge problem in the maintenance of existing code. You may suddenly need to touch every function in a package if your general way of doing things changes. It also shows up as a real problem when it acts as a barrier to prototyping new things in an application. Spending 2 minutes to implement a new widget in a user interface and 2 hours to write the data access code that populates the widget is never fun. Eventually this results in developers not wanting to work on the project at all, and those who do work on it actively avoid looking for new, better ways to do things.

Golang & ORMs

There is no shortage of packages for Golang that do something with your objects and a database. Missing from all of them, of course, is the "R" in relational. I am not the first person to put this view forward; in fact, this presentation gives a great summary of the state of ORMs in Golang.

Fundamentally, a package in Golang is a run-time-only affair. If you're expecting the level of integration that ActiveRecord achieves with Ruby, you are going to be disappointed. There is no way to implement an ORM at run time in Go that integrates with the language. From the perspective of a language designer, this is actually a good thing. But it means that the aforementioned shortcomings of the sql package still exist when using any of the ORM-like packages available for Go.

In other words, they don't actually solve any of my problems, so it doesn't make sense to take the time to learn their usage patterns and add another software dependency. After all, if I can't express the data manipulation I need with SQL, I have a serious problem. Any ORM-like package I select is likely to make using some features of SQL more difficult than just writing SQL. If it does expose all the features of SQL, it probably does so poorly. This is known as the inner-platform effect: a software package that tries to abstract away the facts of the underlying implementation eventually becomes a poor copy of that implementation.

Why ORMs bother me

The problems with ORMs are not unique to Golang. Any statically typed language eventually runs into this sort of problem. The Java environment is able to deal with it slightly better than C, but the result is still awful. The only reason dynamically typed languages wind up better off is their ability to dictate and alter the behavior of objects at run time.

Certain RDBMS concepts simply do not "map" at all to object-oriented programming. Uniqueness and primary key constraints on tables are an excellent example of this. Any half-decent textbook on object-oriented software development covers the difference between a "has-a" and an "is-a" relationship between objects, but none of them cover object identity. Object-oriented programming certainly has the concept of object equivalence, but that is actually the opposite of object uniqueness: if you have two objects that are equivalent, then you explicitly do not enforce object uniqueness. The idea of object uniqueness doesn't exist in functional or imperative programming languages either. You certainly could emulate it in Python, for example, with code something like this:

already_expensed = set()

class ExpenseReport(object):
    def __init__(self, for_trip):
        # Refuse to construct a second report for the same trip.
        if for_trip in already_expensed:
            raise ValueError("You can't submit two expense reports for the same trip")

        already_expensed.add(for_trip)

Let's ignore for a moment the fact that already_expensed isn't persistent. This code certainly enforces that you can't create two instances of ExpenseReport for the same trip, and it certainly mirrors an actual database column constraint on some table that is in use somewhere right now. But the implementation has nothing to do with object-oriented programming; we just happen to be checking a global variable in a constructor. The same concept could be applied to a function to cache the return value for a given set of inputs, using a map instead of a set.
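A Go sketch of that map-based variant, using a contrived function purely for illustration:

```go
package main

import "fmt"

// squareCache plays the role of the already_expensed set, except that it
// maps each input to a cached result instead of merely recording membership.
var squareCache = map[int]int{}

func memoizedSquare(n int) int {
    if v, ok := squareCache[n]; ok {
        return v // previously computed; skip the work
    }
    v := n * n
    squareCache[n] = v
    return v
}

func main() {
    fmt.Println(memoizedSquare(12)) // computes and caches; prints 144
    fmt.Println(memoizedSquare(12)) // served from the cache; prints 144
}
```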

But this isn't really the behavior a developer was aiming for when they added the column constraint on the not-so-hypothetical expense_reports table. The actual goal was to have a single source of information about an expense report for a single trip. The software the marketing department uses for expense reporting might be wildly different from the expense reporting software used by a technician doing on-site product support. The fundamental point is that no one cares how many objects are created, or which ones, as long as the database is internally consistent.

Your ORM is really an application framework

Most of the software that gets referred to as an ORM is really an application-level framework. The job of an ORM is to provide a bridge between database relations and objects, but ORMs almost always grow to cover other areas such as schema migrations and object serialization. ActiveRecord, for example, also places itself at the top of your object hierarchy, which is well beyond the scope of bridging objects and relations. ActiveRecord really is both an ORM and an application framework.

What I find particularly bizarre are frameworks that expect you to write schema migrations in some language other than SQL. I still cannot wrap my head around why this is supposedly an advantage. If I want to alter my database relations, there is already a language for that: SQL.

Why are ORMs used at all?

The reasons why ORMs get used seem to vary wildly. This is an abbreviated list of the reasons I've heard:

  1. I don't know SQL
  2. I don't have time to write SQL
  3. If you don't use a framework, your code is unmaintainable

Pick one or pick them all; it doesn't really matter to me. I never understood these points until recently. It turns out, of course, that I go about using an RDBMS in a fundamentally different way than most people. It wasn't until I read Jeff Atwood's comments that this became apparent to me. The problem most people run into is the "object/relational impedance mismatch," which is summarized as follows:

Object systems are typically characterized by four basic components: identity, state, behavior and encapsulation. ... Relational systems describe a form of knowledge storage and retrieval based on predicate logic and truth statements.

This excerpt is from Ted Neward's The Vietnam of Computer Science.

He goes on to state:

In truth, this basic approach--to slave one model into the terms and approach of the other--has been the traditional answer to the impedance mismatch, effectively "solving" the problem by ignoring one half of it.

The summary at the end perfectly captures the possible outcomes of "solving" this problem. My current approach falls into the following category:

3. Manual mapping. Developers simply accept that it's not such a hard problem to solve manually after all, and write straight relational-access code to return relations to the language, access the tuples, and populate objects as necessary. In many cases, this code might even be automatically generated by a tool examining database metadata, eliminating some of the principal criticism of this approach (that being, "It's too much code to write and maintain").

The reason I am so comfortable with this approach is that I write the SQL DDL statements first, to capture the information I need to store. Only later do I figure out how to make my software work with that schema. This is in complete contrast to the approach most people choose, which is to force the database into being a box in which they can store the objects they already have.

I find this approach to be tremendously less painful. If I had the ideal set of objects to capture the data domain of whatever problem space I am working in, I doubt I would wind up with an id member on every single object. Furthermore, there is no way I'd wind up with a trip_id member on my ExpenseReport object; you'd simply store a reference or pointer to the Trip instance, right there for everyone to access. Trying to force an RDBMS to capture all of this seems like an exceptionally torturous practice.

On the other hand, if you designed the SQL schema first, creating some relatively simple objects that allow you to interact with the schema is incredibly straightforward. If you've got an expense report and you want the trip associated with it, you call Trip.find_by_id(an_expense_report.trip_id). There is a code smell to this, but the smell is there because you are using the database as it is intended. Furthermore, the code is actually easier to reason about: anyone with a vague understanding of how an RDBMS works instantly knows that this is a database query, that the result might change on subsequent calls, and that the whole operation can fail if the network is down.

Two ends of a continuum

What I have discovered is that there exists a continuum with two ends. On the left side of the continuum are relational concepts; on the right side are object-oriented concepts. When a database is incorporated into software, the developer must pick a point on this continuum. This is always done, even if the decision is made subconsciously.

If you choose the right side of the continuum, the database winds up being used as a persistence layer: objects are created, modified, and deleted, and that is pretty much it. The obvious problems with this approach are:

  1. The overhead of object access can become unacceptably high. If you need to retrieve and modify thousands of objects from the database, you will find it is too slow to use basic CRUD methods that operate on a single object at a time. The solution is to expose functionality that allows you to retrieve or modify many objects with a single trip to the database.
  2. The problem of object identity and object uniqueness is not addressed. If you build tables that simply store the information associated with each object type, then it is possible to get duplicate objects in your database. A mitigating approach is to include software checks that are invoked when an object is created or updated; each check says yes or no as to whether the object can be saved. But such checks generate false positives and false negatives due to race conditions, with the false negatives being less of a problem than the false positives. The real solution is to add column constraints to the tables.
  3. The RDBMS fails to capture the "is-a" relationship amongst types. There are many possible solutions to this problem. You can build a single flat table that captures all of the attributes of the child types as columns, but then each row has more null columns than populated ones, and the table can become unacceptably large. You can instead create separate tables for each child type, but then object uniqueness is no longer enforceable with column constraints. Usually a hybrid approach is chosen: a single parent table holds the columns with uniqueness constraints, along with a column storing the child type and a foreign key into the child table. At a minimum this imposes a performance penalty, because each read or write winds up being a JOIN between two tables.
  4. The read-modify-write behavior of CRUD operations results in data loss. If two different threads of execution can modify the same object, the changes of one overwrite the changes of the other. This is addressable by using a version or timestamp column on each table: whenever the object is updated, the column is checked, and if it has changed since the object was read the write is rejected. The problem then becomes choosing the right strategy for handling write rejections; the choice is between abandonment and retries, and neither is universally appropriate.

My proposed solutions to each of these problems allow relational concepts to bubble up into an object-oriented world. There are surely more problems with this approach; I just have not run into them yet. It is worth noting that if you truly just need an object store, an RDBMS is massive overkill. If you can serialize your objects and assign a unique identifier to each, a key-value store should work very well.

The left end of the spectrum causes problems as well.

  1. Once you build a schema with strong integrity constraints, an in-memory representation is of limited utility. You can create objects that display information stored in the schema, but as soon as you begin to modify that data, you can build impossible objects in memory. The in-memory representation can be made consistent by duplicating the constraints in software, but this creates a maintenance nightmare.
  2. The database types may be too rigid for your domain. This problem is partially mitigated by the presence of extensions such as hstore and json in PostgreSQL.
  3. A simple schema may not map to how you want to display data. The powerful relational algebra behind SQL allows you to present your data in almost any format you need, but performance suffers when a large number of JOINs are involved. To address this, more column indexes can be added, which in turn hurts the performance of UPDATE and INSERT statements because the indexes must be updated as part of each query.

There are still more problems than this, but once again I have not run into them yet.

My approach to RDBMS usage

The important thing to take away from comparing the two ends of the spectrum is that the left end gives up data flexibility and ease of access while the right end gives up linearizability. While it may be an annoyance to work within the data representations of SQL, linearizability is far more valuable than anything else you lose.

My reasoning behind this choice is that when your data manipulation is linearizable, building a distributed system is simpler. The phrase "distributed system" immediately evokes the idea of horizontal scaling at incredibly large scale. I am actually far more interested in designing architectures that are distributed in the sense that multiple separate processes each fulfill a role within a system. The processes need not be identical, and any shared state between them lives in the database. This design decision does not automatically lend itself to a scalable or failure-tolerant design. What it does allow me to create is a system in which each component has a limited set of responsibilities. By placing a limited set of responsibilities in each component, you gain several advantages.

  1. Each component is easier to develop and test.
  2. Developers can comprehend and maintain components in isolation.
  3. New functionality can be added to the system by adding new components.
  4. Each component can use the software architecture that makes the most sense for the domain of its responsibilities.

If your system has a large amount of shared state that is frequently accessed or manipulated, it is a sign that the division of responsibilities is poorly chosen. When you minimize the amount of shared state, you can build a system without running into the scalability barriers associated with an RDBMS.

My goal is to use the RDBMS as the immutable source of truth as to the state of a software system. The data integrity constraints and linearizability guarantees of an RDBMS are well suited to this. Even if the software goes completely awry, the database should be designed with enough integrity guarantees to allow the disaster to be recovered from.

When the database types are not sufficient to capture the information you need to store, there are easy solutions. If the data is never used in a constraint or as a query parameter, you can safely serialize it into a single column. The choice of serialization format is mainly one of portability: if multiple software projects all look at the same data, it makes sense to choose a format like JSON.
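For instance, a structure whose fields are never used in a WHERE clause can be flattened to JSON before being written to a text or json column. The Receipt type here is hypothetical:

```go
package main

import (
    "encoding/json"
    "fmt"
)

// Receipt is never used in a WHERE clause, so storing it opaquely is fine.
type Receipt struct {
    Vendor string  `json:"vendor"`
    Amount float64 `json:"amount"`
}

// receiptToColumn serializes a receipt for storage in a text or json column.
func receiptToColumn(r Receipt) (string, error) {
    b, err := json.Marshal(r)
    return string(b), err
}

// receiptFromColumn reverses the serialization when reading the row back.
func receiptFromColumn(s string) (Receipt, error) {
    var r Receipt
    err := json.Unmarshal([]byte(s), &r)
    return r, err
}

func main() {
    s, _ := receiptToColumn(Receipt{Vendor: "Acme Hotel", Amount: 131.07})
    fmt.Println(s) // {"vendor":"Acme Hotel","amount":131.07}
    r, _ := receiptFromColumn(s)
    fmt.Println(r.Vendor, r.Amount)
}
```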

Where to "Go" from here

The most puzzling fact in all of this is that there does not seem to be a framework that uses code generation as a strategy for accessing an RDBMS from Golang. Code generation is expected to be standardized (via go generate) in Golang 1.4. I am not the only person with this view.

Given my "SQL First" strategy, it makes sense to generate Golang code from the schema of the database. To do this, you need to be able to parse the SQL DDL statements. I discovered that the Vitess project includes an SQL parser, but unfortunately it does not support DDL statements. FoundationDB offers an open-source SQL parser in Java that appears to be well documented and able to parse all SQL statements, but there is something a little strange about requiring Java as a build-time dependency of a Golang project.

I have not yet committed to writing a code generation library to address this problem. If I do, my preference is to stay very close to the relational side. My goals for such a library are:

  1. Generate code by parsing the SQL DDL statements of the database.
  2. Expose relational concepts in any generated objects.
  3. Use a fluent interface to expose a query builder. The resulting SQL should be non-ambiguous.
  4. Facilitate greater compile time correctness checking of code.
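As a rough sketch of goals 2 and 3, and nothing more than a sketch: a hypothetical fluent builder that qualifies every column with its table name, so the generated SQL cannot be ambiguous. Every name in this example is my own invention:

```go
package main

import (
    "fmt"
    "strings"
)

// SelectQuery is a hypothetical fluent builder. Every column is qualified
// with its table name so the generated SQL is unambiguous.
type SelectQuery struct {
    table   string
    columns []string
    where   []string
}

func Select(table string, columns ...string) *SelectQuery {
    q := &SelectQuery{table: table}
    for _, c := range columns {
        q.columns = append(q.columns, table+"."+c)
    }
    return q
}

func (q *SelectQuery) Where(cond string) *SelectQuery {
    q.where = append(q.where, cond)
    return q
}

// SQL renders the accumulated query as a single string.
func (q *SelectQuery) SQL() string {
    s := "SELECT " + strings.Join(q.columns, ", ") + " FROM " + q.table
    if len(q.where) > 0 {
        s += " WHERE " + strings.Join(q.where, " AND ")
    }
    return s
}

func main() {
    q := Select("expense_reports", "id", "title").Where("expense_reports.id = $1")
    fmt.Println(q.SQL())
    // SELECT expense_reports.id, expense_reports.title FROM expense_reports WHERE expense_reports.id = $1
}
```

In a real code generation library, the Select and Where calls would be emitted from the parsed DDL rather than written by hand, which is where the compile-time correctness checking of goal 4 would come from.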

When I develop a library, I want to realize gains immediately as a result of my work. So I will need to build a library that establishes a useful core of functionality immediately and allows new features to be added as I need them.


© Eric Urban 2014