Ben Homnick

Building Django admin inlines for non-related models

Ben Homnick — Wed, 15 Jul 2020 09:12:00 GMT

Django's admin app is an incredible piece of software, making it trivial to do 99% of what you'd expect a CRUD admin interface to do with minimal code. Unfortunately there's always that 1% of cases when you need to step outside the pattern, and then it's down the rabbit hole of digging through source code, messing with private interfaces, and inevitable tears.

Recently a client asked me to experiment with implementing an admin inline view for a third party app where records are linked via natural keys but no explicit foreign key relationship exists between models. It turns out not only is this possible with minimal tears, it can also be applied in a more generic way to allow arbitrary relationships between any two models.

The code is available on GitHub.

The players

There are four components at work when dealing with inline admin relationships:

The inline model admin class (I'll use admin.StackedInline for this example). This you're probably already familiar with. It's a more specialized case of a regular ModelAdmin that has configuration options for related models.
An instance of admin.checks.InlineModelAdminChecks. This class basically checks to make sure a related model instance and its parent do actually have a relationship. It's part of Django's system checks framework, and we're going to fool it.
A formset for handling creation of non-related inline objects. This will be a subclass of forms.models.BaseModelFormSet. The base formset makes assumptions about FK relationships, so we're going to override those.
A formset factory for building classes of the formset in the previous step with a few slight modifications, mostly to pass things from our inline admin on to our formset class.

The inline admin

The goal here is to move as much of the configuration for this relationship into the inline admin subclass. Where an explicit FK relationship would know exactly how to fetch related models and save or update related models, we'll create methods here to allow subclasses to define that behavior. The interface will look like this:

class NonrelatedStackedInline(admin.StackedInline):
    
    def get_form_queryset(self, obj):
        raise NotImplementedError()

    def save_new_instance(self, parent, instance):
        raise NotImplementedError()

Here the get_form_queryset, given a parent object, should return all related child instances. On the other side, save_new_instance should set up the implicit relationship between child instances whenever a form is saved.

To give you an example, let's say we have a third-party app that handles billing and contains Invoice objects. Due to circumstances beyond our control these invoices don't have FKs to our User model. Instead, we know an Invoice belongs to a User if they share the same email address. Now if I want to create an Invoice belonging to a User, I just need to make sure I set the email address of that Invoice to whatever the user's email is, so my subclass would look like:

class UserInvoiceStackedInline(NonrelatedStackedInline):
    model = Invoice

    def get_form_queryset(self, obj):
        return self.model.objects.filter(email=obj.email)

    def save_new_instance(self, parent, instance):
        instance.email = parent.email

And that's it. Pretty straightforward.

Now that we have our interface defined, we need to make sure our lookup and save methods are called in the correct places. This is where things get a little tricky.

The inline base class has a method get_formset that is called as part of the form lifecycle. Normally get_formset ends up calling the built-in inlineformset_factory with the parent model, child model, and a bunch of other configuration options, but this won't work in our case because the default formset makes too many assumptions about FK relationships. We'll modify the get_formset method to call a factory for a custom formset we'll define later, and use this opportunity to query get_form_queryset and pass that along. We'll pass our save_new_instance method as well so the formset save method will have access to this as a callback.


class NonrelatedStackedInline(admin.StackedInline):
    ...

    def get_formset(self, request, obj=None, **kwargs):
        ...
        defaults = {
            ...
        }  # This is all from the parent class method

        # Pass our custom queryset on to the formset factory
        queryset = self.model.objects.none()
        if obj:
            queryset = self.get_form_queryset(obj)
        defaults['queryset'] = queryset

        return nonrelated_inlineformset_factory(
            self.model,
            save_new_instance=self.save_new_instance,
            **defaults
        )

Our formset factory now has everything it needs to deal with these two models.

The formset and formset factory

Our goals for our custom formset are to 1) override the queryset used to look up instances, and 2) override the save method so it uses the callback we've defined in our inline admin.

For the first, a BaseModelFormset has a queryset attribute that it uses to find relevant instances, but unfortunately it's overwritten during __init__ in a complicated way and we need to be a little creative here. What I ended up doing was using the factory method to define what we want to queryset attribute to look like after initialization and then setting it explicitly once the parent class __init__ method is finished. That looks like this:

def nonrelated_inlineformset_factory(
    model, obj=None,
    queryset=None,
    formset=NonrelatedInlineFormSet,
    save_new_instance=None,
    **kwargs
):
    FormSet = modelformset_factory(model, formset=formset, **kwargs)
    FormSet.real_queryset = queryset
    FormSet.save_new_instance = save_new_instance
    return FormSet

We're essentially attaching our desired queryset as a class attribute. I've also taken the opportunity to attach our save callback.

Now to set the queryset attribute:

class NonrelatedInlineFormSet(BaseModelFormSet):
    def __init__(self, instance=None, save_as_new=None, **kwargs):
        self.instance = instance
        super().__init__(**kwargs)
        self.queryset = self.real_queryset

Slightly hacky, yes, but our formset will now use whatever we've specified in the inline's get_form_queryset method instead of the default FK lookup. In addition we also have a reference to the parent instance we can use when it comes time to save related instances later.

The last piece for forms is saving new instances. To do this we'll just override the built-in save_new method and have it call the save_new_instance callback we also defined in our admin inline:

class NonrelatedInlineFormSet(BaseModelFormSet):
    ...
    def save_new(self, form, commit=True):
        obj = super().save_new(form, commit=False)
        self.save_new_instance(self.instance, obj)
        if commit:
            obj.save()
        return obj

If you try to run this as-is, Django will still complain that there's no FK relationship between these models, so let's fix that next.

The checks object

If you look at the Django source code for InlineModelAdmin, it has a checks_class attribute that is set to InlineModelAdminChecks. This class performs quite a few sanity checks on an inline object, most of which are still applicable to non-related inlines such as whether or not certain configuration values are the right type, the formset has the right base class, etc. There are two in particular, however, that we want to override:

class NonrelatedInlineModelAdminChecks(InlineModelAdminChecks):
    def _check_exclude_of_parent_model(self, obj, parent_model):
        return []

    def _check_relation(self, obj, parent_model):
        return []

These two checks examine the relationship between an inline object and its parent model, and I've overriden them to simply return an empty array indicating all is well, since as far as Django can tell there is no actual relationship here. Now we need to attach that checks class to our inline admin class:

class NonrelatedStackedInline(admin.StackedInline):
    checks_class = NonrelatedInlineModelAdminChecks
    ...

That's it, we now have the 4 components necessary to deal with non-related inline objects.

A full example

Continuing with the user and invoice example above, I would most likely want my inline admin to look like this:

class CustomerInvoiceStackedInline(NonrelatedStackedInline):
    model = Invoice
    fields = [
        'id',
        'amount'
        ...
    ]

    def get_form_queryset(self, obj):
        return self.model.objects.filter(email=obj.email)

    def save_new_instance(self, parent, instance):
        instance.email = parent.email

I've included an explicit fields list so that the email field is hidden. It's not the end of the world if it shows up on the form, as it will just be reset to the correct value on the next save, but it should be implicit and is unnecessary to display.

Generalizing this pattern

To make this more useful, I've packaged these components into a PyPI release along with some unit tests against different Python and Django versions since we are mucking around a bit with private interfaces which could change in future releases. I'll include writeup on how these were tested in a future article.

factom-api is now part of factomProject

Ben Homnick — Sun, 30 Jun 2019 09:03:00 GMT

The Factom Python API client described in my previous post has been picked up by the Factom Project and is now part of their official Github org. You can find the updated project here:

https://github.com/FactomProject/factom-api

Expect more of the API to be covered soon.

A Design Pattern for Python API Client Libraries

Ben Homnick — Wed, 24 Jan 2018 14:47:11 GMT

Recently I've been getting more and more interested in blockchain and ways it can be used in the education industry, in particular how to secure student information to prevent tampering and allow audits without exposing sensitive user data. The details of that project will come in the next post, but as part of the process I ended up writing a new Python client library for Factom, a blockchain service provider. I'm going to use this post to share some thoughts on what makes a good API client and share the patterns I follow when designing client libraries from scratch, using the Factom client as an example.

Before we get into the technical details, I first want give some background on the Factom Project. Given the obscene amount of money being blindly thrown into cryptocurrencies, it's incredible how few blockchain services with any kind of practical application exist. Factom is a refreshing exception to that trend. If you're unfamiliar with the project I encourage to check out the whitepaper, but tl;dr Factom is a service layer on top of the Bitcoin blockchain that allows arbitrary amounts of data to be secured quickly and at a fixed cost. It's a perfect fit for any application that needs an immutable records system and was the natural choice for this project.

The code for the API client is available on GitHub.

Design goals

Let's first consider the interface from a developer's perspective and then we'll take a look under the hood. These are the goals I keep in mind when designing a client library:

1. Minimal boilerplate: The whole point of a library is to reduce the number of lines a developer needs to write. Ideally we'd write zero lines of code, but as I haven't figured out how to get away with that yet, let's assume the best we can do is one line per action. One line of code gets you a client interface that's ready to go. One line of code gets you the result of a method call.

# Instantiate a new client interface and query the balance of 
# Factoid tokens at our address.
>>> client = Factomd(username='rpc_username', password='rpc_password')
>>> client.factoid_balance(fct_address)
{'balance': 50000}

2. Seamless integration: As a developer, the last thing I want to have to think about is transport layer or server implementation details. Ideally, I want my API clients to look like any other local class or module interface. Methods should take native data types and spit out native data types.

# Read all of the entries in a Factom chain. Make a JSON-RPC call 
# to the 'chain-head' method at localhost:8088/v2 with the chain 
# ID as a parameter to determine the Key Merkle Root of the first 
# data-containing entry block and I really don't want to think
# about this I just want my data...
>>> factom.read_chain(chain_id)
[{'chainid': 'da2ffed0ae7b33acc718089edc0f1d001289857cc27a49b6bc4dd22fac971495', 'extids': ['random', 'entry', 'id'], 'content': 'entry_content'}, {'chainid': 'da2ffed0ae7b33acc718089edc0f1d001289857cc27a49b6bc4dd22fac971495', 'extids': ['random', 'chain', 'id'], 'content': 'chain_content'}]

3. Sensible error handling: There's a special circle of hell reserved for libraries whose error handling mechanism consists of checking for a 4XX HTTP status code. Good error handling should make use of language control structures and error types. Python exceptions are great:

# Create a new Factom chain.
>>> walletd.new_chain(factomd, chain_id, chain_content, ec_address=ec_address)
Traceback (most recent call last):
  File "", line 1, in 
  File "/src/factom/client.py", line 196, in new_chain
    'ecpub': ec_address or self.ec_address
  File "/src/factom/client.py", line 56, in _request
    handle_error_response(resp)
  File "/src/factom/exceptions.py", line 18, in handle_error_response
    raise codes[code](message=message, code=code, data=data, response=resp)
factom.exceptions.InvalidParams: -32602: Invalid params

# Uh oh. Let's try that again with some exception handling.
>>> try:
...     walletd.new_chain(factomd, chain_id, chain_content, ec_address=ec_address)
... except InvalidParams as e:
...     print(e.message)
Chain already exists

The code

Typically, I organize my API clients into 3 modules:

`session.py`

This is where transport layer-related code lives. For HTTP clients the session class is usually wrapped around a requests.Session object and is responsible for maintaining authentication and configuring TLS--basically anything having to do with making the actual remote request.

A very simple session.py might look like:

class FactomAPISession(requests.Session):

    def __init__(self, *args, **kwargs):
        super(FactomAPISession, self).__init__(*args, **kwargs)
        
        self.headers.update({
            'Accept-Charset': 'utf-8',
            'Content-Type': 'text/plain',
            'User-Agent': 'factom-api/{}'.format(VERSION)
        })
    
    def init_basic_auth(self, username, password):
        credentials = b64encode('{}:{}'.format(username, password).encode())
        self.headers.update({
            'Authorization': 'Basic {}'.format(credentials.decode())
        })

Here we're initializing a basic HTTP session and setting a few custom headers, as well as providing a method to configure basic HTTP authentication via username and password. Reusing the session reduces a quite a bit of boilerplate when it comes time to make actual requests.

`client.py`

The client module provides the interface class. It typically contains a single class that gets initialized with any configuration needed to build the API client and contains the methods for making requests.

class Factomd(object):

    def __init__(self, host='localhost:8090', version='v2', username=None, password=None):
        """
        Instantiate a new API client.
        Args:
            host (str): Hostname of the factomd instance.
            version (str): API version to use. This should remain 'v2'.
            username (str): RPC username for protected APIs.
            password (str): RPC password for protected APIs.
        """
        self.version = version
        self.host = host

        # Initialize the session.
        self.session = FactomAPISession()

        # If authentication credentials are provided, pass them
        # along to the session.
        if username and password:
            self.session.init_basic_auth(username, password)

    # Convenience method for building request URLs.
    @property
    def url(self):
        return urljoin(self.host, self.version)

    # Perform an API request.
    def _request(self, method, params=None, id=0):
        data = {
            'jsonrpc': '2.0',
            'id': id,
            'method': method,
        }
        if params:
            data['params'] = params

        # Ask the session to perform a JSON-RPC request 
        # with the parameters provided.
        resp = self.session.request('POST', self.url, json=data)

        # If something goes wrong, we'll pass the response
        # off to the error-handling code
        if resp.status_code >= 400:
            handle_error_response(resp)

        # Otherwise return the result dictionary.
        return resp.json()['result']

    # API methods

    def chain_head(self, chain_id):
        return self._request('chain-head', {
            'chainid': chain_id
        })

    def commit_chain(self, message):
        return self._request('commit-chain', {
            'message': message
        })

    def commit_entry(self, message):
        return self._request('commit-entry', {
            'message': message
        })

    ...

The real meat and potatoes of this class is the _request method which essentially translates Python parameters into a format that the transport layer can deal with and asks the session to make a request. The result is either then returned or, if an error condition is present, sent off to the exception-handling code (below) to be turned into a native exception.

I prefer to explicitly define class methods for each API endpoint to add a little more structure, but you could just as easily do this in a generic way by accepting an RPC method name or URL as a parmeter, but then you're toeing the line of forcing the developer to consider server details.

`exceptions.py`

This module deals with (you guessed it) error handling. These are the bits that translate error conditions in API responses into useful exceptions the developer can work with.

To me, a useful API error has 3 components: Some kind of constant code or type that can be used in conditionals to vary behavior based on different categories of errors, a user-friendly error message explaining what went wrong, and additional metadata unique to the error type. To give you an example, here's what I consider a good form validation error:

{
    "code": "validation_error",
    "message": "One or more fields could not be validated.",
    "data": {
        "email": ["This field is required"],
    }
}

From just this we know we're dealing with a validation_error type of error, so we should be looking for more data (in this case field-specific errors) under the data key to highlight problematic fields. We can also display the "One or more fields..." message in an alert to notify the user where to look.

Here's the error handling code:

# Take a HTTP response object and translate it into an Exception
# instance.
def handle_error_response(resp):
    # Mapping of API response codes to exception classes
    codes = {
        -1: FactomAPIError,
        -32700: ParseError,
        -32600: InvalidRequest,
        -32602: InvalidParams,
        -32603: InternalError,
        -32601: MethodNotFound,
        -32011: RepeatedCommit,
        -32008: BlockNotFound,
    }

    error = resp.json().get('error', {})
    message = error.get('message')
    code = error.get('code', -1)
    data = error.get('data', {})

    # Build the appropriate exception class with as much
    # data as we can pull from the API response and raise
    # it.
    raise codes[code](message=message, code=code, data=data, response=resp)


class FactomAPIError(Exception):
    response = None
    data = {}
    code = -1
    message = "An unknown error occurred"

    def __init__(self, message=None, code=None, data={}, response=None):
        self.response = response
        if message:
            self.message = message
        if code:
            self.code = code
        if data:
            self.data = data

    def __str__(self):
        if self.code:
            return '{}: {}'.format(self.code, self.message)
        return self.message

# Specific exception classes

class ParseError(FactomAPIError):
    pass


class InvalidRequest(FactomAPIError):
    pass

...

The handle_error_response method attempts to build the appropriate exception class from the API's response including the error code, message, and additional data. If it can't do this then a generic FactomAPIError is raised with the default values.

Usually it's enough to catch a FactomAPIError, however if more control is needed we can catch specific error types, falling back to the generic exception:

try:
    client.do_something()
except ParseError:
    # Do something more specific here
except FactomAPIError:
    # Do something more generic here

Other types of APIs

This pattern works well for RPC-style APIs, but tends to break down for more object-based or RESTful APIs as having a single interface class gets messy quickly. In those cases I find it makes more sense to break the interface down to resource-level, modeling things more like an ORM. I'll cover that in a later post, next time I find the need to build one.

Cracking CAPTCHAs using the vector space model: analysis of a security weakness in the Taiwan Railways Administration online booking system

Ben Homnick — Tue, 05 Dec 2017 12:20:52 GMT

Disclaimer: The methods described in this article are for learning purposes only and should not be used against production systems you do not own. Since publication TRA has updated their software and the methods described here pose no risk to their systems.

One of the most frustrating things about the design of the Taiwan Railways Administration (TRA) online booking system is the lack of transparency into whether or not seats are actually available on a train before attempting to book. A customer is required to fill out an entire page of booking details and complete a CAPTCHA only to be shown an "Order Ticket Fail：No seat available in this journey" message.

With no other options to query seat availability, the obvious solution is to automate these booking requests until seats are found. CAPTCHAs are typically designed to protect resources from non-human requests like this, but some CAPTCHAs are designed better than others, and with a bit of scripting it turns out the TRA booking system can be fooled relatively easily.

This project demonstrates a method for doing just that using Python and some basic image processing. The code is available on GitHub.

Cleaning the captcha

There are two main pieces to breaking the catpcha: isolating characters from the original image, and then guessing what each character is by comparing image data to a sample training set.

To give you an example of what we're dealing with, here are a few examples of captchas I've encountered:

As you can see they're not the greatest captchas in the world. There's some color variation, a bit of background noise, rotation in the characters, but the real weakness is only numerical digits are used and all captchas only have either 5 or 6 characters. This means we'll only have to guess among 10 possibilities instead of 30+ you'd expect in a normal alphanumeric captcha. We also know exactly how many characters to expect, which significantly increases the accuracy of our decoder.

The first step is to remove anything that isn't part of a character, so we'll try and get rid of the background color and those weird web-1.0-looking boxes. We can take advantage of the fact both the background and boxes are generally lighter than the characters and filter out colors above a certain pixel value.

colors = sorted(
    filter(
        lambda x: x[1] < max and x[1] > min,
        im.getcolors()
    ),
    key=itemgetter(0),
    reverse=True
)[:n]

Here we're returning the n most prominent colors in the image by pixel color frequency, ignoring pixels above (lighter than) max or below (darker than) min. All the images we're working with here have been converted to 256 color pallete gifs, so 255 always represents white. Next we'll use this list of colors to build a new monochrome image using only pixels from the original that appear in our list of prominent colors.

sample = Image.new('1', im.size, 1)
width, height = im.size
for col in range(width):
    for row in range(height):
        pixel = im.getpixel((col, row))
        if pixel in colors:
            sample.putpixel((col, row), 0)

If we run this against the top right captcha in the image above the result looks like this:

This is already looking pretty reasonable, but some of the background noise still remains and will cause trouble when we try and break the image into characters. This can be removed by doing some more filtering:

im = im.filter(ImageFilter.RankFilter(n, m))

The rank filter takes windows of n pixels, orders them by value, and sets the center point as the mth value in that list. In this case, I'm using n=3 and m=2, but these numbers were just eyeballed and there's probably a better way to do this. Here's what the filtered image looks like:

Now it's looking pretty obvious where the characters are. Things are a little blurrier from the filter but luckily this doesn't matter as the character shapes are still unique enough for comparison. The last step is to break them up into features.

arr = numpy.asarray(im.convert('L'))
labels, _ = ndimage.label(numpy.invert(arr))
slices = ndimage.find_objects(labels)

features = []
for slice in slices:
    feature = arr[slice]
    features.append((feature, numpy.count_nonzero(feature),
                     slice[1].start))

Here we're taking advantage of scipy's ndimage to determine slices representing contiguous areas of black pixels in our image, pull those features out of the image, and create a list of features along with their pixel count and starting x-coordinate. Since we know there are only 5 or 6 characters in the captcha we only need to look at the 6 largest features.

features = sorted(
    filter(lambda x: x[1] > min_pixels, features),
    key=itemgetter(1),
    reverse=True
)[:max_features]

features = sorted(features, key=itemgetter(2))

Features with pixel counts below min_pixels are filtered out, we only return max_features features (6 in this case), and the features are sorted from left to right. This list, when converted back to images, looks how we'd expect:

Now it's time to take a guess at what these characters are.

Using the vector space model to compare images

Some research (googling) led me to Ben Boyter's Decoding Captchas book where he uses a vector space model for classifying images and calculating similarity values. It works, the math isn't too complicated, and it sounds cool to talk about things in vector space. That's good enough for me.

The vector space model is something that appears frequently in text search. Documents can be represented as vectors where components are weights of specific terms, and it's then possible to retrieve relevant documents by comparing a query to a list of documents by examining the angle between their respective vectors.

We can do the same thing for images, except instead of words we're going to be identifying images by pixel values. Let's take these three blown-up images for example:

It's obvious to a human A and C are more similar visually. Now we need to teach our future AI overlords.

We can convert each of these images into vectors by setting each component to a pixel value, either 0 for white or 1 for black, left to right, top to bottom. Doing this yields:

$$
\bf{A} = [0, 1, 1, 1, 0, 0, 0, 1, 0]\\
\bf{B} = [0, 0, 1, 0, 1, 1, 1, 1, 0]\\
\bf{C} = [1, 1, 1, 1, 0, 0, 0, 1, 0]\\
$$

Now that we have vector space representations of our images, we need to compare them somehow. It turns out the cosine of the angle between two vectors is relatively easy to calculate:

$$
\cos\theta=\frac{\bf{A}\cdot\bf{B}}{\vert\vert\bf{A}\vert\vert\:\vert\vert\bf{B}\vert\vert}
$$

Remember both the norm of a vector and the dot product of two vectors return scalars, which is what we expect since we're looking for a cosine. Identical images will have no angle between them so $\theta=0$ and $\cos\theta=1$, and since we're only dealing with postive values, completely unrelated images will be orthogonal, i.e. $\theta=\pi/2$ and $\cos\theta=0$. Let's do the math for A and B:

$$
\bf{A}\cdot\bf{B}=\sum_{i=1}^n\it{A_iB_i}=2\\
\vert\vert\bf{A}\vert\vert\:\vert\vert\bf{B}\vert\vert=\sqrt{\sum_{i=1}^n\it{A}_i^2}\sqrt{\sum_{i=1}^n\it{B}_i^2}=\sqrt{20}\\
\cos\theta=\frac{2}{\sqrt{20}}\approx0.4472
$$

Not so close. However, if we do the same for A and C:

$$
\cos\theta=\frac{4}{\sqrt{20}}\approx0.8944
$$

That's a much better match. Python makes it easy to do this for PIL images:

def cosine_similarity(im1, im2):
    def _vectorize(im):
        return list(im.getdata())

    def _dot_product(v1, v2):
        return sum([c1*c2 for c1, c2 in izip(v1, v2)])

    def _magnitude(v):
        return sqrt(sum(count**2 for c in v))

    v1, v2 = [vectorize(im) for im in scale(im1, im2)]

    return _dot_product(v1, v2) / (_magnitude(v1) * _magnitude(v2))

Making some training data

Armed with a way of isolating characters from the original captcha and a method for comparing image similarity we can start cracking captchas.

The one tricky thing about these captchas is the characters have a random rotation angle. And while we could try to deskew the characters or rotate them and compare to a sample iconset, I thought it would probably be easier just to generate an iconset containing samples already rotated at various angles.

Putting it all together

It's a bit brute force, but now all our decoder has to do is compare each isolated character to the sample iconest and pick the character with the greatest similarity:

def guess_character(self, im, min_similarity=0.7, max_guesses=5):
    guesses = []

    for symbol, icon in _get_icons():
        guess = cosine_similarity(im, icon)
        if guess >= min_similarity:
            guesses.append((symbol, guess))
    
    return sorted(
        guesses,
        reverse=True,
        key=itemgetter(1)
    )[:max_guesses]

Next steps

My initial goal for the project was 20% accuracy and I was pleasantly surprised that just eyeballing parameters for the image processing yielded a whopping 40% accuracy over my test dataset. Note to developers: that's what happens when you try and roll your own captcha. The biggest trip-ups seem to be overlapping characters and leftover noise artifacts from filtering.

Overlapping characters could be dealt with by considering a maximum pixel size for features. We already have a minimum feature pixel size to filter out leftover background noise, so why not add a maximum pixel feature size, split chunks larger than that horizontally, and hope they're not too weird for the guessing bits? Because I'm lazy, that's why.

Filter settings could probably be optimized further to deal with leftover background noise, but at the end of the day I'm calling 40% squarely in "ain't broke" territory.

Thanks for reading.

Building an exchange limit order book in Go

Ben Homnick — Wed, 08 Nov 2017 13:25:31 GMT

With all of the hype around cryptocurrency trading I thought it might be fun to take a look at what it takes to build an exchange order book. It's been a while since I've worked on a Go project, so it seemed like a good choice for a refresher.

For this project, I wanted my basic exchange to:

Support adding, canceling, and executing buy/sell orders
Execute all operations in as close to O(1) as possible
Be durable and recoverable in the event of failure

I never got around to implementing the third, but I'll talk about the basic operations and performance in more detail below, along with some ideas for improvements. The project source is available on GitHub.

Under the hood

There are 3 data models worth looking at:

type PricePoint struct {
    orderHead *Order
    orderTail *Order
}

Each PricePoint represents a discreet limit price. For products that need to be priced at higher precision this might not be a scalable solution, but for this exercise I'm going to assume whatever we're trading is priced in USD, so 8 or 9 digits is the max we'll need and can easily be handled in memory. Each PricePoint just contains pointers to the first and last order entered at that price for ease of executing and appending new orders.

type Order struct {
    id uint64
    isBuy bool
    price uint32
    amount uint32
    status OrderStatus
    next *Order
}

Each Order is either a buy or sell, has a limit price and amount, and a status that lets us know whether the order is open, partially filled, filled, etc. This value isn't actually used by the exchange as you'll see below, but it might be useful to report to trader clients. Lastly, each order is linked to the next order at the same price point so we can ensure orders are examined in the order they are entered.

type OrderBook struct {
    ask uint32
    bid uint32
    orderIndex map[uint64]*Order
    prices [MAX_PRICE]*PricePoint
    actions chan<- *Action
}

The OrderBook does most of the heavy lifting by keeping track of the current maximum bid and minimum ask, an index mapping order IDs to pointers so we can easily cancel outstanding orders, an array of all possible price points, as well as a channel to report actions (fills, cancellations, etc.) to some handler as they occur.

Adding orders

Inserting a new order into the book is just a matter of appending it to the list of orders at its price point, updating the order book's bid or ask if necessary, and adding an entry in the order index. These are all O(1).

func (ob *OrderBook) openOrder(o *Order) {
    pp := ob.prices[o.price]
    pp.Insert(o)
    o.status = OS_OPEN
    if o.isBuy && o.price > ob.bid {
        ob.bid = o.price
    } else if !o.isBuy && o.price < ob.ask {
        ob.ask = o.price
    }
    ob.orderIndex[o.id] = o
}

Canceling orders

Canceling is done by looking up the order by ID from the index and setting the amount to 0, also O(1). We don't actually need to remove the order from the book, just ignore it.

func (ob *OrderBook) CancelOrder(id uint64) {
    if o, ok := ob.orderIndex[id]; ok {
        o.amount = 0
        o.status = OS_CANCELLED
    }
    ob.actions <- NewCancelledAction(id)
}

Filling orders

Filling orders in the case of a sell is done by starting at the max bid and iterating over all open orders until we either fill the order or exhaust the open orders at that price point. Then we move down to the next price point and repeat.

func (ob *OrderBook) FillSell(o *Order) {
    for ob.bid >= o.price && o.amount > 0 {
        pp := ob.prices[ob.bid]
        ppOrderHead := pp.orderHead
        for ppOrderHead != nil {
            ob.fill(o, ppOrderHead)
            ppOrderHead = ppOrderHead.next
            pp.orderHead = ppOrderHead
        }
        ob.bid--
    }
}

func (ob *OrderBook) fill(o, ppOrderHead *Order) {
    if ppOrderHead.amount >= o.amount {
        ob.actions <- NewFilledAction(o, ppOrderHead)
        ppOrderHead.amount -= o.amount
        o.amount = 0
        o.status = OS_FILLED
        return
    } else {
        if ppOrderHead.amount > 0 {
            ob.actions <- NewPartialFilledAction(o, ppOrderHead)
            o.amount -= ppOrderHead.amount
            o.status = OS_PARTIAL
            ppOrderHead.amount = 0
        }
    }
}

Buys happen in the exact same way, except starting at the maximum ask and working our way up.

It's interesting to note bid and ask here don't necessarily indicate where open orders are, but rather are starting points for examining potential order pairs. This is what lets us do cancellations without having to manipulate a price point's list of orders. While attempting to fill an order we just ignore orders remaining in the book with an amount of 0.

The performance of fills depends on how sparse the order book is at the time. In the worst case we'd need to iterate over every price point, however as the number of orders in the book increases the number of price points we need to examine approaches 1. All of the other operations are constant time so this can also be done in O(1).

Performance

Testing performance of the system was done by pre-generating large numbers of random orders, varying the number of orders, mean and standard deviation of price, and maximum amount. The number of actions per second was then timed under different configurations.

func TestPerf(t *testing.T) {
    // nOrders, priceMean, priceStd, maxAmount
    doPerfTest(10000, 5000, 10, 50)
    doPerfTest(10000, 5000, 1000, 5000)
    doPerfTest(100000, 5000, 10, 50)
    doPerfTest(100000, 5000, 1000, 5000)
    doPerfTest(1000000, 5000, 10, 50)
    doPerfTest(1000000, 5000, 1000, 5000)
}

$ go test
Handled 16533 actions in 14.846866ms at 1113568 actions/second.
Handled 16597 actions in 17.971637ms at 923510 actions/second.
Handled 165584 actions in 78.475638ms at 2110005 actions/second.
Handled 165867 actions in 465.881034ms at 356028 actions/second.
Handled 1656289 actions in 1.406557785s at 1177547 actions/second.
Handled 1658457 actions in 2.548383913s at 650787 actions/second.
PASS
ok  	_/Users/bhomnick/Projects/bexchange	9.069s

Increasing the sparsity of the order book by widening the standard deviation and range of possible prices as expected has a significant effect on the number of actions that can be handled, however even in non-optimal cases we're still looking at more than 350k actions per second on a 2015 Mac. For reference NASDAQ handles over 60k messages per second.

This is of course a minimum feature set and lacking any durability or persistence, where I'd expect bottlenecks to occur.

Next steps

While a fun exercise, this order book isn't much use without an interface for trading clients. The next step would be to set up some kind of interface to serialize order book writes and reads and allow clients to manipulate orders. A web interface and WebSocket server might be a good follow-up project.

The other item I mentioned earlier was durability. Writing actions ahead to disk with something like Kafka would allow rebuilding of the order book state in the event of a crash. This would also be a good place to decouple the trade settlement process and reporting from the exchange engine. The original project was done with this vaguely in mind by supporting pluggable action handlers. If you're curious check out the code on GitHub.