# Implementing MsgPack.rkt, part 1

published:

categories: open-source

When I originally set out to write MsgPack.rkt, a Racket implementation of
the MessagePack protocol, I had a hard time wrapping my mind around where to
begin. I had no experience in writing a serialisation library, and reading the
source code of existing implementations only revealed the *what*, but not the
*why*. This is why I'm starting this short mini-series of blog posts to provide
a commentary on my implementation. I hope that it will serve other people who
are looking for a starting point to their own implementations.

I will use Racket as my language of choice, since Racket and Guile are the two languages I have contributed implementations for. I am using it only for illustrative purposes, every concept will be explained in prose to make it applicable to other languages as well. Racket can be optionally used with static typing, so I will be using that variant of the language.

I expect the series to span three parts. As it goes on I will add the corresponding links here.

## The motivation behind MessagePack

Suppose you have two processes running and you wish to exchange data between them. These processes can run on the same machine or on different machines, they can be written in radically different programming languages with very different execution models. Consider the following illustration:

```
Process 1 is sending data to Process 2 and receiving
data from Process 2
╭─────────────╮ Outgoing ╭─────────────╮
│ │ ├───────────────────────── │ │
│ Process 1 │ │ Process 2 │
│ │ ─────────────────────────┤ │ │
╰─────────────╯ Incoming ╰─────────────╯
```

We want to be certain that data can pass back and forth between these processes
without any loss or corruption. For instance, if we send over a multi-byte
integer, the numerical value has to be the same regardless of endianness. We
need an agreed-upon data format that can be sent over the wire. The sender
makes sure the data is *serialised* (or *packed* in MessagePack jargon) when
sending it out, and the receiver *deserialises* (*unpacks*) it to its native
format.

Different formats exist for different purpose. A popular format is JSON; it is easy for humans to write by hand, read and edit manually, but it is also somewhat verbose and harder to parse for a machine. The following is an example of what JSON looks like:

```
{"compact": true, "schema": 0}
```

MessagePack is similar in idea to JSON, but it is binary instead of text-based. This makes it unreadable to humans, but it requires less memory and parsing it is very easy for a machine. The equivalent of the above JSON data in MessagePack consists of the following bytes:

```
0x82 0xA7 compact 0xC3 A6 schema 0x00
```

I have written out the bytes that are ASCII characters for readability, but as far as the machine is concerned these are just regular bytes as well. The JSON code requires 27 bytes, while the MessagePack byte string only requires 18 bytes, and the savings only get better as the amount of data grows.

## Structuring a MessagePack library

The structure will of course depend on the particular language and which
libraries you choose to use, but the general outline is usually the same. We
will ignore the question of *how* data is exchanged, and only focus on how to
prepare data for the exchange.

The are two main tasks: packing and unpacking data for transport. Packing means that we take an object of a "packable" type (that is an object for which we know how to transmit it), and we convert it to a sequence of bytes. Depending on our programming language there might be different ways of packing an object, in which case we prefer the shortest one.

Unpacking an object works in reverse. We first read one byte to find out which type of object we are receiving, then we use that information to read the remaining bytes and return the unpacked object. Only after reading that first byte do we know how many bytes we actually need to read.

There are other details that need to be taken care of, such as defining appropriate data types, but these details are left as an exercise to the reader.

## Little bits of Racket

The articles should be understandable without knowing any Racket, but it never hurts to know at least the basics so that you can follow along with the code samples. Racket is a descendent of the Scheme programming language, which is a language in the Lisp family. Being descended from Scheme, Racket is a multi-paradigm language with a bias towards functional programming. We won't be making much use of functional techniques though.

### Lots of irritating superfluous parentheses

Lisp languages have an unusual notation: they use prefix notation with parentheses for grouping. Take for example the quadratic formula as it would be written in most programming languages:

```
(-b + sqrt(b * b - 4 * a * c)) / (2 * a)
```

This is the familiar infix notation we use every day. Notice that there is a
tree-structure in the expression: the outermost operation is the division
(`/`

), its left-hand argument is `-b + sqrt(b * b - 4 * a * c)`

and its
right-hand argument is `2 * a`

. We can further break up both of these into
tree-structures until we have turned the entire expression into a tree with
atomic objects for its leaves:

```
(/)─┬──(+)─┬──(-)───(b)
│ └──(sqrt)───(─)─┬─(*)─┬─(b)
│ │ └─(b)
│ └─(*)─┬─(4)
│ ├─(a)
│ └─(c)
└──(*)─┬──(2)
└──(a)
```

This is known as an *abstract syntax tree* (AST). If we use parentheses to
group items we can write the AST as one-dimensional plain text (the whitespace
is purely visual). This allows us to express our computation directly in terms
of the AST.

```
(/ (+ (- b)
(sqrt (- (* b b)
(* 4 a c))))
(* 2 a))
```

And with this you know all the syntax there is to Lisp! An s-expression is
either an atom (like `2`

, `b`

or `+`

) or a list of s-expressions (like `(* 4 a c)`

). The first item in the list is the function (or special operator) we want
to apply, the remaining items are its arguments. All that is now left is
knowing which functions and special operators exist; this depends on the
particular dialect of Lisp of course.

### Racket is a dialect of Lisp

Racket has the usual mathematical operators (`+`

, `-`

, `*`

, `/`

) and numbers
(integers, floating point and rationals). Variables are defined using the
`define`

special form:

```
(define a 3) ; Exact integer (comments are introduced with semicolon)
(define r 2/3) ; Exact fraction
(define π 3.14) ; Inexact floating-point number (also Unicode glyphs)
```

The special form `lambda`

(or its synonym `λ`

) defines functions. The first
argument to `λ`

is the list of function arguments, the remainder is the body of
the function:

```
(define f (λ (x) (* x x))) ; f = x ↦ x²
(define (f x) (* x x)) ; A shorthand form for the above
```

We will use the shorthand form. These few forms should suffice for now. As a
final note, let us consider conditional expressions. In most langauges `if`

is
used to control the flow of a program (if a condition is met, do this,
otherwise do that), but in Lisp `if`

is used to return a different value
depending on a condition:

```
(define (add-inverse x y)
;; Add 1/y to x, fall back to ∞ if y is zero
(+ x
(if (= y 0) ; Test for numerical equality
+inf.0 ; Return +∞ from the if
(/ 1 y)))) ; Return 1/y from the if
```

If there is more than two cases, the `cond`

special form can be used rather
than nesting multiple `if`

expressions.

```
(define (add-signum x y)
;; Add 0, 1 or -1, to x, depending on y
(+ x
(cond
((zero? y) 0)
((positive? y) 1)
(else -1))))
```

By the way, the question mark has no particular meaning, it is customary to
name predicates in such a way. A predicate is a function which returns either
truth (`#t`

) or falsity (`#f`

). Note that `=`

is technically also a predicate,
but it does not follow this convention.

Next time we will look at how to Pack objects.