Implementing MsgPack.rkt, part 3
published:
categories: open-source
In the previous article we have seen how to pack an object, this time we will see how to unpack it again on the receiving end.
This blog post is one of a multipart series. You can find the rest of the series under the following links:
Overview
Just as with the unpacking code we will create a new module and provide only
one procedure: the unpack
procedure, which accepts one input port as its
argument and returns one unpacked object. A port in a Lisp language is similar
to a file handle in other languages, though it is not necessarily limited to
files only.
(provide unpack)
(: unpack (-> Input-Port Packable))
(define (unpack in)
...)
This is again a simplified view, the actual implementation provides three
procedures: unpack-from
, unpack
and unpack/rest
. These give more
fine-grained control over the source of packed data and how many objects to
unpack. The principle is all the same though, so we will stick with the simple
model.
Unpacking a dynamically typed object
Before we read our fist byte we do not know the type of the packed objects, so
we again use a procedure which returns an instance of the Packable
type
introduced in the previous article.
The first byte of the packed data is special: it is a tag which allows us to know the type of the packed object. Our strategy is to first read one byte and then dispatch on it to a type-specific unpacking procedure based on the numeric value of the tag.
(: unpack (-> Input-Port Packable))
(define (unpack in)
(define tag (read-byte in))
(cond
((< tag #x80) tag)
((< tag #x90) (unpack-fixmap tag in))
((< tag #xA0) (unpack-fixarray tag in))
((< tag #xC0) (unpack-fixstr tag in))
((= tag #xC0) (void))
((= tag #xC1) (error "Unused tag value"))
((= tag #xC2) #f)
((= tag #xC3) #t)
((= tag #xC4) (unpack-bin8 in))
((= tag #xC5) (unpack-bin16 in))
...
((= tag #xCC) (unpack-uint8 in))
((= tag #xCD) (unpack-uint16 in))
((= tag #xCE) (unpack-uint32 in))
((= tag #xCF) (unpack-uint64 in))
...
((= tag #xDC) (unpack-array16 in))
...))
There are a couple of things to note: For small values (less than 0x80
) the
tag is exactly the integer number, so we can return the tag as the result. The
tag values 0xC0
, 0xC2
and 0xC3
correspond to constant values (#<void>
,
#f
and #t
), which we can also return directly.
The types with fix
in their name are collections for which the tag also
contains information on how many items there are contained. This is a little
hack in MessagePack: if the number of items is small there is no need to waste
memory on the length, instead the length it contained inside the tag and can be
extracted through bit-fiddling.
Finally there are types with fixed tags. For example, the tag 0xCC
will
always correspond to an unsigned 8-bit integer, and the tag 0xC4
will always
correspond to a byte string whose length is an (unsigned) 8-bit integer.
Type-specific unpacking
We will now have an in-depth look into the unpacking procedures for a selection of types. Sometimes unpacking one object will also involve unpacking other objects in the process.
Integers
Integers can be signed or unsigned and they can be up to 64 bytes large.
Depending on the type of integer we call the more general unpack-integer
with
different arguments
(: unpack-uint8 (-> Input-Port Integer))
(define (unpack-uint8 in)
(unpack-integer 1 #f)) ; One byte, not signed
(: unpack-uint16 (-> Input-Port Integer))
(define (unpack-uint16 in)
(unpack-integer 2 #f)) ; Two bytes, not signed
(: unpack-int8 (-> Input-Port Integer))
(define (unpack-uint8 in)
(unpack-integer 1 #t)) ; One byte, signed
(: unpack-int16 (-> Input-Port Integer))
(define (unpack-uint16 in)
(unpack-integer 2 #t)) ; Two bytes, signed
We will let Racket's integer-bytes->integer
procedure handle the details; its
arguments are a byte string of raw bytes, whether the integer is signed, and
whether the endianness is big or not (in our case it always is).
(: unpack-integer (-> Integer Boolean Input-Port Integer))
(define (unpack-integer size signed? in)
(define raw-bytes (read-bytes size in))
(integer-bytes->integer raw-bytes signed? #t))
Arrays
An array is an ordered sequence of objects. I have chosen to represent arrays as vectors in Racket; a vector is an ordered fixed-length sequence of objects with fixed-time access, so it is similar to C arrays, C++ vectors, or Python lists, rather than the usual linked lists used in Lisp.
There are two ways to pack an array: as a fixarray and as a regular array. In the case of a fixed array the size has to be retrieved from the tag value by bitwise operations. In the case of a regular array the size is a packed integer and has to be unpacked first.
(: unpack-fixarray (-> Integer Input-Port (Vectorof Packable)))
(define (unpack-fixarray tag in)
(define size
(bitwise-and tag #b00001111)) ; Bitwise 'tag & 00001111'
(unpack-array size in))
(: unpack-array16 (-> Input-Port (Vectorof Packable)))
(define (unpack-array16 in)
(define size
(unpack uint16 in))
(unpack-array size in))
Once we know the number of objects we can start recursively unpacking them. MessagePack allows nesting, so our array might contain other arrays and so on.
(: unpack-array (-> Integer Input-Port (Vectorof Packable)))
(define (unpack-array size in)
(for/vector : (Vectorof Packable) #:length size ([i (in-range size)])
(unpack in)))
This requires a bit of explanation. The for/vector
expression is a vector
comprehension, it loops over something and for each iteration it adds an item
to the vector. Reading from left to right: : (Vectorof Packable)
is the type
of the result (a vector of packable objects), #:length size
says that our
vector will contain size
objects and ([(i (in-range size))])
says that we
will iterate over the range of numbers from zero (inclusive) to size
(exclusive) using the variable i
to hold the current value. At every
iteration step the object to insert into the array is the result of (unpack in)
.
The for/vector
form is a particularity of Racket, in a more mainstream
language like Python we would have written a for-loop instead.
def unpack_array(size, in):
result = [None] * size # Create the list of 'size' items
for i in range(size):
result[i] = unpack(in)
return result
As an aside, in the case of Python specifically there is also list comprehensions, which correspond to Racket's vector comprehension, so if you wanted to write really pythonic code you would have defined the function using a comprehension. I am just trying to make a general point here.
def unpack_array(size, in): # Using a comprehension instead of a loop
return [unpack(in) for i in range(size)]
Binary strings
To unpack a binary string we first need to unpack the integer which specifies the length of the byte string (in bytes).
(: unpack-bin8 (-> Input-Port Bytes))
(define (unpack-bin8 in)
(define size (unpack-uint8 in))
(unpack-bin size in))
(: unpack-bin16 (-> Input-Port Bytes))
(define (unpack-bin16 in)
(define size (unpack-uint16 in))
(unpack-bin size in))
All that is left when we have the size is to read that many bytes from the input port.
(: unpack-bin (-> Integer Input-Port Bytes))
(define (unpack-bytes size in)
(read-bytes size in))
Text strings work similarly, except that we also must convert the bytes read to a Unicode string using the UTF-8 encoding.
Conclusion
We first read one byte from the input port, this byte is the tag and it tells us how to proceed. In a few cases the tag represents a constant value which we can return directly, but most of the time we need to dynamically dispatch to a type-specific function.
Collection types (like arrays, hash maps or strings) contain a certain number
of items, so we first need to know that number. If the type has fix
in its
name the count is contained inside the tag and we need to mask the tag byte.
Otherwise the count is contained as a packed integer among the raw bytes, so we
need to unpack it first. When have to count we can recursively unpack the
individual objects and collect them.
This also goes to show why the MessagePack format is so simple. In JSON given a
list like [1, 2, 3, 4, 5]
we would have to read through the entire list first
before we can tell how many items the list contains. In MessagePack on the
other hand the length of the list follows immediately after the tag, that way
we can initialise a large enough vector before we begin unpacking the
individual items. There is no need to ever move backwards, making the format
very well suited to byte streams.