Variable byte encoding of unsigned integers.

Order can be preserved with variable byte integer encoding. Different implementations indicate optimisations in JavaScript.

A common way to encode unsigned integer is to have a sequence of bytes, with 7-bit data, and 1-bit length encoding per byte. The naive implementation usually does not preserve the integer order into the lexicographical order of the encoded bytes, but is possible.

Encoding:

bytes.append(number % 128)
number = number / 128
while number > 0:
number = number – 1
bytes.append(128 + (number % 128))
number = number / 128
bytes.reverse()

Decoding:

number = 0
while true:
byte = getNextByte()
number = number * 128 + (byte % 128)
if byte & 128:
number = number + 1
else:
break

The key elements to ensure that the lexicographical order of the binary strings preserve the order of the integers are:

  1. The binary string for an integer is unique by adding/subtracting 1 to the number. This both adds slightly larger rang, and guarantees we do not have the case where 1000000 00000001 and 00000001 would both decode to 1, – in contrast to some variable byte integer encodings.
  2. When being order-preserving, it is given that the most significant bit must be the one used for byte length encoding, and it must be 1 if there are more bytes to read and zero otherwise.

This is useful if you have a lexicographically ordered binary keys, and want to encode arbitrary unsigned integers where you do not know the size beforehand. The concrete case I was thinking about was efficient encoding of timestamps appended into a sorted data structure, without hitting the year 2038 bug.

I implemented a couple of versions in JavaScript with the following conclusions on v8:

  • Implementation using typed arrays are 10-30x faster than using arrays (when optimised correctly).
  • Optimising for only encoding 32bit integers, instead of up to 53 integer bits from a double, can yield about 30% performance.
  • You need to benchmark each version individually, as the JIT otherwise can make the results incomparable, (even if you warm it up a bit).
  • In JavaScript on V8 on a slow laptop, with above optimisations you can encode 1MB/s version, versus 40KB/s with the naïve version.
  • GIST available here: https://gist.github.com/rasmuserik/47ce8e20e88f8443a79d