B-trie

After I stumbled upon ipfs, I have been thinking a bit about how to implement an efficient data storage / database on top of this, especially targeting linked open data. This has resulted in the following data structures / algorithms, which I would be very curious to implement later, to see how it performs in practise.

Intended features:

  • Key-value-store
  • Distributed/decentral
  • Trustless (ie. intermediaries cannot fiddle with the data)
  • Ordered traversal
  • Snapshot support
  • Compact (I assume that disk/network is the slowest resource)
  • Scaleable

Essentially any B-tree like append-only data structure using ipfs-addresses as pointers to blocks would do the job.

For persistent/append-only data structures HAMT has shown to have good performance in memory. A Trie data structure includes prefix key compression reducing the size of the tree. B-trees is good for caching / out-of-memory data.

I propose storing data in a trie with a half-octet alphabet (a la HAMT), and path compression, – and combining subtrees into immutable blocks which can then be stored out-of-memory. The half-octet alphabet is choosen, as that yield much smaller nodes than octets, and thus better performance when using it as a persistent data structure, – plus it is easier to pack into balanced blocks.

Conceptually a node in the trie contains the following information:

  • The prefix
  • Either content/data (for leaf-nodes) or list of 1-16 child nodes.
  • Total number of leafs for this subtree

A node could live in two ways: 1) compact encoding within a block, backed to disk/network 2) non-distributed local memory representation. The following will focus on the compact encoding, – the local memory representation, is just a normal datastructure.

In practise there would be four kinds of trie nodes in a block:

  1. Branch-nodes, – between 2 and 16 children, based on on the current half-octet
  2. Path-node, – a sequence of half-octets without branches
  3. Leaf, – the suffix content/data for the key+value. For small values this is encoded in the block, for large data, a single leaf-node lives in a separate block.
  4. Reference to external node, with count of leaf-nodes in sub-tree

Every node has meta-information about it length (including children encoded in the block), – and node-type. This is an integer + 2 bits, which can be encoded with a variable number of bytes (number of bytes is encoded in 1 bit per byte). This will be 1-3bytes assuming a block size less than half a megabyte.

  1. Branch nodes are encoded with a 16bit bitmask, indicating which children it has, followed by meta-information for each child, and then the actual children. No pointers are needed, as the length of the children are known, and can just be skipped. Total node memory usage: 2 byte bitmask + 1-3 byte meta information.
  2. Path nodes contains a variable length encoded integer of the number of half-octets in the path, then the actual path data, and then followed by one child-node  with meta-info. A path node thus takes 1-2 byte length + actual data + 1-3 byte meta info.
  3. Leaf nodes just contains content: actual data excluding prefix + 1-? byte meta information (1-? as a leaf node might be longer than the usual block size). The content might also be compressed, with something similar to lz4, but using other leaf-nodes in block as context too.
  4. References contains variable integer encoded count of leaf-nodes in subtree, plus the actual reference – link to a ipfs content hash.

Notice that the following information each node is implicit:

  • The prefix of the node, is known through the access path to the node.
  • The number of leafs in subtree for every, – as the leafs within the block is known, and number of leafs for external reference in the block is known.
  • References to children of nodes, via known length and storage structure.

The conversion into block happens through traversal of the non-distributed in-memory representation, where the known number of leafs are used to choose most balanced block-tree.