A key-value data warehouse


Grand sale! Grand sale!

Welcome to Paul's crazy data warehouse.

Everything reduced!

So here we are in 2009. Disks got big, but they didn't get fast. We have so much memory now that it takes a titanic amount of data to even start hitting the disk. But when you hit that wall, oh boy do you feel it.

I happen to be working with sequencing data that's just brushing up against that wall. Annoying.

Scattered reads and writes are agonizingly slow now. It's like the era of tape all over again. Flash storage solves this for the read side of things, but you still want coherent writes.

So, I've written myself a key-value store with this in mind. It's the flavour of the month thing to do.

Basic concept:

Merge sort can be performed with serial reads and writes. We'll maintain a collections of sorted lists, progressively merging lists together in the background. Writing data is simply a matter of adding a new list to this mix. Reading is a matter of checking each list in turn, from latest to earliest, for the key you want.

Refining things a little:

For random access to a database, you want items that are accessed with about the same likelihood to be stored near each other (ie in the same disk block). This way, your memory cache mostly contains frequently accessed items.

So I store each list as a balanced binary tree, with each level of the tree stored in a separate file. The list merging operation only ever needs to append to the end of each of these files, so this is still coherent writing.

Levels of the tree nearer the root will be accessed more often, and should end up in the memory cache. In-order traversal of the list remains fairly nice, since each level remains in order.

This scheme is cache oblivious, it works well with any size disk block. Which is handy, because you don't know jack about disks.

There is some room for improvement here, perhaps a trie.

Some nice properties:

Writing is fast. Chuck it on disk, merge at your leisure. In parallel, because this is 2009 and CPUs have stopped getting faster but there are plenty of them.

Multi-Version Concurrency Control is trivially easy. With only a little work, it will be impossible for the database to become corrupted. And it's backup-friendly.

You don't need to look up a record in order to modify it. For example, if you want to append to a record or add to a counter, you can set things up so this is deferred until merge-time. Each key effectively becomes its own little reduce. For example, you can implement a very nice word counting system this way. Which is exactly what I need for my sequencing data.


My implementation is not terribly optimized, but it is written in Cython, and is at least within the ballpark of reasonable. You'll need a recent Cython, and for Python 2.5 you'll also need the "processing" library.

I batch up key-value pairs in an in-memory dict until I have one million of them, before sorting them and writing them to disk. Python hashtables are fast, it would be silly not to use them until we are forced not to. One million items seems to be about optimal. This is a long way from filling available memory... something to do with cache size perhaps?

The performance on my machine is about 10x faster for writing than bsddb, 2x faster for reading, but still about 10-20x slower than an in-memory Python dict.

It's far from polished. Just proof of concept for the moment.

As examples, version 0.2 contains:

Update 2/8/2009: This paper appears to be describing the same idea.

Update 5/8/2009: nesoni includes an updated version of treemaker.