Project Voldemort (Part II): How it works

Jay Kreps

Co-founder and CEO at Confluent

April 1, 2009

Code Alert! This is a part of our continuing series on Engineering at LinkedIn. If this isn’t your cup of Java, check back tomorrow for regular LinkedIn programming. In the meanwhile, check out some of our recent announcements, tips and tricks, or success stories.

In my last blog entry I described what LinkedIn is doing with our open source key-value storage system Project Voldemort. In this entry I will talk about what how Voldemort works, and what features we will be adding to it.

With Voldemort we hope to scale both the amount of data we can store and the number of requests for that data. Naturally the only way to do this is to spread both the load and the data across many servers. But spreading across servers creates two key problems:

1. You must find a way to split the data up the data so that no one server has to store everything
2. You must find a way to handle server failures without interrupting service

Scaling

The first point is fairly obvious—if you want to handle more requests you need more machines, if you want to handle more data, you need more disks and memory (and servers to hold the disks and memory). But there are still a number of subtleties.

Any system that doesn’t maintain local state, can easily be scaled by just making more copies of it and using a hardware load balance to randomly distribute requests over the machines. Since the whole point of a storage system is to store things, this becomes somewhat more difficult: if we randomly distribute writes then the data will be different on each machine, if we write to every machine then we will potentially have dozens of machines to update on each write.

In order to effectively use all the machines, the data in Voldemort is split-up amongst the servers in such a way that each item is stored on multiple machines (the user specifies how many). This means that you have to first figure out which is the correct server to use. This partitioning is done via a consistent hashing mechanism that let’s any server calculate the location of data without doing any expensive look ups.

This kind of partitioning is commonly done to improve the performance of write requests (since without it, every single server would have to be updated every time you did a write). What is not commonly understood is that this is also required to improve read performance. Memory access is thousands of times faster than disk access, so the ratio of memory to data is the critical factor for accessing the performance of a storage system. By partitioning the data you increase this ratio by shrinking the data on each machine. Another way to think of this is as improving cache locality—if requests are randomly balanced over all machines then “hot” items end up in cache on all servers and the hit ratio is fairly low, by partitioning the storage among machines the cache hit ratio dramatically improves.

Detecting Failure

To handle this problem any distributed system must do some kind of failure detection. Typically this is done by some kind of heart-beat mechanism—each server pings some master co-ordination nodes (or each other) to say “Hi, I am still alive!” In the simple case if a node fails to ping for some time then it is assumed to be dead.

But this raises a few problems, first there aren’t any master nodes in the Voldemort cluster, each node is a peer—so what if one server gets a ping and another does not? Then the servers will have a differing view of who is and is not alive. In fact, maintaining the state about who is alive is the exact same distributed state management problem we were trying to solve in the first place.

The second problem is a bit more existential: what does it mean to be alive? Indeed, just because a server is alive enough to say “hi!” or “ping!” doesn’t mean you are alive enough to correctly service requests with low latency. One solution is to increase the complexity of the ping message to include a variety of metrics on the server’s performance, and then make the prediction as to whether that server is alive or not. But what we do is much simpler. Since Voldemort only has a few types of request (PUT, GET, DELETE, etc.) and since each server is getting hundreds of these requests per second, why invent a new ping request to detect liveness? Instead, since each of these requests has similar performance, it makes sense to simply set an SLA (service level agreement) for the requests and ban servers who cannot meet their SLA (this could be because they are down, because requests are timing out, or many other reasons). Servers that violate this SLA get banned for a short period of time, after which we attempt to restore them (which may lead the them getting banned again).

This is a fairly simple mechanism for the user of Voldemort to use, since they may have their own SLA they need to maintain (i.e. serve 99% of the pages in less than 100ms or something like that). The simplicity of the query model actually becomes something of an advantage in this kind of performance analysis. The three Voldemort queries have known performance, so it is very easy to predict the load a new feature will generate by just counting the number of requests. This is always a challenge with SQL: poorly designed SQL queries may produce thousands of times more load. Compounding this problem, distinguishing the bad queries from the good requires knowing both the index structure and the data on which it will run—neither of which is present in your code—so it easy for an efficiency to slip past even a diligent review if you don’t perform real tests on real data for each modification to see what query plan will be generated.

Dealing With Failure

The redundancy of storage makes the system more resilient to server failure. Since each value is stored N times, you can tolerate as many as N - 1 machine failures without data loss. This causes other problems, though. Since each value is stored in multiple places it is possible that one of these servers will not get updated (say because it is crashed when the update occurs). To help solve this problem Voldemort uses a data versioning mechanism called Vector Clocks that are common in distributed programming. This is an idea we took from Amazon’s Dynamo system. This data versioning allows the servers to detect stale data when it is read and repair it.

The first advantage of this mechanism is that it does not require a consistent view of which servers are working and which are not. If server A cannot get to server C, but server B can get to C that will not break the versioning system. This kind of failure can be especially common in the case of transient failures.

Another advantage relates to challenges faced with expanding data centers. Requests between data centers that are physically remote have much higher latency then requests within a data center (10 or 100x slower depending on geography). With most storage systems it isn't possible to take concurrent writes across multiple data centers without risk of losing data, since if two updates occur at once, one in each data center, there is no principled way for the storage system to choose between them. By versioning the data you can allow the results to be resolved or the conflict to be given back to the application for resolution.

Picking A Name

I should probably also mention how it got its name, since that is something I got a lot of questions about. I wanted to come up with a name that was distinctive and a little self-deprecating (projects shouldn't take themselves too seriously). At the time I was reading the last Harry Potter book, and Voldemort had split himself into many pieces each of which had to be destroyed to kill him. I thought, "that sounds like a distributed system". I don't know whether it is nerdier to be reading Harry Potter or to be wondering what kind of consistency protocol Voldemort uses when keeping all his pieces up-to-date, but regardless, the name stuck.

For more information on the project, check out the web site and mailing list. Patches, bug reports, and suggestions are gladly accepted.

Quick Update: Interested in similar projects, check out the job openings at LinkedIn to work on these challenges full time.