Project Voldemort: Scaling Simple Storage at LinkedIn

Code Alert! This is a part of our continuing series on Engineering at LinkedIn. If this isn’t your cup of Java, check back tomorrow for regular LinkedIn programming. In the meanwhile, check out some of our recent announcements, tips and tricks, or success stories.

About a month ago LinkedIn released the code for an open source distributed storage system called Project Voldemort. I wanted to give a little more information about what it is good for, how it came to be, and what our plans are for the future.

Some Background

Like a lot of websites, LinkedIn started with a single big database and a cluster of front-end servers (unlike a lot of websites it also started out with a big social graph in memory on remote machines, but that is a different story). As we grew, this database got split into a variety of remote services for serving up profiles, performing searches, interacting with groups, maintaining network updates, fetching companies, etc. These databases may have read-only replicas, but we didn’t have a system for scaling writes.

Unfortunately for engineers and DBAs, many of the rich features that people expect from a modern internet site either require massive data sets or high write loads, or both. This became a problem as we looked at how to scale some write-intensive features like Who’s Viewed My Profile that require as many updates as reads. We faced a similar scale problem for offline computed data, such as finding similar profiles—the set of all user profiles is very large, but even a modest subset of the set of all user profile pairs is quite huge.

To handle this problem we looked at the systems other internet companies had built. We really like Google’s Bigtable, but we didn’t think it made sense to try to build it if you didn’t have access to a low-latency GFS implementation. Our primary goal was to get low-latency, high-availability access to our data. For complex analysis we had Hadoop and databases, for complex queries we had a distributed search system, and the goal wasn’t to try to duplicate any of these. We were inspired by Amazon’s Dynamo paper, which seemed to meet the needs we have as well as being feasible to implement with low-latency queries–much of our design for Project Voldemort comes from that.

Our experience with the system so far has been quite good. We were able to move applications that needed to handle hundreds of millions of reads and writes per day from over 400ms to under 10ms while simultaneously increasing the amount of data we store.

Open Source

LinkedIn is a big open source user, and we have contributed back a number of the improvements to Lucene we have made such as Zoie, Kamikaze, and Bobo. Most of the things we build are pretty LinkedIn-specific, but things like search and storage are pretty much stand-alone and we are happy to get other users (and contributors!). I myself have been a long-time open source lurker—I am the first to check out the source, but rarely have the time to make any improvements. Fortunately, even if most people are as lazy as me, not all are. In the last few months we have got close to 50 contributions from people around the world. Some are small, just doing a little cleanup, and others have been quite substantial introducing new features or major code improvements.

The long-term success of an open source project depends on its not being controlled by a single company, person, group, but forming a real self-sustaining group of interested developers. This is our goal in working on the open source project. LinkedIn is not a storage systems company, and neither are the other web companies facing some of the same problems, so we think we think we can all benefit by sharing our work in this area.

The Future

So what is next for the project? The most important feature for a storage system is always improving performance and reliability. But there are a couple of other things in the works. We are working on making it easier to incrementally add to clusters of servers, improving our support for batch computed data from Hadoop, and implementing some clients in other programming languages.

For more information on the project, check out the main site. We are always looking for contributors to the project, so if you are interested check out the projects page and mailing list. Ideas, bug reports, patches, etc. are all gladly accepted.

Interested in similar projects, check out the job openings at LinkedIn to work on it full time.

Stay tuned for upcoming blog posts that will reveal more details of the system internals and some of the lessons learned

Tags: , ,

Share: Email | LinkedIn | Digg | Twitter

trackback

http://blog.linkedin.com/2009/03/20/project-voldemort-scaling-simple-storage-at-linkedin/trackback/

comments

  1. [...] saw this statement on succesful open source projects by Jay Kreps on the Linkedin blog: The long-term success of an open source project depends on its [...]

  2. [...] Project Voldemort: Scaling Simple Storage at LinkedIn Code Alert! This is a part of our continuing series on Engineering at LinkedIn. If this isn’t your cup of Java, check [...] [...]

  3. I have taken a look at the source and like what I see in Voldemort. After looking at several opensource KV stores it looks like Voldemort has the right shared nothing distributed approach. I’m very excited about putting it to the test.

    Have you guys heard of Redis? It supports lists and sets with atomic operations to push/pop elements. This definitely simplifies data modeling. They have a Twitter clone example that is really compelling (given lines of codes.) I guess it’s no suprise really. If we are going to embrace the humble hashmap, we have to give lists and sets love. They also have a first class PHP client. It would be great to see those additions in Voldemort.

    I get understand feature creep, trust me, but I think we have basically stumbled onto the need for partitionable, persistent, base data structures. Once we acknowledge that fact “fully” then things start to get really interesting.

    I’ll be looking for someplace in the codebase to sink my teeth into real soon. I’m glad one of the most promising solutions out there is written in Java and doesn’t require Hadoop! Looked through the different StorageEngines that were available. I think I could definitely contribute to a robust file store with in memory caching. Do you guys use Netbeans or Eclipse for development. A UI built using NetBeans 6.5 could be interesting.

    -Travell

  4. Hi Travell,

    I haven’t taken an in-depth look at Redis yet, though it looks quite nice. We do have a general optimistic locking solution that will work for lists, sets, or other data modifications. (See StoreClient.applyUpdate().)

    New clients in other languages is definitely a big priority. I have added a Protocol Buffers based protocol to aid in the creation of other clients. This is still fairly alpha but if you are interested in putting together a client might ease that process.

    I am using Eclipse for development, but not everyone is.

    Contributions are definitely welcome!

    -Jay

  5. [...] my last blog entry I described what LinkedIn is doing with our open source key-value storage system Project Voldemort. [...]

  6. need simple file storage for groups – like box.net.
    should be able to add applications to groups.

  7. [...] do movimento NoSQL: Yahoo! (Hadoop com HBase, Sherpa), Facebook e Digg (Cassandra), LinkedIn (Voldemort), Mixi (Facebook do Japão) (Tokyo Cabinet) e a Engine Yard [...]

  8. [...] is open sourcing lots of their tech in project Voldermort, but more is coming, including the reporting layer (if I understood [...]

  9. [...] Leveraging Apache software is only the start. LinkedIn actively contributes code, design and testing to many Apache projects. These efforts insure that these projects continue to grow and evolve to meet our future challenges. In addition to our contributions to Apache Shindig you’ll find LinkedIn active in the Apache Lucene community where we’ve developed a number of extensions to this powerful search technology. LinkedIn code provides faceted search via bobo-browse, real time indexing with zoie, and extra performance with the kamikaze search extension. We’ve also released our data storage solution, Voldemort with an Apache License.  (Read more about Voldemort here) [...]

post a comment

This is a moderated site and comments will appear if and when they are approved. We will review the queue several times daily, so please don't resubmit if your comment doesn't appear immediately.

Close
E-mail It
Powered by ShareThis