Open-sourcing Kafka, LinkedIn's distributed message queue

January 11, 2011

Code Alert! This is a part of our continuing series on Engineering at LinkedIn. If this isn’t your cup of Java, check back tomorrow for regular LinkedIn programming. In the meanwhile, check out some of our latest product features, tips and tricks, or user stories. - Ed.

We are pleased to open-source another piece of infrastructure software developed at LinkedIn, Kafka, a persistent, efficient, distributed message queue. Kafka is primarily intended for tracking various activity events generated on LinkedIn's website, such as pageviews, keywords typed in a search query, ads presented, etc. Those activity events are critical for monitoring user engagement as well as improving relevancy in various other products. Each day, a substantially large number of such events are generated. Therefore, we need a solution that's scalable and incurs low overhead.

We first looked at several existing queuing solutions in the market. The most popular ones are based on JMS. Although JMS offers a rich set of features, it also adds significant overhead in the message representation. Additionally, some JMS implementations are optimized for the case when all messages can be cached in memory and their performance starts to degrade significantly when the in-memory buffer is saturated. Finally, most existing solutions don't have a clean design for scaling out.

That's why we decided to build Kafka. In summary, Kafka has the following three design principles: (1) a very simple API for both producers and consumers; (2) low overhead in network transferring as well as on-disk storage; (3) a scaled out architecture from the beginning. More details on the technical aspect of Kafka can be found here.

Today, Kafka has been used in production at LinkedIn for a number of projects. There are both offline and online usage. In the offline case, we use Kafka to feed all activity events to our data warehouse and Hadoop, from which we then run various batch analysis. In the online case, a service will consume events in real time. For example, in LinkedIn Signal, Kafka is used to deliver all network updates to our search engine. Typically , an update becomes searchable within a few seconds after it is posted. The design of Kafka allows us to use a single infrastructure to support events delivery for both cases.

We feel that Kafka can be very useful in many places outside of LinkedIn. By open sourcing it, we hope to work with people in the community to keep improving Kafka in the future. We welcome comments and suggestions from everybody.