LinkedIn has open-sourced another piece of its infrastructure software, “Kafka,” a persistent, efficient, distributed message queue. Kafka is primarily intended for tracking various activity events generated on LinkedIn’s website, such as pageviews, keywords typed in a search query, ads presented, etc.
Those activity events are critical for monitoring user engagement as well as for improving relevance in various other products. Billions of such events are generated each day, so LinkedIn needed a solution that scales well and incurs low overhead.
“Kafka has been used in production at LinkedIn for a number of projects. There are both offline and online uses. In the offline case, we use Kafka to feed all activity events to our data warehouse and Hadoop, on which we then run various batch analyses. In the online case, a service will consume events in real time.
For example, in LinkedIn Signal, Kafka is used to deliver all network updates to our search engine. Typically, an update becomes searchable within a few seconds after it is posted. The design of Kafka allows us to use a single infrastructure to support event delivery for both cases.
We feel that Kafka can be very useful in many places outside of LinkedIn,” said LinkedIn.
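The "single infrastructure for both cases" point rests on Kafka's core model: a persistent, append-only log where each consumer tracks its own read position (offset), so a real-time service and a batch job can read the same stream independently. Below is a minimal in-memory sketch of that idea — toy code, not Kafka's actual API, with illustrative names:

```python
from collections import defaultdict

class Topic:
    """A toy append-only event log, loosely inspired by Kafka's model.

    Each consumer keeps its own offset into the log, so online and
    offline consumers can read the same events at different paces.
    """

    def __init__(self):
        self.log = []                    # append-only list of events
        self.offsets = defaultdict(int)  # per-consumer read position

    def publish(self, event):
        self.log.append(event)

    def poll(self, consumer):
        """Return all events this consumer has not yet seen."""
        start = self.offsets[consumer]
        events = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return events

topic = Topic()
topic.publish({"type": "pageview", "page": "/jobs"})
topic.publish({"type": "search", "query": "kafka"})

# An "online" consumer (e.g. a search indexer) reads events as they arrive...
online = topic.poll("search-indexer")   # sees the first two events

# ...while an "offline" consumer (e.g. a Hadoop ETL job) can read the same
# log later, from the beginning, without disturbing the online consumer.
topic.publish({"type": "ad_impression", "ad_id": 42})
offline = topic.poll("hadoop-etl")      # sees all three events
```

Because consumption only advances a per-consumer offset and never removes events from the log, adding a new consumer (or re-running a batch job) does not affect anyone else — which is what lets one pipeline serve both the warehouse and real-time services.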