How Twitter processes 400 billion events daily
Twitter's event processing pipelines: the old Lambda architecture and the improvements made with the new Kappa architecture.
Hi Folks,
Do you struggle to understand the intricacies of distributed systems, databases, and programming languages? Are you curious to learn how companies design and build software at scale?
My free newsletter breaks down complex tech concepts, architectural trade-offs, and design decisions into clear, actionable steps. Get weekly articles to level up your tech skills and become a better engineer.
If you haven’t subscribed to my newsletter yet, do subscribe :-
X.com (formerly Twitter) receives more than 400 billion events daily. These events include tweet views, likes, retweets, engagements, etc. They are critical because the ad-serving systems consume them and they directly determine the company's revenue.
Several components work behind the scenes to process these events. The events are eventually aggregated and stored in a data store, and multiple downstream consumers rely on this data store.
Scale, performance, and accuracy are three important pillars of an event processing system. Twitter's old Lambda architecture faced multiple challenges in handling the exponential growth in events. Hence, they adopted a new Kappa architecture and improved accuracy, scalability, and performance.
In this article, we will dive into Twitter’s architecture that processes 400 billion events daily. We will look at the old Lambda architecture, why it couldn’t scale & how the new architecture addressed the bottlenecks.
This article is based on a blog post from Twitter written in 2021. Twitter would have made multiple changes to the architecture since then. Hence, the article doesn't cover the latest state or its evolution over the last three years.
Lambda architecture
The Lambda architecture consisted of batch-processing and real-time pipelines. These two pipelines stored the data in different data stores.
Different consumer services query the Query Service to fetch the aggregated data. The Query Service merges the computed data from the two stores and returns it to the consumer services.
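Conceptually, the serving layer of a Lambda architecture merges the batch view with the real-time view at query time. The sketch below is a minimal illustration of that idea, not Twitter's actual Query Service; the lookups are hypothetical placeholders.

```python
# A minimal conceptual sketch of a Lambda-style serving layer, NOT Twitter's
# actual Query Service. The two lookups are hypothetical placeholders for the
# batch store (Manhattan) and the real-time store (Nighthawk).

def get_batch_count(key: str) -> int:
    """Placeholder: read the pre-computed batch aggregate for `key`."""
    return 0  # e.g. a lookup against the batch data store

def get_realtime_count(key: str) -> int:
    """Placeholder: read the streaming aggregate for `key`."""
    return 0  # e.g. a lookup against the real-time cache

def query_service(key: str) -> int:
    # The serving layer merges both views: the batch view covers older data,
    # while the real-time view covers events the batch job hasn't processed yet.
    return get_batch_count(key) + get_realtime_count(key)

print(query_service("tweet_view_count:12345"))  # hypothetical query key
```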
We will go over the different components comprising the architecture :-
Batch processing
Data sources - The raw data exists in the form of logs stored in HDFS. These logs contain event data for tweet views, clicks, timelines, etc.
Scalding - The Scalding platform enables developers to write batch-processing data pipelines in Scala. Scalding pipelines read the raw data and then perform different types of aggregations; a rough sketch of this style of aggregation follows below.
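Scalding itself is a Scala library, so the snippet below is only a rough PySpark analogue of the same kind of batch aggregation over raw event logs. The HDFS paths and the log layout are assumptions for illustration, not Twitter's actual pipeline.

```python
# A rough PySpark analogue of a Scalding-style batch aggregation (illustrative
# only). The HDFS paths and the tab-separated log layout
# (event_type, tweet_id, ...) are assumptions, not Twitter's actual formats.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweet-view-aggregation").getOrCreate()

# Read the raw event logs from HDFS.
logs = spark.sparkContext.textFile("hdfs:///logs/tweet_events/2021-01-01/*")

view_counts = (
    logs
    .map(lambda line: line.split("\t"))      # parse each raw log line
    .filter(lambda f: f[0] == "tweet_view")  # keep only view events
    .map(lambda f: (f[1], 1))                # key by tweet_id
    .reduceByKey(lambda a, b: a + b)         # aggregate view counts per tweet
)

# Persist the batch aggregates so a serving store (e.g. Manhattan) can load them.
view_counts.saveAsTextFile("hdfs:///aggregates/tweet_views/2021-01-01")
```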
Stream processing
Data sources - The streaming data is written to different Kafka topics.
Heron - Heron, Twitter's stream processing engine, reads and processes the streaming data from the Kafka topics; a simplified consumer sketch follows below.
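To make the streaming side concrete, here is a deliberately simplified Python sketch of consuming and aggregating events from Kafka. This is not Heron's topology API; it only shows the kind of work a Heron topology performs, and the topic name, brokers, and event shape are assumptions.

```python
# A simplified sketch of streaming aggregation from Kafka using the
# kafka-python client. NOT Heron's topology API; the topic name, brokers,
# and JSON event shape are assumptions.
import json
from collections import Counter

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "tweet-engagement-events",                 # hypothetical topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

counts = Counter()  # in-memory aggregate; Heron shards this work across bolts

for message in consumer:
    event = message.value
    if event.get("type") == "tweet_view":
        counts[event["tweet_id"]] += 1         # running view count per tweet
```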
Data stores
Manhattan - The computed batch data is stored in Manhattan, a distributed database.
Nighthawk - It is Twitter’s distributed cache used for storing the processed streaming data.
Challenges with Lambda architecture
Although the above architecture was able to support the scale, it ran into several limitations. Following were the challenges with the old architecture :-
Performance - The system's performance depended on Heron's throughput. Heron nodes were impacted by high garbage-collection costs, and slow Heron consumers increased the overall processing latency.
Accuracy - The system didn't guarantee data durability, and Heron node restarts could result in data loss, which hurt data accuracy. Additionally, it didn't process late or out-of-order events.
Complexity - Two separate pipelines added to the architectural complexity and increased maintenance costs.
Operational overhead - Developers often had to intervene to restart Heron nodes to speed up real-time data processing.
Kappa architecture
To address these challenges, Twitter decided to adopt the Kappa architecture, which uses stream processing pipelines only. They also leveraged multiple Google Cloud services for data processing.
The architecture consists of two parts -
Twitter data center - It hosts components and services within Twitter.
Google Cloud - It consists of all the Cloud services used for stream processing.
Twitter Data Center
Event Mappers - These were part of the preprocessing pipelines. They transformed and re-mapped the event fields and sent the events to Kafka topics.
Event Processors - They converted events from the internal data model to the Google Pub/Sub format and streamed them with at-least-once semantics (a publishing sketch follows this list).
Twitter LDC Query Service - This service served the queries from different consumers by fetching the data from different cloud data stores.
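To make the Event Processor's role concrete, here is a minimal sketch of converting an internal event and publishing it to Cloud Pub/Sub with the Python client, waiting on the publish future to get at-least-once behaviour. The project, topic, and field names are assumptions, not Twitter's actual schema.

```python
# A minimal sketch of how an Event Processor might convert an internal event
# and publish it to Cloud Pub/Sub using the Python client. The project, topic,
# and field names are assumptions, not Twitter's actual schema.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "tweet-events")  # hypothetical

def to_pubsub_format(internal_event: dict) -> bytes:
    # Convert the internal representation into the wire format used downstream.
    return json.dumps({
        "event_id": internal_event["id"],        # used later for deduplication
        "tweet_id": internal_event["tweet_id"],
        "type": internal_event["event_type"],
        "ts": internal_event["timestamp_ms"],
    }).encode("utf-8")

def publish(internal_event: dict) -> None:
    future = publisher.publish(topic_path, data=to_pubsub_format(internal_event))
    # Blocking on the publish future (and retrying the event on failure) gives
    # at-least-once behaviour: an event may be re-sent, but never silently dropped.
    future.result(timeout=30)
```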
Google Cloud
Cloud Pub/Sub - Pub/Sub relays the events pushed by Twitter's Event Processors to Dataflow and guarantees no data loss along the way.
Dataflow - Dataflow workers deduplicate (remove duplicates) and aggregate the events (see the pipeline sketch after this list).
Bigtable/BigQuery - The aggregated count for each query key is written to BigQuery and Bigtable.
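Below is a hedged Apache Beam (Python SDK) sketch of the kind of Dataflow job described above: read from Pub/Sub, deduplicate on an event id, aggregate counts per key in fixed windows, and write the results to BigQuery. The topic, table, schema, and field names are assumptions, and the deduplication shown is a simplified stand-in, not Twitter's actual pipeline code.

```python
# A hedged Beam sketch of a Dataflow-style job: Pub/Sub -> dedup -> windowed
# aggregation -> BigQuery. Topic, table, and field names are assumptions.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)  # plus --runner=DataflowRunner etc. in practice

with beam.Pipeline(options=options) as p:
    (
        p
        # Read the raw messages that the Event Processors published to Pub/Sub.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-gcp-project/topics/tweet-events")  # hypothetical topic
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "KeepViews" >> beam.Filter(lambda e: e["type"] == "tweet_view")
        # Windowing is required before grouping in a streaming pipeline.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))   # 1-minute windows
        # Deduplicate: group by event_id and keep a single copy per id.
        | "KeyByEventId" >> beam.Map(lambda e: (e["event_id"], e))
        | "GroupById" >> beam.GroupByKey()
        | "TakeOnePerId" >> beam.Map(lambda kv: next(iter(kv[1])))
        # Aggregate: count views per tweet within each window.
        | "KeyByTweet" >> beam.Map(lambda e: (e["tweet_id"], 1))
        | "CountPerTweet" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"tweet_id": kv[0], "view_count": kv[1]})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-gcp-project:analytics.tweet_view_counts",   # hypothetical table
            schema="tweet_id:STRING,view_count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```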
Kappa’s performance
The new architecture eliminated the cost of building and maintaining a separate batch-processing pipeline. Its guarantee of no data loss ensured high accuracy.
Following were some key improvements observed in the Kappa architecture :-
Events processed - More than 4 million events per second.
Throughput - More than 1 GB/sec, a 10x improvement over the Lambda architecture.
Latency - Approximately 10 seconds, a significant improvement over Lambda's latency of 10 seconds to 10 minutes.
Besides this, Kappa added support for late-event handling and guaranteed no data loss, which improved the overall accuracy of the system.
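The late-event support comes from Dataflow's event-time windowing model. As a hedged illustration, the Beam snippet below shows how a windowing step can accept late-arriving data via allowed lateness and a late trigger; the window size and lateness values are illustrative, not Twitter's actual settings.

```python
# An illustrative Beam windowing step that still accepts late-arriving events.
# The 1-minute window and 10-minute allowed lateness are example values.
import apache_beam as beam
from apache_beam.transforms import window, trigger

def window_with_late_events(events):
    """Window a PCollection of events while still accepting late arrivals."""
    return events | "WindowWithLateness" >> beam.WindowInto(
        window.FixedWindows(60),                      # 1-minute event-time windows
        trigger=trigger.AfterWatermark(
            late=trigger.AfterCount(1)),              # re-fire for each late element
        allowed_lateness=10 * 60,                     # accept data up to 10 minutes late
        accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
```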
Conclusion
Twitter's original event processing system was based on the Lambda architecture, which combined batch-processing and stream-processing pipelines. The Lambda architecture had several limitations around performance, accuracy, complexity, and operational overhead.
To overcome these limitations, Twitter adopted the Kappa architecture, which relies on real-time streaming of events only.
Event pre-processing and transformation were handled by services in Twitter's data center, while Google Cloud services were leveraged for event aggregation and storage.
Here is how the Kappa architecture addressed Lambda architecture’s limitations :-
Performance - It reduced the overall latency from ~10 sec-10 min to ~10 sec.
Accuracy - Protection against data loss, at-least-once delivery semantics, event deduplication, and processing of delayed events improved the accuracy.
Complexity - In contrast to the Lambda architecture, Kappa required only a real-time processing pipeline, which simplified the architecture.
Operational overhead - It eliminated the need for developer intervention to restart components (Heron nodes) and fix issues, improving developer productivity.
Before you go: