Uber, the global transportation and delivery company, has scaled its Cassandra database service to tens of millions of queries per second and petabytes of data. The platform underpins the millions of rides and deliveries the company facilitates worldwide.
The Scale of Uber’s Cassandra Infrastructure
Uber’s Cassandra database-as-a-service platform has reached the following scale over its six-year evolution:
- Processes tens of millions of queries per second
- Manages petabytes of data
- Operates across tens of thousands of Cassandra nodes
- Supports thousands of unique keyspaces
- Maintains hundreds of unique Cassandra clusters, the largest with over 400 nodes
- Provides multi-region support
This scale wasn’t achieved overnight but through years of dedicated engineering efforts and problem-solving.
Architecture of Uber’s Cassandra Setup
Uber’s Cassandra ecosystem spans multiple regions, with data replicated between them. The company’s in-house stateful management system, Odin, handles the configuration and orchestration of thousands of clusters.
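In stock Cassandra, cross-region replication is usually declared per keyspace with NetworkTopologyStrategy, which takes a replica count for each datacenter. The sketch below only builds such a statement as a string. The helper name, the keyspace name, and the region names are hypothetical, and the source does not say which replication settings Uber actually uses.

```python
from typing import Dict


def create_keyspace_cql(keyspace: str, replicas_per_dc: Dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement that replicates data to several datacenters."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    # One entry per datacenter: name -> number of replicas kept there.
    parts += [f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items()]
    body = ", ".join(parts)
    return f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = {{{body}}};"


# Hypothetical keyspace with three replicas in each of two regions.
statement = create_keyspace_cql("trips", {"region_a": 3, "region_b": 3})
print(statement)
```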
Key components of the architecture include:
Cassandra Framework: An in-house component that manages Cassandra’s lifecycle in Uber’s production environment.
Cassandra Client: Forks of the open-source Go and Java Cassandra clients, adapted to work within Uber’s ecosystem.
Service Discovery: A critical component that helps discover service instances dynamically, eliminating the need for hardcoded configurations.
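To illustrate why dynamic discovery removes the need for hardcoded configuration, here is a minimal in-memory sketch. The ServiceRegistry class, its methods, and the addresses are all hypothetical; Uber’s actual service discovery system is not described in the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ServiceRegistry:
    """Hypothetical registry mapping a cluster name to its live node addresses."""

    _nodes: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, cluster: str, address: str) -> None:
        self._nodes.setdefault(cluster, set()).add(address)

    def deregister(self, cluster: str, address: str) -> None:
        self._nodes.get(cluster, set()).discard(address)

    def resolve(self, cluster: str) -> List[str]:
        # Sorted so callers see a deterministic order.
        return sorted(self._nodes.get(cluster, set()))


registry = ServiceRegistry()
registry.register("trips", "10.0.0.1:9042")
registry.register("trips", "10.0.0.2:9042")

# A node is replaced: the old address leaves, the new one joins.
registry.deregister("trips", "10.0.0.1:9042")
registry.register("trips", "10.0.0.3:9042")

# Clients look up contact points at connect time instead of hardcoding them.
contact_points = registry.resolve("trips")
print(contact_points)
```

Because clients resolve addresses at connect time, a node replacement changes the registry rather than every client’s configuration.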
Challenges and Solutions in Scaling Cassandra
As Uber’s Cassandra service grew, the engineering team faced several reliability challenges:
1. Unreliable Node Replacement
The team improved node replacement reliability by:
- Proactively purging hint files belonging to orphan nodes
- Dynamically adjusting hint transfer rate limiters
- Improving Cassandra’s bootstrap and decommission path
These changes resulted in a 99.99% reliable node replacement process.
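The first two measures can be sketched as small pure functions. This is an illustration under stated assumptions, not Uber’s implementation: the function names, the backlog thresholds, and the throttle bounds are invented for the example.

```python
from typing import Iterable, List


def orphan_hint_targets(hint_targets: Iterable[str], ring_members: Iterable[str]) -> List[str]:
    """Return host IDs that still have hint files but are no longer in the ring."""
    return sorted(set(hint_targets) - set(ring_members))


def next_throttle_kb(current_kb: int, backlog_mb: int,
                     min_kb: int = 1024, max_kb: int = 65536) -> int:
    """Hypothetical policy: speed up hint transfer when the backlog is large, slow down when small."""
    if backlog_mb > 1024:      # assumed "large backlog" threshold
        proposed = current_kb * 2
    elif backlog_mb < 64:      # assumed "small backlog" threshold
        proposed = current_kb // 2
    else:
        proposed = current_kb
    # Clamp so the limiter never drops to zero or grows without bound.
    return max(min_kb, min(max_kb, proposed))


# Node "b" was replaced, so hints addressed to it can never be delivered.
to_purge = orphan_hint_targets(["a", "b", "c"], ["a", "c", "d"])
print(to_purge, next_throttle_kb(1024, backlog_mb=2048))
```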
2. Lightweight Transactions Error Rate
Uber’s engineers enhanced error handling within the Gossip protocol, making Cassandra Lightweight Transactions more robust.
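For readers unfamiliar with the feature: a Lightweight Transaction is a conditional write such as INSERT ... IF NOT EXISTS, which reports whether it was applied and, if not, returns the existing row. The sketch below models only that client-visible behavior in a single process; it involves no Paxos or Gossip, and the function name and sample keys are hypothetical.

```python
from typing import Dict, Optional, Tuple


def insert_if_not_exists(table: Dict[str, str], key: str, value: str) -> Tuple[bool, Optional[str]]:
    """Mimic the outcome of INSERT ... IF NOT EXISTS as (applied, existing_value)."""
    if key in table:
        # Condition failed: nothing is written, the current value is returned.
        return False, table[key]
    table[key] = value
    return True, None


assignments: Dict[str, str] = {}
first = insert_if_not_exists(assignments, "trip-42", "driver-a")
second = insert_if_not_exists(assignments, "trip-42", "driver-b")
print(first, second)
```

The second write is rejected, which is the guarantee applications depend on when the underlying cluster state is handled correctly.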
3. Data Inconsistency Issues
To address data inconsistency, Uber implemented a fully automated repair scheduler within Cassandra itself, reducing operational overhead significantly.
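One way to picture an automated scheduler is a loop that repeatedly picks the node whose last repair is oldest and overdue. The sketch below shows only that selection step. It is a simplified assumption for illustration, not the scheduler Uber built into Cassandra, and the function name, node names, and interval are hypothetical.

```python
from typing import Dict, Optional


def next_node_to_repair(last_repaired: Dict[str, float], now: float,
                        interval_s: float) -> Optional[str]:
    """Pick the overdue node with the oldest repair timestamp, or None if none is due."""
    overdue = {node: ts for node, ts in last_repaired.items() if now - ts >= interval_s}
    if not overdue:
        return None
    # Oldest timestamp first; node name breaks ties deterministically.
    return min(overdue, key=lambda node: (overdue[node], node))


last_repaired = {"n1": 100.0, "n2": 50.0, "n3": 900.0}
order = []
while True:
    node = next_node_to_repair(last_repaired, now=1000.0, interval_s=500.0)
    if node is None:
        break
    order.append(node)
    last_repaired[node] = 1000.0  # record the completed repair
print(order)
```

Repairing one node at a time, oldest first, keeps every node within the chosen interval without an operator tracking timestamps by hand.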
The Cassandra Team’s Responsibilities
A dedicated team manages Uber’s Cassandra platform, with responsibilities including:
- Implementing new features and contributing to the Cassandra community
- Integrating Cassandra into Uber’s ecosystem
- Building and maintaining the managed Cassandra solution
- Ensuring 99.99% availability with 24/7 support
- Guiding application teams on best practices and data modeling
Conclusion
Uber’s success in scaling Cassandra demonstrates the power of incremental improvements. More reliable node replacement, more robust Lightweight Transactions, and automated repair together produced a highly available database service capable of supporting the company’s global operations.
This infrastructure forms the backbone of Uber’s ability to provide reliable transportation and delivery services to millions of users worldwide.