How Uber Scaled Cassandra to Handle Tens of Millions of Queries Per Second

Uber has scaled its Cassandra database service to handle tens of millions of queries per second across petabytes of data. The service underpins millions of rides and deliveries worldwide, so its efficiency and reliability directly affect the company's core business.

The Scale of Uber’s Cassandra Infrastructure

Over its six-year evolution, Uber’s Cassandra database-as-a-service platform has reached the following scale:

Processes tens of millions of queries per second
Manages petabytes of data
Operates across tens of thousands of Cassandra nodes
Supports thousands of unique keyspaces
Maintains hundreds of unique Cassandra clusters, ranging from 6 to 450 nodes per cluster
Provides multi-region support

This scale wasn’t achieved overnight but through years of dedicated engineering efforts and problem-solving.

Architecture of Uber’s Cassandra Setup

Uber’s Cassandra ecosystem spans multiple regions, with data replicated between them. The company’s in-house stateful management system, Odin, handles the configuration and orchestration of thousands of clusters.

Key components of the architecture include:

Cassandra Framework: An in-house development responsible for managing Cassandra’s lifecycle in Uber’s production environment.

Cassandra Client: A forked and adapted version of open-source Go and Java Cassandra clients, tailored to work within Uber’s ecosystem.

Service Discovery: A critical component that helps discover service instances dynamically, eliminating the need for hardcoded configurations.
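
To make the service discovery idea concrete, here is a minimal sketch of a client resolving its Cassandra contact points at connect time rather than reading them from a hardcoded list. The `InMemoryRegistry` class, the `Instance` type, and the service name are hypothetical stand-ins invented for illustration; the source does not describe the actual interface of Uber's service discovery system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    host: str
    port: int
    healthy: bool


class InMemoryRegistry:
    """Hypothetical stand-in for a service discovery backend."""

    def __init__(self):
        self._instances = {}

    def register(self, service, instance):
        # Nodes announce themselves when they start or are replaced.
        self._instances.setdefault(service, []).append(instance)

    def lookup(self, service):
        # Only instances currently marked healthy are returned.
        return [i for i in self._instances.get(service, []) if i.healthy]


def contact_points(registry, service):
    """Resolve contact points dynamically instead of hardcoding them."""
    return [f"{i.host}:{i.port}" for i in registry.lookup(service)]


registry = InMemoryRegistry()
registry.register("cassandra-trips", Instance("10.0.0.1", 9042, True))
registry.register("cassandra-trips", Instance("10.0.0.2", 9042, False))
registry.register("cassandra-trips", Instance("10.0.0.3", 9042, True))
print(contact_points(registry, "cassandra-trips"))
```

Because the client asks the registry each time it connects, replacing a node changes what the lookup returns without any change to client configuration.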

Challenges and Solutions in Scaling Cassandra

As Uber’s Cassandra service grew, the engineering team faced several reliability challenges:

1. Unreliable Node Replacement

The team improved node replacement reliability by:

Proactively purging hint files belonging to orphan nodes
Dynamically adjusting hint transfer rate limiters
Improving Cassandra’s bootstrap and decommission path

These changes resulted in a 99.99% reliable node replacement process.
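
As an illustration of the first item above, the sketch below finds hint files whose target node is no longer part of the ring. It assumes hint files are named `<host-id>-<timestamp>-<version>.hints` and that the caller supplies the set of host IDs currently in the ring; both function names are hypothetical, and this is not Uber's implementation, which lives inside Cassandra itself.

```python
import re
from pathlib import Path

# Assumed file naming: <host-id>-<timestamp>-<version>.hints
HINT_FILE = re.compile(
    r"^(?P<host_id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})-\d+-\d+\.hints$"
)


def orphan_hint_files(hints_dir, ring_host_ids):
    """Return hint files addressed to host IDs that are not in the ring."""
    orphans = []
    for path in sorted(Path(hints_dir).glob("*.hints")):
        match = HINT_FILE.match(path.name)
        if match and match.group("host_id") not in ring_host_ids:
            orphans.append(path)
    return orphans


def purge(paths, dry_run=True):
    """Delete the given files unless dry_run is set; return their names."""
    for path in paths:
        if not dry_run:
            path.unlink()
    return [path.name for path in paths]
```

Defaulting to a dry run is a deliberate choice in this sketch: deleting hints is irreversible, so an operator can inspect the list before anything is removed.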

2. Lightweight Transactions Error Rate

Uber’s engineers enhanced error handling within the Gossip protocol, making Cassandra Lightweight Transactions more robust.

3. Data Inconsistency Issues

To address data inconsistency, Uber implemented a fully automated repair scheduler within Cassandra itself, reducing operational overhead significantly.
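
One way to picture what a repair scheduler has to decide is sketched below: split the token ring into subranges, then always pick the subrange that was repaired longest ago once it is due. The function names, the number of subranges, and the interval are assumptions made for illustration, and the token bounds assume the Murmur3 partitioner. The source does not describe the internals of Uber's scheduler.

```python
# Token bounds of the Murmur3 partitioner (assumed here).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_ring(parts):
    """Split the token space into `parts` contiguous (start, end) subranges."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def next_range_to_repair(last_repaired, now, min_interval):
    """Pick the subrange repaired longest ago, or None if nothing is due.

    last_repaired maps each subrange to the time of its last repair.
    """
    token_range, repaired_at = min(last_repaired.items(), key=lambda kv: kv[1])
    return token_range if now - repaired_at >= min_interval else None


ranges = split_ring(4)
# Hypothetical repair history, in seconds since some epoch.
history = dict(zip(ranges, [100, 50, 200, 150]))
print(next_range_to_repair(history, now=300, min_interval=200))
```

Repairing small subranges in oldest-first order keeps each repair session short and spreads the load over time, which is the kind of work an automated scheduler takes off operators' hands.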

The Cassandra Team’s Responsibilities

A dedicated team manages Uber’s Cassandra platform, with responsibilities including:

Implementing new features and contributing to the Cassandra community
Integrating Cassandra into Uber’s ecosystem
Building and maintaining the managed Cassandra solution
Ensuring 99.99% availability with 24/7 support
Guiding application teams on best practices and data modeling

Conclusion

Uber’s success in scaling Cassandra demonstrates the power of incremental improvements and dedicated engineering. By addressing challenges head-on and developing innovative solutions, Uber has created a robust, highly available database service capable of supporting its massive global operations.

This scalable Cassandra infrastructure forms the backbone of Uber’s ability to provide reliable transportation and delivery services to millions of users worldwide, processing an enormous volume of data with exceptional speed and consistency.
