The Talent500 Blog
walmart's innovative use of MPS

Walmart’s Trillion-Message Kafka Setup: Scaling Challenges and Solutions

Walmart operates a massive Apache Kafka deployment that processes trillions of messages daily with 99.99% availability. This robust setup supports critical operations like data movement, event-driven microservices, and streaming analytics across private and public cloud environments. However, managing Kafka at Walmart’s scale presents unique challenges, including handling traffic spikes and supporting multi-language consumer applications while maintaining high reliability and cost-efficiency.

Key Challenges

1. Consumer Rebalancing

Consumer rebalancing in Kafka occurs when the number of consumer instances in a group changes. Common triggers include Kubernetes deployments, rolling restarts, or automatic scaling. Additionally, issues like missed heartbeats (due to JVM pauses or garbage collection) or delays in processing batches can cause the broker to assume a consumer is unresponsive, prompting rebalancing.

While rebalancing ensures even partition distribution, it disrupts operations and adds latency, especially in Walmart’s fast-paced e-commerce environment.

2. Poison Pill Messages

A “poison pill” refers to a message that repeatedly causes consumer failures, often due to malformed data, unexpected content, or bugs in the consumer code. This creates a loop where the consumer repeatedly fetches and fails to process the message, blocking progress for other messages in the partition.

3. Scaling and Cost Constraints

Kafka’s partition-based model tightly couples the number of partitions to the maximum number of consumers that can read in parallel. Scaling up consumer instances requires increasing partitions, leading to challenges:

  • Kafka brokers have limits on partition numbers, and exceeding these necessitates larger, costlier broker instances.
  • Adding partitions involves coordination among multiple teams, creating overhead in large organizations like Walmart.
  • More partitions increase resource usage, driving up costs.

The Message Proxy Service (MPS)

To address these challenges, Walmart developed a Message Proxy Service (MPS), which decouples Kafka message consumption from its partition-based limitations.

How MPS Works

MPS acts as an intermediary between Kafka and consumer applications, reading messages from Kafka partitions and placing them in an in-memory queue. Consumer applications then fetch these messages via REST APIs, enabling independent scaling of consumers without increasing Kafka partitions.

Components of MPS

  1. Reader Thread
    Reads messages from Kafka and places them in a bounded queue (PendingQueue). If the queue is full, the thread pauses to prevent memory overload.
  2. PendingQueue
    A bounded buffer that allows asynchronous operations, letting the reader thread operate at Kafka’s speed and the writer threads at their pace.
  3. Order Iterator
    Ensures in-order processing of messages with the same key by preventing concurrent handling of such messages.
  4. Writer Threads
    These threads fetch messages from the PendingQueue and deliver them to consumers via HTTP POST. Failed deliveries are retried or moved to a Dead Letter Queue (DLQ) for further analysis.
  5. Offset Commit Thread
    Periodically commits processed message offsets to Kafka, ensuring no duplication or message loss in case of service restarts.
  6. Consumer Service REST API
    Defines the API format for consumer applications, including how they respond to HTTP requests and manage failures.

Benefits of MPS

  1. Improved Rebalancing
    MPS reduces the frequency of rebalancing by maintaining steady Kafka polling through its PendingQueue. It isolates message consumption from consumer application scaling, minimizing disruptions.
  2. Efficient Handling of Poison Pill Messages
    Consumer applications can detect problematic messages and inform MPS, which redirects them to the DLQ, ensuring the rest of the messages are processed smoothly.
  3. Cost Savings and Scalability
    • Stateless Consumer Services: Applications can scale dynamically with Kubernetes based on demand, without pre-allocating resources.
    • Kafka Optimization: Kafka clusters are scaled based on throughput rather than partition count, reducing infrastructure costs.
  4. Simplified Operations
    REST-based communication between MPS and consumer services ensures compatibility with various programming languages and frameworks, promoting flexibility.

Additional Considerations

  1. Rebalancing in MPS
    MPS itself operates as a Kafka consumer and is subject to rebalancing. Its design, however, accommodates this gracefully, with the PendingQueue acting as a buffer during transitions.
  2. REST vs. Other Protocols
    REST was chosen over protocols like gRPC for its simplicity and broad compatibility, despite its higher overhead.
  3. Increased Complexity
    Introducing MPS adds an intermediary layer, requiring additional resources for monitoring and maintenance. However, its benefits outweigh the added complexity.

Conclusion

Walmart’s innovative use of the Message Proxy Service demonstrates how large-scale organizations can enhance Kafka’s capabilities to meet demanding operational requirements. By addressing rebalancing disruptions, poison pill messages, and cost inefficiencies, MPS ensures seamless message processing, scalability, and reduced costs. While it introduces additional complexity, the overall architecture empowers Walmart to maintain high reliability and flexibility in its Kafka deployment.

Read more such articles from our Newsletter here.

1+
Avatar

prachi kothiyal

Add comment