How Databricks Rate Limits Billions of API Calls at Near Zero Latency

In early 2023, every API request at Databricks had to wait for a rate limit check that took 10 to 20 milliseconds. Two network hops. One Redis call. For a platform handling billions of requests, those milliseconds added up to an unacceptable tax on every single user.

The team rewrote their rate limiter from scratch. The result: a 10x reduction in tail latency, zero network calls on the critical path, and a system that handles orders of magnitude more traffic with fewer servers.

The catch? They had to give up something most engineers consider non negotiable: accuracy.

❝

Welcome to Grind Engineer, your guide to becoming a better software engineer! No fluff. Pure engineering insights.

The Old Architecture: Two Hops Too Many

The original setup was textbook. A request hit the Envoy ingress gateway, which forwarded it to a Ratelimit Service, which then queried Redis to check the counter. If the counter was below the threshold, the request passed. If not, it got rejected.

This meant every single API call required two network hops before it could proceed: Envoy to Ratelimit Service, then Ratelimit Service to Redis.

Metric	Old System
Network hops per check	2 (Envoy → Service → Redis)
P99 latency per hop	10 to 20ms
Total rate limit overhead	20 to 40ms per request
Redis	Single instance, single point of failure
Scaling approach	Add more Ratelimit Service instances

The team tried optimizing within this architecture. They configured Envoy with consistent hashing so that requests with the same rate limit key always landed on the same service instance, enabling local counting. It helped, but hit three walls:

Non Envoy services could not participate, fragmenting the rate limit view
Service restarts and scaling events churned hash assignments, forcing syncs back to Redis
Hot keys saturated individual machines while neighbors sat idle

Adding more machines stopped helping. The architecture itself was the ceiling.

How Databricks Rate Limits Billions of API Calls at Near Zero Latency

The Old Architecture: Two Hops Too Many

Reply

Keep Reading

Subscribe to Grind Engineer

How Databricks Rate Limits Billions of API Calls at Near Zero Latency

The Old Architecture: Two Hops Too Many

Subscribe to keep reading

Reply

Keep Reading

Subscribe to Grind Engineer