In early 2023, every API request at Databricks had to wait for a rate limit check that took 10 to 20 milliseconds. Two network hops. One Redis call. For a platform handling billions of requests, those milliseconds added up to an unacceptable tax on every single user.
The team rewrote their rate limiter from scratch. The result: a 10x reduction in tail latency, zero network calls on the critical path, and a system that handles orders of magnitude more traffic with fewer servers.
The catch? They had to give up something most engineers consider non-negotiable: accuracy.
Welcome to Grind Engineer, your guide to becoming a better software engineer! No fluff. Pure engineering insights.

The Old Architecture: Two Hops Too Many
The original setup was textbook. A request hit the Envoy ingress gateway, which forwarded it to a Ratelimit Service, which then queried Redis to check the counter. If the counter was below the threshold, the request passed. If not, it got rejected.
This meant every single API call required two network hops before it could proceed: Envoy to Ratelimit Service, then Ratelimit Service to Redis.
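To make that request path concrete, here is a minimal sketch of the check. It is not Databricks' code: `FakeRedis`, `RatelimitService`, `envoy_handle`, the `workspace-42` key, and the fixed-window policy are all assumptions made for illustration, and the two network hops are modeled as plain function calls.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRedis:
    """In-memory stand-in for Redis (assumption: the real store is a network call away)."""
    counters: dict = field(default_factory=dict)

    def incr(self, key: str) -> int:
        # Mirrors the semantics of an atomic increment: bump and return the new value.
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]


@dataclass
class RatelimitService:
    """Hypothetical rate limit service. Hop 2 is the call from here to the store."""
    store: FakeRedis
    limit: int               # max requests per window (assumed policy)
    window_seconds: int = 1  # fixed-window size (assumed, not from the article)

    def should_allow(self, key: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        count = self.store.incr(f"{key}:{window}")
        return count <= self.limit


def envoy_handle(service: RatelimitService, key: str, now: float) -> int:
    """Hop 1: the gateway asks the service before letting the request through."""
    return 200 if service.should_allow(key, now) else 429


service = RatelimitService(store=FakeRedis(), limit=3)
statuses = [envoy_handle(service, "workspace-42", now=100.0) for _ in range(5)]
print(statuses)  # [200, 200, 200, 429, 429]
```

In the real system each of those two calls crosses the network, so the request cannot proceed until both round trips finish.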
| Metric | Old System |
|---|---|
| Network hops per check | 2 (Envoy → Service → Redis) |
| P99 latency per hop | 10 to 20 ms |
| Total rate limit overhead | 20 to 40 ms per request |
| Redis | Single instance, single point of failure |
| Scaling approach | Add more Ratelimit Service instances |
The team tried optimizing within this architecture. They configured Envoy with consistent hashing so that requests with the same rate limit key always landed on the same service instance, enabling local counting. It helped, but hit three walls:
- Non-Envoy services could not participate, fragmenting the rate limit view
- Service restarts and scaling events churned hash assignments, forcing syncs back to Redis
- Hot keys saturated individual machines while neighbors sat idle
Adding more machines stopped helping. The architecture itself was the ceiling.
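Two of those walls, churn on scaling and hot keys, can be reproduced with a toy hash ring. This is a rough sketch, not Envoy's actual hashing policy: the `HashRing` class, the `rl-N` instance names, the `tenant-N` keys, and the 100 virtual nodes per instance are all assumptions.

```python
import bisect
import hashlib
from collections import Counter


def _hash(value: str) -> int:
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"tenant-{i}" for i in range(1000)]
before = HashRing(["rl-0", "rl-1", "rl-2"])
after = HashRing(["rl-0", "rl-1", "rl-2", "rl-3"])

# Scaling event: some keys change owner, so their local counts must be re-synced.
moved = sum(before.owner(k) != after.owner(k) for k in keys)
print(f"{moved} of {len(keys)} keys changed owner after adding one instance")

# Hot key: every request for one key lands on one instance, however many exist.
traffic = ["tenant-7"] * 900 + keys[:100]
load = Counter(before.owner(k) for k in traffic)
print(load.most_common())
```

Consistent hashing limits the churn to the keys that move to the new instance, but it cannot spread a single hot key across machines, which is why adding instances stopped helping.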

