
Welcome to Hello Engineer, your weekly guide to becoming a better software engineer! No fluff - pure engineering insights.
On November 18, 2025, a huge portion of the Internet suddenly felt “down.” The reason? Cloudflare, the service that sits between millions of websites and their visitors, experienced a major multi-hour outage that triggered widespread 5xx errors across the web.
TL;DR
Cloudflare went down because a database permission change accidentally doubled the size of a machine-learning “feature file” used by its Bot Management system. The proxy that loads this file has a hard limit, and the oversized file caused it to crash, returning 5xx errors across the network. Once Cloudflare figured this out, they stopped generating the bad file, pushed a known-good one, restarted the proxy, and services gradually recovered.
How Cloudflare Fits Into the Internet
Many websites don’t communicate directly with your browser. Your request hits Cloudflare first, which acts as a reverse proxy: it handles TLS, caching, performance tuning, DDoS protection, and bot filtering, then routes traffic to the origin server.
In short, Cloudflare sits in front of a huge chunk of the world’s websites. When Cloudflare breaks, the impact is global.

Why the Outage Happened: The Root Cause
This wasn’t a cyberattack. It began with a small internal permission change in Cloudflare’s ClickHouse database cluster.
Cloudflare’s Bot Management system relies on a “feature file” that is regenerated every few minutes. This file contains a list of ML features used to score requests. After the permission update, the query that builds this file started returning duplicate rows, which doubled the number of features.
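To make the duplication problem concrete, here is a minimal sketch of how a generator could drop duplicate rows before writing the feature file. It is not Cloudflare's code: the `FeatureRow` struct, its fields, and `dedup_features` are hypothetical names chosen for illustration, and the real query and file format are not described in this post.

```rust
use std::collections::HashSet;

/// One row returned by the query that builds the feature file.
/// The struct and its field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct FeatureRow {
    pub name: String,
    pub kind: String,
}

/// Keep only the first occurrence of each feature name, preserving order.
/// If the query starts returning every row twice, the output stays the same size.
pub fn dedup_features(rows: Vec<FeatureRow>) -> Vec<FeatureRow> {
    let mut seen = HashSet::new();
    rows.into_iter()
        // `insert` returns false when the name was already present.
        .filter(|row| seen.insert(row.name.clone()))
        .collect()
}
```

A guard like this treats the query result as untrusted input, so a change in what the database returns cannot silently change the size of the generated file.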
Cloudflare’s proxy (FL and the newer FL2) has a limit of 200 features for memory preallocation. The duplicated file exceeded that limit, causing the module loading it to panic. Once that happened, the proxy began returning 5xx errors for traffic that needed the Bot Management module, eventually affecting most requests.
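The limit itself can be sketched as a loader that preallocates space once and reports an error when the file is too large. This is an illustration under stated assumptions, not the FL2 implementation: the one-feature-per-line format, `LoadError`, and `load_features` are invented here; only the limit of 200 comes from the description above.

```rust
/// Hard cap on the number of features; 200 matches the limit described above.
pub const MAX_FEATURES: usize = 200;

/// Hypothetical error type for a rejected feature file.
#[derive(Debug, PartialEq)]
pub enum LoadError {
    TooManyFeatures { found: usize, limit: usize },
}

/// Parse a feature file with one feature name per line (an assumed format).
pub fn load_features(contents: &str) -> Result<Vec<String>, LoadError> {
    // Reserve memory once, up front, for at most MAX_FEATURES entries.
    let mut features = Vec::with_capacity(MAX_FEATURES);
    for line in contents.lines().map(str::trim).filter(|l| !l.is_empty()) {
        if features.len() == MAX_FEATURES {
            // Count all non-empty lines so the error reports the real size.
            let found = contents.lines().filter(|l| !l.trim().is_empty()).count();
            return Err(LoadError::TooManyFeatures { found, limit: MAX_FEATURES });
        }
        features.push(line.to_string());
    }
    Ok(features)
}
```

Returning a `Result` leaves the decision to the caller: a file with 400 entries yields `TooManyFeatures { found: 400, limit: 200 }` rather than bringing down the process.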
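The file was regenerated every five minutes, and the permission change was rolling out gradually across the database cluster, so a given run could produce either a “good” or a “bad” file depending on which part of the cluster answered the query. The network briefly recovered whenever a good file was generated, then failed again when a bad one appeared. That on-and-off pattern led teams to initially suspect a large-scale attack.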

What the Investigation Looked Like
Internally, Cloudflare engineers first saw:
• spikes in 5xx errors
• Workers KV failures
• Access authentication issues
• unusual CPU load due to error reporting
• even their off-platform status page going down (an unrelated coincidence)
The intermittent nature confused things: sometimes everything looked fine, and sometimes it crashed again. Early thinking leaned toward a massive DDoS, especially given recent Aisuru attacks.
As the investigation deepened, a clearer picture emerged. KV wasn’t the root cause; it was a downstream victim of proxy failures. The proxy was crashing when loading the Bot Management feature file. That file was malformed because the database query started producing duplicates. And that duplication was caused by the permission change. Once that chain connected, the real cause became obvious.
According to Cloudflare's post-mortem, when the bad file with more than 200 features was propagated to its servers, this limit was hit and the system panicked. The check lives in FL2's Rust code, where the resulting error went unhandled, and the panic in turn surfaced as a 5xx error.
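To illustrate the difference between an unhandled error and a handled one, here is a minimal sketch. It is not Cloudflare's code: `BotConfig`, `refresh`, and `refresh_or_panic` are hypothetical names, and the error is modeled as a plain `String` for brevity. The first function shows the general failure mode, where calling `unwrap()` on an `Err` panics; the second keeps the last known-good feature set instead.

```rust
/// The failure mode: `unwrap()` turns any `Err` into a panic.
pub fn refresh_or_panic(loaded: Result<Vec<String>, String>) -> Vec<String> {
    loaded.unwrap()
}

/// Holds the feature set currently in use. All names here are hypothetical.
pub struct BotConfig {
    active: Vec<String>,
}

impl BotConfig {
    pub fn new(initial: Vec<String>) -> Self {
        Self { active: initial }
    }

    /// Try to swap in a freshly loaded feature set.
    /// On error, keep the previous set and return false instead of panicking.
    pub fn refresh(&mut self, loaded: Result<Vec<String>, String>) -> bool {
        match loaded {
            Ok(features) => {
                self.active = features;
                true
            }
            Err(reason) => {
                eprintln!("rejected new feature file, keeping previous one: {}", reason);
                false
            }
        }
    }

    /// The feature set requests are currently scored with.
    pub fn active(&self) -> &[String] {
        &self.active
    }
}
```

With the second approach a bad file degrades freshness (requests are scored with slightly older features) rather than availability, which is usually the better trade for a component on the request path.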
How Cloudflare Resolved It
Once they understood the issue, the response happened in several steps:
• They stopped generating and distributing the bad feature file.
• They injected a known-good version of the file into the distribution pipeline.
• They restarted the proxy so it would load the good file.
• Workers KV and Access were temporarily shifted to fallback paths to avoid relying on the failing proxy.
By 14:30 UTC, core traffic started recovering. By 17:06 UTC, all downstream services were stable again.

Final Thoughts
Cloudflare described this as their worst outage since 2019. A small internal permissions change led to an oversized config file, which caused a core part of their proxy to crash. It is a classic example of how tiny changes in distributed systems can ripple into massive failures.
Incidents like this are painful, but they lead to stronger, more resilient infrastructure for the entire Internet.
See you next week with more exciting content!
Signing Off, Scortier
