Cloudflare explains outage that disrupted customer traffic; cause traced to oversized bot feature file

After thousands of sites displayed error pages yesterday due to a Cloudflare outage, the company has now explained that the problem originated from its system designed to protect websites from DDoS attacks.

Cloudflare’s “Bot Management” system is designed to block automated threats: DDoS attacks such as traffic flooding, which can crash websites, and content-scraping attempts, which extract data without authorisation.

The system uses an AI model to score incoming traffic requests. Each time a visitor tries to access a website, the AI assigns a score to determine whether the request comes from a bot. The model considers various features of the request, which are stored in a “feature file.”

This feature file is refreshed every five minutes to keep up to date with evolving bot behaviours and is used across Cloudflare’s network. However, a recent change caused the system to duplicate the information many times, making the feature file unusually large and ultimately triggering the outage.
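To illustrate the failure mode described above, here is a minimal Python sketch of a loader that rejects a feature file once duplicated rows push it past a fixed capacity. It is not Cloudflare's code: the cap of 200, the feature names, and the functions `load_features` and `dedupe` are all hypothetical and chosen only for illustration.

```python
MAX_FEATURES = 200  # hypothetical cap, chosen only for illustration


class FeatureFileTooLarge(Exception):
    """Raised when a feature file has more rows than the loader allows."""


def load_features(rows, max_features=MAX_FEATURES):
    """Strict loader: refuse any file with more rows than the cap."""
    if len(rows) > max_features:
        raise FeatureFileTooLarge(
            f"{len(rows)} features exceeds limit of {max_features}"
        )
    return list(rows)


def dedupe(rows):
    """Drop repeated feature names, keeping the first occurrence of each."""
    return list(dict.fromkeys(rows))


# A normal file fits comfortably under the cap.
normal = [f"feature_{i}" for i in range(60)]  # hypothetical feature names
loaded = load_features(normal)

# The same 60 features written out four times no longer fit.
duplicated = normal * 4
try:
    load_features(duplicated)
    rejected = False
except FeatureFileTooLarge:
    rejected = True

# Removing the duplicates restores the original 60 features.
recovered = load_features(dedupe(duplicated))
```

In this sketch the duplicated file carries no new information, yet its size alone is enough to make the strict loader fail.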

As a result, attempts to access the many websites protected by this tool generated error pages.

Cloudflare CTO Dane Knecht posted on X (formerly Twitter), admitting that the company had let customers and “the broader internet” down, and apologised to users.

The CTO explained that “a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation of our network and other services.”

“The issue was not caused, directly or indirectly, by a cyberattack or malicious activity of any kind,” Cloudflare wrote.

Several high-profile customers were affected during the incident. OpenAI reported a “full outage” impacting ChatGPT, its APIs and Sora, which has since been resolved; design tool Canva said users were unable to load the site due to issues at its CDN provider (Cloudflare); dating app Grindr logged a “Cloudflare Outage”; and Dropbox’s DocSend noted degraded performance tied to the Cloudflare event.

Impacted services included core CDN/security, Turnstile, Workers KV, and Access; authentication to the dashboard also failed during parts of the incident, Cloudflare said.

The firm said it is following up on the incident by “hardening ingestion of Cloudflare-generated configuration files”, enabling more global kill switches for features, eliminating the ability for core dumps or other error reports to overwhelm system resources, and reviewing failure modes for error conditions across all core proxy modules.
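As a rough illustration of two of those measures, the Python sketch below validates each new feature file before using it, keeps the last known-good version when validation fails, and exposes a global kill switch. It is not Cloudflare's implementation: the `BotScorer` class, its fields, the cap of 200 and the toy scoring rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class BotScorer:
    """Hypothetical scorer that treats its own config files as untrusted input."""

    max_features: int = 200   # assumed cap, for illustration only
    enabled: bool = True      # global kill switch for the feature
    features: list = field(default_factory=list)

    def ingest(self, rows):
        """Validate a new feature file; return True if it was accepted.

        On failure the previously accepted file stays in use.
        """
        unique = list(dict.fromkeys(rows))  # drop duplicates, keep order
        if not unique or len(unique) > self.max_features:
            return False
        self.features = unique
        return True

    def score(self, request):
        """Toy score: the share of known features present in the request.

        Returns None when the kill switch is off or no file is loaded,
        so callers can let traffic through unscored instead of failing.
        """
        if not self.enabled or not self.features:
            return None
        hits = sum(1 for name in self.features if request.get(name))
        return hits / len(self.features)


scorer = BotScorer(max_features=3)
accepted = scorer.ingest(["a", "b"])                  # valid file
rejected = not scorer.ingest(["a", "b", "c", "d"])    # too large, ignored
score_on = scorer.score({"a": True})                  # scored with last good file

scorer.enabled = False                                # operator flips the kill switch
score_off = scorer.score({"a": True})
```

The design choice here is that a bad file or a disabled feature degrades to "no score" rather than an error, which is one way to stop a single module from taking down the request path.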