The wide-scale Cloudflare outage last month that took much of the internet offline was more than a temporary disruption. It was a reminder of the fragile state of global IT resilience. And we were reminded of that fragility again in December by another, albeit smaller, outage caused by a configuration change. For all that our digital world has become more interconnected, interdependent, and AI-driven than ever, the system is concerningly shaky.
The Cloudflare incident wasn’t the first highly visible outage of the year. Major disruptions at AWS in October and a global ChatGPT (OpenAI) outage in June show how even the largest cloud and AI systems are prone to disruption.
And, if enterprise dependence on AI continues accelerating at its current pace, these outages are likely to become increasingly common. Ultimately, if AI is going to sit at the center of every application and workflow, it needs to be treated like core IT infrastructure. I would even suggest that it needs to be treated as core foundational infrastructure like power and cooling. This means investment in AI resilience is essential.
The evolution of outage response and AI dependency
Years ago, when an outage struck, the response process was clear. It meant assembling incident first responders and pulling in every domain expert who might hold a missing piece of the puzzle. Together they’d troubleshoot SWAT-team style and work tirelessly to identify the fault.
Fast forward to today – cloud and SaaS were supposed to remove that uncertainty, or at least outsource the work to a subject matter group to reduce the occurrences and improve the responses. Instead, IT infrastructure complexity has grown exponentially, alongside our dependence on AI.
As organizations lean further into AI, every workflow has become increasingly intertwined with LLMs, assistants, and automated decision-making. In parallel, the human operational knowledge that once solved complex issues is shrinking.
Much of the institutional troubleshooting skill that kept systems running has been replaced with AI-driven tooling, and those tools themselves rely on uninterrupted access to the very AI systems that are now failing more frequently.
The moment the AI model goes offline, the network path breaks, or an automation flow halts, our dependency becomes painfully visible. Unlike individual applications, these processes are now truly digital employees that we depend upon to run the business, and they can’t take unplanned PTO.
During the first Cloudflare incident, the consequence of AI dependency was seen firsthand. The outage didn’t strike a single application or department. It hit dozens at once: productivity suites, AI-driven automations, copilots, ticketing systems, and entire SaaS ecosystems.
When one foundational service faltered, it took an enormous slice of the digital workplace down with it. This is the new shape of disruption when outages occur. It’s the equivalent of a departmental sick-out or an industry-wide strike.
Today’s outages don’t respect organizational boundaries. They cascade across vendors, workflows, and business functions simultaneously. Ultimately, every minute of downtime now affects tens or even hundreds of thousands of employees globally.
Outages and the need for real-time visibility
Real-time visibility into IT infrastructure is key to combating disruption from wide-scale outages. IT teams can’t afford to wait for help-desk tickets to accumulate, or for service providers to confirm the existence of an issue. Organizations need instant clarity to answer the most important questions. Where is the problem coming from? Who is affected? When did it start? What systems are compromised? Has the resolution been confirmed? The longer those questions remain unanswered, the more productivity, revenue, and customer trust slip away.
However, real-time visibility alone isn’t enough. These outages are highlighting a deeper vulnerability: when AI-driven systems fail, the blast radius reaches directly into the employee experience, and in turn the consistency and predictability of business results. If IT teams don’t have a clear understanding of what employees are experiencing in the moment, they can’t accurately assess the impact or respond with precision. This is why real-time digital employee experience (DEX) monitoring is no longer optional – it’s the human-centric counterpart to infrastructure monitoring.
If AI is going to sit at the center of every application, workflow, and business function, then organizations must start treating AI dependence with the same seriousness typically reserved for core infrastructure. IT can’t just monitor performance anymore. It needs to safeguard the operational intelligence of the organization, identify and classify AI-related risk, and develop mitigation strategies for AI downtime.
Human-centric operations, grounded in continuous DEX insight, give IT the context required to understand how outages affect employees, what work is blocked, and where to triage first. With real-time visibility paired with DEX monitoring across the entire digital estate, IT teams can detect, classify, escalate, and mitigate risks before they grow into global service disruptions.
Organizations must also accept that outages of this nature are not anomalies. They are a result of hyper-interconnected digital ecosystems. As AI becomes embedded into every business process, these incidents will only become more frequent and more consequential unless enterprises change their approach.
Preparing for the next decade of digital resilience
The Cloudflare outage wasn’t just a disruption; it was another warning siren for what the next decade of IT resilience needs to account for. Organizations must be willing to confront the fragility of AI dependence head on and invest in the tools, processes, and human-centric visibility required to navigate an increasingly interconnected world. Organizations that take this moment as a call to action will be prepared no matter what the next AI outage brings.
By Tim Flower, VP of DEX strategy at Nexthink