Facebook’s global outage: What happened?
Facebook has blamed a “technical issue” for a worldwide outage that brought down the social media giant’s primary platform along with WhatsApp and Instagram, and also disabled employees’ work passes and email.
The services went down at around 4pm GMT on Monday, and WhatsApp users complained they were unable to send or receive messages until around 10pm. Downdetector recorded some 10.6 million problem reports around the world.
Facebook issued a statement on Tuesday confirming that the cause of the outage was a configuration change to the backbone routers that coordinate network traffic between the company’s data centres, which had a cascading effect, bringing all Facebook services to a halt.
In a detailed blog post on the causes, Cloudflare pointed to something that will be familiar to anyone who knows how the internet works: Border Gateway Protocol (BGP).
BGP exchanges routing information between autonomous systems, the large independently operated networks that make up the internet. This lets routers keep up with the constantly changing list of possible routes along which packets can be delivered. To put it another way, without BGP, the internet wouldn’t function.
In a blog apologising for the outage, Facebook VP of infrastructure Santosh Janardhan wrote: “Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centres caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centres communicate, bringing our services to a halt.”
The error occurred when an update to Facebook’s systems effectively told BGP (and, through it, the rest of the internet) that Facebook (the platform, not the company) no longer existed as a destination for internet traffic.
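To make the idea of a withdrawn route concrete, here is a minimal Python sketch of a toy routing table. It is an illustration only, not how BGP or Facebook's routers are actually implemented: the `RoutingTable` class, the address prefixes and the AS numbers are all hypothetical (the values come from ranges reserved for documentation). It shows that once a network's prefix is withdrawn, a lookup for an address inside it finds no route at all.

```python
import ipaddress


class RoutingTable:
    """Toy routing table mapping announced prefixes to an origin AS number.

    Hypothetical illustration only; real BGP tracks paths, policies and peers.
    """

    def __init__(self):
        self.routes = {}  # ip_network -> origin AS number

    def announce(self, prefix, origin_as):
        """Record that origin_as says it can deliver traffic for prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        """Remove a previously announced prefix, if present."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin AS for the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the destination is unreachable
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", 64500)     # hypothetical network A
table.announce("198.51.100.0/24", 64501)  # hypothetical network B

before = table.lookup("192.0.2.53")  # a route exists
table.withdraw("192.0.2.0/24")       # the faulty update removes it
after = table.lookup("192.0.2.53")   # the address can no longer be reached

print(before, after)
```

In this sketch, network B stays reachable throughout; only the addresses behind the withdrawn prefix disappear, which is roughly the situation the rest of the internet saw during the outage.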
Back in 2019, Facebook announced plans to move platforms it had acquired, such as Instagram and WhatsApp, onto the infrastructure it already used for Facebook and Messenger. This made sense at the time, as it would allow the company to update all of its technology at once, but it was also controversial because of privacy concerns.
Because all of its platforms sit on Facebook infrastructure, the update effectively told the internet that those platforms no longer existed either, meaning users could not access WhatsApp, Instagram or Messenger.
Worse for Facebook, perhaps, the outage also hit the company’s internal systems, with the New York Times reporting that Facebook staff were locked out of offices and unable to use their internal comms platform. This may have slowed the implementation of a fix that should have been relatively simple: another update could have re-established Facebook as an internet route.
Some reports claim that even staff security passes were rendered unusable by the outage. To fix it, the company reportedly sent staff to its core data centre in California to manually reset the servers where the problem originated.
A key lesson for businesses is the inherent risk of having a single point of failure in their technology architecture. Because Facebook runs all of its systems, including its internal tools, on its own infrastructure, there was no quick fix when that infrastructure went down.
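One rough way for a business to look for this kind of risk is to write down what each system depends on and see what remains if one component fails. The Python sketch below does that for a made-up dependency map: the component names are hypothetical and loosely inspired by the reports above, not a description of Facebook's actual architecture.

```python
def unavailable_after(failed, depends_on):
    """Return every component that is down once `failed` goes down.

    A component counts as down if anything it depends on is down.
    """
    down = {failed}
    changed = True
    while changed:  # keep sweeping until no new failures are found
        changed = False
        for component, deps in depends_on.items():
            if component not in down and any(d in down for d in deps):
                down.add(component)
                changed = True
    return down


# Hypothetical dependency map: each component lists what it relies on.
depends_on = {
    "backbone": [],
    "public_apps": ["backbone"],
    "internal_comms": ["backbone"],
    "door_badges": ["backbone"],
    "remote_fix_tools": ["backbone"],
    "onsite_console": [],  # physical access, no network dependency
}

down = unavailable_after("backbone", depends_on)
still_up = sorted(set(depends_on) - down)

print(sorted(down))
print(still_up)
```

In this made-up map, the only thing left standing after the backbone fails is the on-site console, which mirrors the reported need to send staff to a data centre in person. Giving recovery tools a path that does not depend on the system they are meant to repair is one way to reduce that risk.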
Though these outages are rare, they do happen from time to time – a 2019 outage saw Facebook and its other services down for more than 14 hours, for example – and they can be difficult to avoid entirely.
They can also be very expensive. The outage came just a day after former Facebook civic integrity product manager Frances Haugen went public with explosive allegations that Facebook had prioritised growth and profit over public safety. According to the business website Fortune, the incidents cost Facebook founder Mark Zuckerberg an estimated $6bn (£4.4bn) at one point as shares in the company plummeted over recent days.