Remember that Facebook Outage in October?

Guest Blogger Chris Bonatti, an IECA Consultant,
Explains What Happened

Foreword by CyberWyoming: This is the type of procedure we hash out in Wyoming’s Cybersecurity Competition for Small Businesses – what to do if the worst happens. Join the competition at www.cyberwyoming.org/competition. (IECA’s membership makes the Competition free to all Wyoming small companies.)

On 4 October, a configuration error at Facebook caused their routers to send out Border Gateway Protocol (BGP) advertisements for Autonomous System (AS) number 32934 that updated routing instructions for Facebook’s networks to refer to something that no longer worked. Since every such router shares its routing table with its peers, these instructions were relayed via BGP, router to router, until the misinformation had spread globally. Consequently, traffic to Facebook crashed to zero over the course of about five minutes. Other Facebook-owned services such as Instagram and WhatsApp were taken out as well. This is a story we’ve heard before. Usually the problem is rapidly diagnosed and set right. However, in Facebook’s case, the story didn’t end there.
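
To get a feel for how one bad advertisement can ripple outward, here is a deliberately simplified Python sketch of peer-to-peer route propagation. It is a toy model only: the peering graph and every AS number except 32934 are made up, and real BGP adds path attributes, policies, and loop prevention that are omitted here.

```python
# Toy model of how a BGP-style advertisement propagates peer to peer.
# Illustration only; real BGP involves policies, path attributes, and
# loop prevention that this sketch leaves out.

from collections import deque

# Hypothetical peering graph: AS number -> set of directly peered ASes.
# All ASes other than 32934 (Facebook's) are invented for the example.
PEERS = {
    32934: {64501, 64502},
    64501: {32934, 64503, 64504},
    64502: {32934, 64504},
    64503: {64501, 64505},
    64504: {64501, 64502, 64505},
    64505: {64503, 64504},
}

def propagate(origin_as, route):
    """Flood a route advertisement from origin_as to every reachable AS."""
    routing_tables = {origin_as: route}
    queue = deque([origin_as])
    while queue:
        current = queue.popleft()
        for peer in PEERS[current]:
            if peer not in routing_tables:   # each AS accepts the route once
                routing_tables[peer] = route
                queue.append(peer)
    return routing_tables

# One defective advertisement from AS 32934 reaches every AS in the graph.
tables = propagate(32934, "129.134.0.0/17 -> (unreachable next hop)")
print(f"{len(tables)} ASes now carry the bad route")
```

The point of the toy is just the reachability math: once each peer accepts and relays the entry, a single defective advertisement ends up in every routing table in the graph.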

Initial reports indicated that Facebook was suffering from a Domain Name System (DNS) outage, but the fall-off in traffic to Facebook was too abrupt for DNS to be responsible. Facebook’s authoritative DNS servers were presumably still up, but were now simply unreachable due to the routing error.
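
One crude way to tell those two failure modes apart from the outside is sketched below, using only Python’s standard library; the hostname is just an example, and a real diagnosis would use many vantage points rather than a single check. If the name fails to resolve, suspect DNS; if it resolves but the address won’t accept connections, suspect routing or the service itself.

```python
import socket

def diagnose(hostname, port=443, timeout=5):
    """Crude outside-in check: is the problem DNS, or reachability?"""
    # Step 1: does the name still resolve?
    try:
        addr = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)[0][4][0]
    except socket.gaierror:
        return f"{hostname}: DNS resolution failed"

    # Step 2: the name resolved; can we actually reach the address?
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return f"{hostname}: resolves to {addr} and is reachable"
    except OSError:
        return f"{hostname}: resolves to {addr} but is unreachable (routing/service problem)"

if __name__ == "__main__":
    # Example hostname; swap in whatever service you are checking.
    print(diagnose("www.facebook.com"))
```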

It turned out that once the Internet was cut off, Facebook employees working remotely were unable to help fix the problem. Those few who did have physical access did not have the authorizations required to fix the BGP routers. Access badges at Facebook wouldn’t even open doors, as their whole system is extensively Internet-based. Facebook said that their systems are designed to audit commands like this BGP update to prevent mistakes, but a bug in that audit tool prevented it from properly stopping the fatal command. They also said that, to ensure reliable operation, their DNS servers automatically disable BGP advertisements if they cannot contact Facebook’s data centers, as this is an indication of an unhealthy network connection. With the backbone out of operation, these locations declared themselves unhealthy and withdrew their BGP advertisements. This too probably inhibited recovery. The total loss of connectivity also broke many of the internal tools they would normally use to investigate and resolve outages, so they had to send engineers to their data centers in person.
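
That self-protective behaviour can be pictured with a small sketch like the one below. The function name, threshold, and probe sequence are all invented for illustration (this is not Facebook’s tooling); it simply shows how a blanket “withdraw when unhealthy” rule makes every edge node pull its routes at once when the backbone itself is what failed.

```python
# Hypothetical sketch of a "withdraw when unhealthy" rule on a DNS edge node.
# Names, thresholds, and the simulated probe results are invented; this is
# only meant to illustrate the general idea, not any real system.

MAX_CONSECUTIVE_FAILURES = 3

def run_health_checks(probe_results):
    """Walk through a sequence of probe results (True = backbone reachable)
    and decide when this node would withdraw or re-advertise its BGP routes."""
    failures = 0
    advertising = True
    for tick, healthy in enumerate(probe_results):
        failures = 0 if healthy else failures + 1

        # The catch: if the backbone itself is down, every edge node sees the
        # same failed probe, so they all withdraw their routes together.
        if failures >= MAX_CONSECUTIVE_FAILURES and advertising:
            print(f"tick {tick}: withdrawing BGP advertisements")
            advertising = False
        elif healthy and not advertising:
            print(f"tick {tick}: backbone reachable again, re-advertising")
            advertising = True

# Simulated outage: backbone reachable, then down for a stretch, then back.
run_health_checks([True, True, False, False, False, False, True])
```

Running it prints a withdrawal a few ticks after the simulated backbone goes down and a re-advertisement once it recovers, which is helpful for an isolated edge failure and distinctly unhelpful when the whole backbone disappears.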

All told, Facebook services were out for only about six hours, but it’s a serious problem that this kind of issue is so easy to create with BGP… all it takes is one defective entry. It also illustrates why you should avoid single points of failure… no matter how global they may seem.
