The unthinkable happened yesterday: Facebook, Instagram, and WhatsApp all went down at the same time. While the rest of us twiddled our thumbs and wondered how we could possibly keep in touch with friends, family, and people we had not seen since high school, Facebook’s servers were in a state of emergency. So what exactly happened?
What we know so far comes from a combination of leaks from “insiders,” a brief and vague blog post from Facebook itself, and a superb write-up by Cloudflare, a web infrastructure company.
UNDERSTANDING DNS AND BGP
To the untrained eye, Facebook appeared to have vanished from the Internet. Users who tried to load the site received an error message, and the servers could not be reached at all. That is an exceptional occurrence for a company as large as Facebook. We now know that the outage was caused by a configuration change to Facebook’s backbone routers, which carry traffic between its data centers. Communication between the data centers stopped, and all of Facebook’s services went down with it. The situation was made worse by an inopportunely timed failure of the building’s card readers, which reportedly kept staff from entering the facility and fixing the problem.
Let us look at the timeline to see what went wrong, at least from the outside. Facebook, like every other network on the Internet, relies on advertising its presence to other networks so that traffic can find its way to the platform. The Internet uses the Border Gateway Protocol (BGP) to accomplish this. BGP is the mechanism that determines how data is routed across the Internet, similar to how the postal service determines how your letter reaches another country. BGP governs how data moves between networks; without it, the Internet would collapse.
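To make the idea of advertising routes a little more concrete, here is a minimal Python sketch of a routing table. It is a toy model, not how real BGP routers are implemented: the RouteTable class, the network names, and the address ranges (taken from blocks reserved for documentation) are all made up for illustration. It shows only the basic idea that a network announces address prefixes, other networks pick the most specific matching prefix, and a withdrawn prefix simply stops matching.

```python
import ipaddress


class RouteTable:
    """Toy stand-in for the routes a network has learned from its neighbors."""

    def __init__(self):
        self.routes = {}  # announced prefix -> name of the announcing network

    def announce(self, prefix, origin):
        """Record that `origin` advertises reachability for `prefix`."""
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        """Forget a previously announced prefix."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


# Hypothetical networks using documentation-only address space.
table = RouteTable()
table.announce("203.0.113.0/24", "ExampleNet")
table.announce("203.0.113.128/25", "ExampleNet-East")

print(table.lookup("203.0.113.200"))  # ExampleNet-East (more specific /25 wins)
print(table.lookup("203.0.113.5"))    # ExampleNet

table.withdraw("203.0.113.128/25")
table.withdraw("203.0.113.0/24")
print(table.lookup("203.0.113.200"))  # None: nobody advertises a route anymore
```

Once both prefixes are withdrawn, the address has not gone anywhere, but no other network knows how to deliver traffic to it.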
The Domain Name System (DNS) is another important part of the Internet. DNS is the Internet’s Yellow Pages: it translates names humans can understand into the numeric addresses computers use. The Internet, for example, sees an address such as “66.220.144.0” (among others), while we type “www.facebook.com” and DNS servers kindly translate it; otherwise, we would be left with an unintelligible jumble of numbers. Here is how the two work together: if you Google ‘Facebook,’ you will see ‘www.facebook.com.’ When you click it, DNS servers translate the domain name into an IP address, and your request is then carried across the Internet along the routes Facebook has advertised through BGP. I realize that is a lot of acronyms.
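The two steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a real resolver or router: the DNS_RECORDS table, the ANNOUNCED_PREFIXES list, the visit function, and the example.com names and addresses are all hypothetical, and real lookups involve caching and several servers along the way.

```python
import ipaddress

# Hypothetical data for illustration only (documentation address range).
DNS_RECORDS = {"www.example.com": "192.0.2.10"}
ANNOUNCED_PREFIXES = [ipaddress.ip_network("192.0.2.0/24")]


def resolve(name):
    """Step 1 (DNS): translate a domain name into an IP address, or None."""
    return DNS_RECORDS.get(name)


def is_routable(address):
    """Step 2 (BGP): check whether any announced prefix covers the address."""
    addr = ipaddress.ip_address(address)
    return any(addr in prefix for prefix in ANNOUNCED_PREFIXES)


def visit(name):
    """Describe what happens when a user tries to open `name`."""
    address = resolve(name)
    if address is None:
        return f"{name}: DNS lookup failed"
    if not is_routable(address):
        return f"{name}: resolved to {address}, but no route to it"
    return f"{name}: reachable at {address}"


print(visit("www.example.com"))  # www.example.com: reachable at 192.0.2.10
print(visit("www.example.org"))  # www.example.org: DNS lookup failed
```

Both steps have to succeed: a name that does not resolve never gets as far as routing, and a resolved address is useless if no route to it has been advertised.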
WHY DID FACEBOOK GO DOWN?
Let us return to the Facebook outage. When the configuration of Facebook’s routers was changed, the company stopped announcing the routes to its DNS servers, which points to a BGP issue. Some Facebook IP addresses were still reachable, but they were virtually useless without DNS servers to translate names into them. According to what we know now, Facebook effectively withdrew its own BGP routes, disconnecting itself from the Internet.
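The failure mode described here can be mimicked with a small Python model. It is only a sketch under assumptions: the prefixes, the name server address, and the status function are hypothetical, and it ignores DNS caching and every other real-world detail. The point it illustrates is that the name servers themselves live at IP addresses, so once the routes to them are no longer announced, names stop resolving even though the web servers’ addresses may still be routable.

```python
import ipaddress

# Hypothetical layout using documentation-only address ranges.
DNS_PREFIX = ipaddress.ip_network("198.51.100.0/24")  # where name servers live
WEB_PREFIX = ipaddress.ip_network("192.0.2.0/24")     # where web servers live
NAMESERVER = ipaddress.ip_address("198.51.100.53")
RECORDS = {"www.example.com": ipaddress.ip_address("192.0.2.10")}


def has_route(address, announced):
    """True if any currently announced prefix covers the address."""
    return any(address in prefix for prefix in announced)


def resolve(name, announced):
    """The name server can only answer if a route to it is announced."""
    if not has_route(NAMESERVER, announced):
        return None
    return RECORDS.get(name)


def status(name, announced):
    """Summarize what a user would experience for `name`."""
    address = resolve(name, announced)
    if address is None:
        return "name does not resolve"
    return "reachable" if has_route(address, announced) else "no route"


before = {DNS_PREFIX, WEB_PREFIX}
after = before - {DNS_PREFIX}  # the route to the name servers is withdrawn

print(status("www.example.com", before))             # reachable
print(status("www.example.com", after))              # name does not resolve
print(has_route(RECORDS["www.example.com"], after))  # True: address still routable
```

In this model the web server never stops working; it just becomes impossible to look up where it is.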
From there, things only got worse. Engineers tried to reach the data centers to fix the problem, but it appears they were unable to do so. When employees swipe their keycards to enter Facebook’s buildings, the card reader checks the card against Facebook’s servers, which authorize entry. Because those servers were unreachable, engineers reportedly could not get into the facility to fix the problem.
“There are now people attempting to gain access to the peering routers in order to implement fixes, but people with physical access are separate from people with knowledge of how to authenticate to the systems and people who know what to do, so there is now a logistical challenge with getting all that knowledge unified.”
BGP activity was restored hours later, and DNS servers resumed translating domain names into IP addresses. Facebook’s services were down for around six hours, but the pain for employees will probably last much longer.