To understand why the impact was so big, and what steps companies can take to prevent disruptions like these in the future, it helps to take a step back and look at what cloud computing is, and what it’s good for.
So what is cloud computing and AWS?
Whenever you connect to anything over the internet, your computer is essentially just talking to another computer. A server is a type of computer that can process requests and deliver data to other computers on the same network or over the internet. But running your own server isn’t cheap. You have to buy the hardware, install it somewhere, and feed it a lot of power. In many cases, it needs internet connectivity too. Then, to ensure that data is sent and received with minimal delay, these servers need to be physically close to their users. Additionally, you have to install software that needs to be updated regularly. And you have to build fail-safe mechanisms that will switch operations over to another server if a main server malfunctions.

Every online service needs this same underlying infrastructure, even though the applications running on top of it differ. For example, the code running Netflix does something different compared to the code running a service like Venmo. The Netflix code is serving videos to users, and the Venmo code is facilitating financial transactions. But underneath, most of the computing work is actually the same. This is where cloud providers come in. They usually have hundreds to thousands of servers all over the country with good bandwidth. They offer to take care of the tedious tasks like security, day-to-day management of the data center operations, and scaling services when needed.
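The request-and-response pattern described above can be sketched in a few lines of Python. This is a hypothetical toy, not anything AWS runs: a "server" process answers requests, and a "client" (standing in for your computer) fetches data from it over the network, here on the same machine.

```python
# Minimal sketch of the client/server model: a server processes a request
# and delivers data back to the computer that asked. Hypothetical example;
# real services run this on dedicated machines in data centers.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server's job: process the request, deliver the data.
        body = b"hello from the server"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example's output quiet

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "client" talks to the server over the network.
with urlopen(f"http://127.0.0.1:{server.server_port}/") as resp:
    data = resp.read().decode()

print(data)  # -> hello from the server
server.shutdown()
```

Everything a cloud provider does (capacity, redundancy, security, updates) is in service of keeping that simple exchange fast and reliable at enormous scale.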
What went wrong with AWS on Dec. 7 and 15
The AWS outages appear to have been caused by errors in the automated systems that handle data flow behind the scenes. AWS explained in a post that the December 7 error was due to a problem with “an automated activity to scale capacity of one of the AWS services hosted in the main AWS network,” which resulted in “a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks.” Later, on December 15, a status update issued by AWS said that the outage was caused by “traffic engineering” incorrectly moving “more traffic than expected to parts of the AWS Backbone that affected connectivity to a subset of Internet destinations.”

Big data centers have lots of internet connections through different internet service providers. They get to choose where online traffic gets routed, whether it’s over one cable through AT&T, or another cable through Sprint. Their automated “traffic engineering” decides to reroute traffic based on a number of conditions. “Most providers are going to reroute traffic mostly based on load. They want to make sure things are relatively balanced,” Sherry says. “It sounds like that auto-adaptation failed on the 15th, and they wound up routing too much traffic over one connection. You can literally think of it like a pipe that has had too much water and the water is coming out the seams.” That data ends up getting dropped and disappearing.

Despite some notable outages over the past few years, Sherry argues that AWS is “quite good at managing their infrastructure.” It is inherently very difficult to design perfect algorithms that can anticipate every problem, and bugs are an annoying but regular part of software development.
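The load-based rerouting Sherry describes, and the overflowing-pipe failure mode, can be sketched as a toy model. The link names, capacities, and traffic amounts below are made-up numbers for illustration; real traffic engineering weighs many more conditions than load alone.

```python
# Toy sketch of load-based "traffic engineering": each chunk of traffic
# is steered onto the least-loaded link, and anything beyond a link's
# capacity is dropped -- the pipe overflowing "out the seams."
capacities = {"link_a": 100, "link_b": 100}   # units of traffic per tick
load = {"link_a": 0, "link_b": 0}
dropped = 0

def route(amount):
    """Send `amount` of traffic over the currently least-loaded link."""
    global dropped
    link = min(load, key=load.get)            # balance based on load
    load[link] += amount
    overflow = max(0, load[link] - capacities[link])
    if overflow:                              # pipe is full: data is lost
        load[link] = capacities[link]
        dropped += overflow

for flow in [60, 60, 60, 60]:                 # more demand than total capacity
    route(flow)

print(load)     # -> {'link_a': 100, 'link_b': 100}  (both pipes full)
print(dropped)  # -> 40  (traffic that came out the seams)
```

A bug in the balancing step, the `min(...)` choice here, is the kind of failure that would pile too much traffic onto one link, which is roughly what AWS reported on December 15.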
“The only thing that’s unique about the cloud situation is the impact,” she says: a growing number of independent companies are turning to third-party centralized services like AWS for cloud infrastructure, storage, and more.
Back to basics?
When AWS went out, Sherry could not control her television. Normally, she uses her phone as a remote control. But the phone does not talk directly to the TV. Instead, both the phone and the TV talk to a server in the cloud, and that server orchestrates the in-between. The cloud is essential for some functions, like downloading automatic software updates. But for scrolling through cable offerings available from an antenna or satellite, “there’s no reason that needs to happen,” she says. “We’re in the same room, we’re on the same wireless network, all I’m trying to do is change the channel.”

The cloud can offer convenient tech solutions in some instances, but not every application needs it. One marooned technology that struck her as an unnecessarily roundabout design was a timed cat feeder that had to go through the cloud. Automated cat feeders have been around since before the cloud; they’re basically paired to an alarm clock. “But for some reason, someone decided that rather than building the alarm clock part into the cat feeder, they were going to put the alarm clock feeder in the cloud, and have the cat feeder go over the internet and ask the cloud, is it time to feed the cat?” Sherry says. “There’s no reason that that needed to be put into the cloud.”

Moving forward, she thinks that application developers should review every feature that’s intended for the cloud and ask whether it can work without the cloud, or at least offer an offline mode that isn’t completely debilitated during an internet, data center, or even power outage. “There are other things that are probably not going to work. You’re probably not going to be able to log in to your online banking if you can’t get to the bank server,” says Sherry. “But so many of the things that failed are things that really should not have failed.”
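Sherry’s cat-feeder point is that the “is it time to feed the cat?” question is just an alarm-clock comparison, which can run locally on the device with no network at all. A minimal sketch, with a made-up feeding schedule:

```python
# Local alarm-clock check for a hypothetical cat feeder: no internet,
# no cloud round trip, so it keeps working during an outage.
from datetime import time

FEEDING_TIMES = [time(7, 0), time(18, 0)]   # hypothetical schedule

def should_feed(now, tolerance_minutes=1):
    """Return True if `now` is within a minute of a scheduled feeding."""
    minutes_now = now.hour * 60 + now.minute
    return any(
        abs(minutes_now - (t.hour * 60 + t.minute)) <= tolerance_minutes
        for t in FEEDING_TIMES
    )

print(should_feed(time(18, 0)))   # -> True (dinner time)
print(should_feed(time(12, 30)))  # -> False
```

A cloud-dependent design replaces that local comparison with a network request, so the feeder fails whenever the internet, the data center, or the power between them does. An offline mode amounts to keeping a check like this one on the device as a fallback.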