It wasn't funny for anyone who couldn't access regular online destinations, or for the engineers trying to fix the problems, but Monday's massive Amazon Web Services outage was something of a comedy of errors.
In a dense, detailed note posted after all the issues had been settled, AWS explained how the series of events unfolded and what it's planning to do to prevent similar collapses in the future.
The outage rendered huge portions of the internet unavailable to many people for much of the workday. As Monday rolled along, it affected more than 2,000 companies and services, including Reddit, Ring, Snapchat, Fortnite, Roblox, the PlayStation Network, Venmo and Amazon itself, as well as critical services such as online banking and household amenities such as luxury smart beds.
In its explainer post, AWS apologized for the breakdown's impact, saying, "We know how critical our services are to our customers, their applications and end users, and their businesses."
Why were so many sites affected?
AWS, a cloud services provider owned by Amazon, props up huge portions of the internet. When it went down, it took many of the services we know and rely on down with it. As with the Fastly and CrowdStrike outages of the past few years, the AWS outage shows just how much of the internet relies on the same infrastructure -- and how quickly our access to everyday sites and services can be cut off when something goes wrong.
Our reliance on a small number of big companies to underpin the web is akin to putting all our eggs in a handful of baskets. When it works, it's great, but it takes only one small failure to bring the internet to its knees in minutes.
In total, outage reporting site Downdetector saw over 9.8 million reports, with 2.7 million coming from the US, over 1.1 million from the UK, and the rest largely spread across Australia, Japan, the Netherlands, Germany and France. Over 2,000 companies in total were affected, with around 280 still experiencing issues at 10 a.m. PT. (Downdetector is owned by the same parent company as CNET, Ziff Davis.)
"This kind of outage, where a foundational internet service brings down a large swath of online services, only happens a handful of times in a year," Daniel Ramirez, Downdetector by Ookla's director of product, told CNET. "They probably are becoming slightly more frequent as companies are encouraged to completely rely on cloud services and their data architectures are designed to make the most out of a particular cloud platform."
How does AWS explain the outage?
Much of the blame goes to automated systems that either slipped up or did exactly what they were supposed to do -- and, in doing so, knocked things off track again.
"The incident was triggered by a latent defect within the service's automated DNS management system that caused endpoint resolution failures for DynamoDB," AWS wrote. DNS stands for domain name system and refers to the service that translates human-readable internet addresses (for example, CNET.com) into machine-readable IP addresses that connect browsers with websites. DynamoDB is a database service.
When a DNS error occurs, the translation process cannot take place, interrupting the connection. DNS errors are common internet roadblocks, but usually happen on a small scale, affecting individual sites or services. Because the use of AWS is so widespread, a DNS error can have equally widespread results.
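At a very small scale, the failure mode looks like this. The sketch below is generic Python using only the standard library; the hostnames are examples, not the actual AWS endpoints involved in the outage.

    # Minimal illustration of DNS resolution and what a failure looks like.
    # These hostnames are examples, not the AWS endpoints from the outage.
    import socket

    def resolve(hostname: str) -> None:
        try:
            ip = socket.gethostbyname(hostname)  # DNS lookup: name -> IP address
            print(f"{hostname} resolves to {ip}")
        except socket.gaierror as err:           # raised when resolution fails
            print(f"DNS lookup failed for {hostname}: {err}")

    resolve("www.cnet.com")      # normally succeeds
    resolve("dynamodb.invalid")  # reserved .invalid TLD never resolves, simulating a bad record

When the second lookup fails, nothing downstream of it -- the connection, the request, the page -- can happen, which is the pattern that played out across AWS-hosted services on Monday.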
The internet came to its knees, with many sites reporting outages early Monday, according to Downdetector. (Downdetector/Screenshot by CNET)
In Monday's outage, AWS said, the root cause was a scenario known as a "race condition": multiple components and processes designed to fix problems were competing with one another, and their well-intentioned but uncoordinated efforts kept undoing each other's work.
For instance: "The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing. Therefore, this did not prevent the older plan from overwriting the newer plan."
Timing and missed opportunities were also factors for an AWS resiliency system called the Network Load Balancer. This system routes traffic to functioning nodes -- though on Monday, some of them weren't yet ready. It ties into a separate network health check system, which was experiencing its own failures as an increased workload caused it to degrade.
"This meant that in some cases, health checks would fail even though the underlying NLB node and backend targets were healthy," AWS wrote. "This resulted in health checks alternating between failing and healthy."
The findings have spurred the cloud computing platform to make changes, including:
- AWS has disabled some automation. Before reenabling, it "will fix the race condition scenario and add additional protections to prevent the application of incorrect DNS plans."
- AWS is adding a "velocity control mechanism" to limit health check failures.
- AWS plans to improve a throttling mechanism to "limit incoming work based on the size of the waiting queue to protect the service during periods of high load." (A rough sketch of that idea follows this list.)
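AWS hasn't published implementation details for that last item, but limiting work by queue depth is a common pattern. Here's a hypothetical Python sketch of the general idea, with the threshold chosen arbitrarily for illustration:

    # Hypothetical queue-depth throttling: reject new work when the backlog is
    # already deep, so the service can catch up instead of collapsing.
    # The threshold and queue here are illustrative, not AWS's.
    from collections import deque

    MAX_QUEUE_DEPTH = 100  # arbitrary limit for this example
    waiting = deque()

    def submit(request: str) -> bool:
        if len(waiting) >= MAX_QUEUE_DEPTH:
            return False          # throttle: shed load while the backlog is deep
        waiting.append(request)
        return True

    accepted = sum(submit(f"req-{i}") for i in range(150))
    print(f"accepted {accepted} of 150 requests, {len(waiting)} still queued")
    # accepted 100 of 150 requests, 100 still queued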
How the outage unfolded
AWS first registered an issue on its service status page just after midnight PT on Monday, saying it was "investigating increased error rates and latencies for multiple AWS services in the US-East-1 Region." Around 2 a.m. PT, it had identified a potential root cause of the issue. Within half an hour, it had started applying mitigations that were resulting in significant signs of recovery.
That seemed like a good sign. AWS said at 3:35 a.m. PT: "The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now."
But even though the issues seemed to have been largely resolved as the US East Coast came online, outage reports spiked again dramatically after 8 a.m. PT, when the workday began on the West Coast.
The AWS outage first peaked before dawn Monday in the US, then subsided and surged again around midday. (Downdetector/Screenshot by CNET)
As of 8:43 a.m. PT, the AWS status page showed the severity as "degraded" and offered this brief description: "The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers."
Also at that time, AWS noted: "We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations." (EC2 is AWS shorthand for Amazon Elastic Compute Cloud, a service that it says "provides secure, resizable compute capacity in the cloud.")
Amazon didn't respond to a request for further comment beyond pointing us back to the AWS health dashboard.
Around the time that AWS says it first began noticing error rates, Downdetector saw reports begin to spike across many online services, including banks, airlines and phone carriers. As AWS dealt with the issue, some of these reports saw a drop-off even as others had yet to return to normal.
Around 4 a.m. PT, Reddit was still down, and services including Ring, Verizon and YouTube were still experiencing significant issues. According to its status page, Reddit finally came back online around 4:30 a.m. PT, which CNET verified.
As of 3:53 p.m. PT, Amazon declared that the problems were resolved.
What else should we know?
According to Amazon, Monday's issue was geographically rooted in its US-East-1 region, which refers to an area of northern Virginia where many of its data centers are based. This region is a significant location for Amazon and many other internet companies, and it supports services spanning the US and Europe.
"The lesson here is resilience," said Luke Kehoe, industry analyst at Ookla. "Many organizations still concentrate critical workloads in a single cloud region. Distributing critical apps and data across multiple regions and availability zones can materially reduce the blast radius of future incidents."
Although DNS issues can be caused by malicious actors, that was not the case with Monday's AWS outage.
Technical faults can, however, allow hackers to look for and exploit vulnerabilities when companies' backs are turned and defenses are down, according to Marijus Briedis, CTO at NordVPN.
Briedis added that when an outage occurs, you should look out for scammers hoping to take advantage. You should also be extra wary of phishing attacks and emails telling you to change your password to protect your account.
"This is a cybersecurity issue as much as a technical one," he said in a statement. "True online security isn't only about keeping hackers out, it's also about ensuring you can stay connected and protected when systems fail."











































