In my previous post, we covered the concept of software dependencies and how they often cause “monsters” (read: unexpected and unpredictable issues) in cloud production. The “monsters” we described are problems such as sudden outages and unpredictable behaviors that can be very frustrating to developers and customers alike.
This post discusses what to do about these “monsters.” Because they’re outside the development team’s control, remediating these issues requires an out-of-the-box perspective.
Usually, when we see something going wrong in production, our first instinct is to go back and fix the application’s code. But this way of thinking won’t help remediate the root problems behind unpredictable dependency “monsters.”
Picture a code path as a flight. It’s a transaction that runs from point A to B. If we think in these terms, code correctness does matter because the flight needs to get to its destination on time, without crashing or landing in the wrong place!
But our plane isn’t the only one in the air. Any trip to the airport shows dozens of airplanes taking off, landing and crisscrossing the sky every day. Dealing with those other planes is an entirely separate concern for our flight, which is why aviation relies on radar and air traffic control.
When we use this analogy, we can split our application runtime issues into two categories: “cockpit” concerns that relate to getting the transaction to destination on time and “airspace” concerns that relate to ensuring the success of our transactions in the face of other processes, dependencies, continuous change and capacity issues.
Many of the problems we see in production are caused by “other planes” that share our airspace. As a result, it’s not enough to perfect the “cockpit” concerns. If we fail to address the “airspace” concerns, we expose ourselves to “noisy neighbors,” cascading failures and other ripple effects.
To put the airplane analogy into context, we need to realize that just going back and fixing code doesn’t solve the growing “airspace” problems in cloud production. Cloud production issues are complex, large-scale events outside the bounds of the application itself, so there is no single line of code to “fix.”
Moreover, cloud production issues are caused by a chain reaction of many factors, which makes them volatile and unpredictable. Suppose we went back and fixed some code in an attempt to remediate an issue. By the time we had resolved any single cause, the event would already have played out and done its damage. And it is rarely one isolated reason that causes an outage; we usually see the outage and its damage arise from a confluence of circumstances, so searching for a root cause becomes a wild goose chase. Even if we got lucky and managed to fix the contributing problems, there is still no guarantee that our fix would prevent similar events in the future.
Ultimately, these “monsters” come from a very specific interplay of service behaviors in production (out in “airspace”). So this means that they’re virtually impossible to assess correctly from within a service’s thread of execution.
As I’ve tried to show, “monsters” happen due to the explosion of dependencies we’re seeing in modern cloud operations. Because our cloud environments are so highly connected, many parts could break at any time. And, because we have so many dependencies, any undesirable behavior of one system has the potential to affect many connected systems. It’s like dominoes. Finally, because these behaviors are mostly unpredictable, they need to be detected quickly and responded to rapidly.
The Amazon Web Services outage in December 2021 served as an example of what can happen when the pathologic behavior of one part of the system (the “thundering herd” on the internal network) causes issues that ripple through connected systems.
The outage, which lasted for almost 7 hours, was explained in a post-mortem message by AWS: “An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network.”
Because of the interconnected nature of AWS’s networks, the failure ended up creating a “large-scale event” for the business.
“So, today we learned that the AWS outage was due to a ‘normal’ activity that triggered an ‘unexpected behavior’ in a large number of services, which ‘overwhelmed’ the network devices that link two networks. What can we possibly learn from this? 1/ https://t.co/zO2V2OhdVX” — tkunze (@tkunze), December 11, 2021
“Thundering herds” are a prime example of pathologic, disruptive behaviors. Simply put, the term describes a large number of connected services all waiting on the same condition. When that condition suddenly becomes true for all of them at once, they all attempt to do the same thing at the same time, overloading the environment and bringing it down.
It is important to note that each service does the right thing from a code path perspective. Every service has been patiently waiting for the condition to come true, and individually, they each react correctly. The real problem lies in the lack of airspace control. “Thundering herds” can be prevented if some kind of “airspace control” is in place to quickly detect the issue and then respond to it with the proper runtime controls to avert degradation and, in the case of AWS, outage.
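One classic client-side way to blunt a thundering herd is to desynchronize the reactions: instead of every service retrying the moment the condition comes true, each one waits a random slice of an exponentially growing window. The sketch below (a generic illustration in Python, not AWS’s mechanism; the function name and parameters are my own) shows “full jitter” backoff:

```python
import random
import time

def retry_with_jitter(operation, max_attempts=5, base=0.5, cap=30.0):
    """Retry `operation` with exponential backoff and full jitter.

    Each client sleeps a random duration within an exponentially
    growing window, so retries from many clients are spread out in
    time instead of arriving in one synchronized wave.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            window = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, window))
```

Note that jitter only mitigates the herd from inside each client’s “cockpit”; it does not replace the airspace-level detection and control the post argues for.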
Without runtime control, you are flying blind.
Had AWS had the capability to detect the issue and then trigger backpressure against the services on the internal network in response, the outage could likely have been avoided.
Proper runtime control enables developers to:
Holistically measure application and service behaviors at scale.
Regulate behaviors and control “monsters” by reactively responding to them.
Proactively prevent issues before they arise.
Calibrate the interactions between your application and its dependencies.
Optimize for business goals such as performance, service level objectives (SLOs), cost, etc.
Create holistic visibility, strong reliability, enforced security, and predictable performance.
At Glasnostic, we are laser-focused on enabling every developer to keep their production “airspace” under control, no matter who deploys what and when. Click here to try it for free.