There is an explosion of dependencies in today’s cloud native world. Our applications are no longer stand-alone. As it turns out, the valuable data is always outside our applications, so they connect to many other services. Some of these services may be in-house; others are third-party dependencies: partner APIs or cloud services such as managed databases, machine learning endpoints or any number of other services.
These dependencies create quite a bit of unpredictability in modern production environments. In the “old world” of traditional, stand-alone applications, we mostly only had to deal with code defects—bugs, as it were. In today’s cloud native world, however, we not only have to make sure our code is correct but also have to deal with performance degradations, reliability issues and security breaches in any of our dependencies. As a result, the overall landscape is prone to cascading chains of events that can ripple through it—often rapidly and always unexpectedly.
To developers and DevOps engineers, these chains of events can feel like “monsters”: creatures they’ve never seen before, that appear unexpectedly, and are large-scale and vicious.
Hunting down these “monsters” is frustratingly time-consuming: You may start by looking into the changes in a recent deployment. This will have you checking logs, maybe adding print statements and redeploying, or using more sophisticated log aggregation tools. Of course, logs are rarely conclusive, so you’ll end up getting in touch with other teams to learn what they see, then looking at more logs and more monitoring data. Then rinse and repeat for any other service that may be implicated, all without a clear path forward for hours, if not days. That’s a terrible Mean Time to Remediation (MTTR).
Slaying such systemic “monsters” is all the harder because their large-scale nature often requires collaboration across team boundaries and because their propagation along dependency links often means that there is no single, obvious part of code that could be changed to remediate the situation.
The October 28, 2021, Roblox outage serves as a vivid example of how dependencies can create such “monsters.”
The game was down for 73 hours, owing to a load issue that quickly spiraled out of control and brought a large number of services down. As the October 31 update put it, “A core system in our infrastructure became overwhelmed [with the result] that most services at Roblox were unable to effectively communicate and deploy.”
Of course, what actually unfolded and the heroic engineering effort that went into analyzing the incident was much more complicated. But, as we’ve said before, the story arc of every outage is always the same: “normal event meets conspiracy of factors to unexpectedly exceed a hitherto unknown limit, and chaos ensues.” Software at or beyond the limit always behaves pathologically. What matters in this incident is that a highly connected (core) system of the infrastructure got overloaded and that the operations team did not have the ability to detect and respond to it in time. Had they detected the load issue quickly and had they been able to exert backpressure right there, none of this would have happened.
The high degree of interdependence between systems turned a simple capacity issue into a “monster” of an unpredictable domino effect that required discovery across team boundaries and ultimately was not fixable in code. What was needed instead was the ability to quickly detect the load issue and rapidly respond to it in production by exerting backpressure.
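To make the remediation concrete: exerting backpressure means an overloaded service refuses (or delays) new work once it is at capacity, so the overload is signaled to callers instead of cascading downstream. The following is a minimal, hypothetical sketch of that idea, not a description of Roblox’s actual systems; the class name, capacity limit, and status strings are illustrative only.

```python
import threading


class BackpressureGate:
    """Reject new work once in-flight requests reach a fixed capacity.

    Shedding load at the overloaded service keeps the overload from
    rippling into every upstream caller.
    """

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: if the service is saturated, fail fast rather
        # than letting callers queue up indefinitely.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


def handle_request(gate: BackpressureGate) -> str:
    if not gate.try_acquire():
        # Signal backpressure to the caller, who can retry or degrade.
        return "503 Service Unavailable"
    try:
        return "200 OK"  # the actual work would happen here
    finally:
        gate.release()
```

The design choice that matters is rejecting early at a known limit: a fast “503” from one service is recoverable, while unbounded queueing is what turns a local capacity issue into a landscape-wide outage.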
When dealing with such unpredictable “monsters,” our first intuition is always to lean on tools and methods we already have and know. Unfortunately, these tools come from a world where we could still fix code to address production issues. They are unable to respond to dependency issues effectively and in time.
Investing in deeper observability. Piecing the “big picture” together from deep observability data is excruciatingly difficult. Observability can answer deep questions about how code executes and how assets perform but is unsuited to detecting large-scale behaviors that are rooted in the interactions between systems. Investing in more observability merely provides you with deeper data, not the data you need to rapidly detect and respond to large-scale issues.
Tracing. Manually instrumenting all services to trace code paths is time-consuming and difficult in today’s fast-paced development environments. Not all services can or should be instrumented, and as a developer, you don’t want to trace into a dependency you don’t own. Fundamentally, dependency-related events are environmental problems, not thread-of-execution problems.
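To illustrate what “manually instrumenting all services” entails, here is a minimal, hypothetical sketch of propagating a trace ID across one service hop using only the standard library. Real systems would use a framework such as OpenTelemetry, but the burden is the same: every inbound handler, every outbound call, and every service along the path needs equivalent changes, which is exactly why dependencies you don’t own remain blind spots. The header name and helper functions below are assumptions for illustration.

```python
import uuid
from contextvars import ContextVar

# Each service must thread a trace ID through every code path it owns.
current_trace_id: ContextVar[str] = ContextVar("current_trace_id", default="")


def start_trace() -> str:
    """Mint a new trace ID at the edge of the system."""
    trace_id = uuid.uuid4().hex
    current_trace_id.set(trace_id)
    return trace_id


def outgoing_headers() -> dict:
    # Every outbound call must forward the ID -- and the downstream
    # dependency must cooperate by reading it, which you cannot force
    # on a third-party service.
    return {"X-Trace-Id": current_trace_id.get()}


def handle_incoming(headers: dict) -> str:
    # Every inbound handler must extract the ID, or mint one if the
    # upstream caller was not instrumented.
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
    current_trace_id.set(trace_id)
    return trace_id
```

Even this toy version touches every request path in a service; multiplied across dozens of fast-moving services and uninstrumentable dependencies, it shows why tracing alone cannot catch environmental, cross-system “monsters.”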
“Slamming the brakes.” Slowing the pace of deployment down to avoid “monsters” is counter-productive. Speed is what your business wants! It wants to move fast, use new cloud services, utilize ML models to make smarter and better products, connect more devices, and reach more users.
As the Roblox outage illustrates, the production “monsters” that modern, cloud native landscapes tend to exhibit must be detected and responded to as quickly as possible to prevent them from cascading through the system. That means that DevOps and SRE teams need to invest in the ability to quickly detect such large-scale behaviors and to control them, directly and at runtime.
Glasnostic is a dedicated runtime control layer that aims to make this painless and effective. Use Glasnostic to get the holistic visibility, predictable performance, strong reliability and essential security you need to run modern cloud native environments.
Read the second part on why merely showing up with good code is no longer enough here.