Building resilient systems and running them reliably is hard, especially in cloud-native environments, where everything is ephemeral and very few assumptions can be made over longer stretches of time.
It used to be easier. Applications used to be self-contained, two- or three-tier architectures that could be provisioned with infrastructure and maintained over time with reasonable predictability, even once developers started to decompose them into microservices.
Enterprises today, however, are running service architectures that are vastly more complex than they were ten or even five years ago. Today’s architectures are highly interconnected landscapes of services that change continually and can evolve rather rapidly. What used to be an “application” is today a federation of services that depends on other federations, shared business services or SaaS offerings—and on which, in turn, other such “applications” depend.
The high-velocity, fully decoupled development that such service landscapes enable is hugely beneficial to the business, but the explosion of service communication they bring about also makes them prone to pathological and disruptive interaction behaviors among applications and services, such as noisy neighbors, feedback loops, thundering herds and other ripple effects.
Worse, these interaction behaviors are fundamentally unpredictable. This unpredictability makes it impossible to avert them with “better code” or with “operational excellence.” Provisioning right-sized capacities and managing support stacks for maximum uptime are powerless against the shockwave of a cascading failure rippling through the architecture, or even against a simple outage in an external dependency.
This problem of interaction behaviors among applications and services is massive and affects virtually every company today. We are running ever more services in ever more locations—on-premises, hybrid, cloud, multi-cloud, edge—and using ever more diverse technologies. At the same time, we treat an ever-growing number of services as software capabilities—services for other applications to build on top of. And, if interaction behaviors in service landscapes cannot be brought under control, they can severely limit the company’s ability to execute, stifle innovation and cause costs to spiral out of control.
Interaction behaviors in service landscapes are fundamentally unpredictable due to the non-linear nature of the chains of events that constitute a macro-behavior. A usage spike may meet an event backlog of another, connected application, which changes the fanout pattern of a service, which leads to a memory issue and subsequent thrashing of a cache, which causes a CPU issue in one place and a retry storm in another, only to culminate in a thundering herd.
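To see why such chains are non-linear, consider the retry storm alone. The following is a minimal sketch, with hypothetical numbers, of how naive client retries amplify the load offered to an already failing service—each retry wave is itself subject to the same failure rate, so the retry budget compounds geometrically:

```python
# Illustrative sketch: naive retries amplify load non-linearly.
# The figures are hypothetical; real failure rates vary with load itself,
# which is precisely what makes these chains hard to predict.

def offered_load(base_requests: int, failure_rate: float, max_retries: int) -> float:
    """Total requests a service sees when every failed request is retried."""
    load = 0.0
    wave = float(base_requests)
    for _ in range(max_retries + 1):  # initial attempt plus retries
        load += wave
        wave *= failure_rate  # only the failed fraction is retried
    return load

# At a 50% failure rate, a budget of 3 retries nearly doubles the load:
# 1000 + 500 + 250 + 125 = 1875 requests offered for 1000 intended.
print(offered_load(1000, 0.5, 3))  # 1875.0
```

Note that the amplification grows as the failure rate grows—exactly when the service can least afford the extra load.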
It is always a confluence of many factors that leads to a “black swan” event. Except that these events would be better called “black raven” events, as a seemingly infinite number of them is possible. After all, systems of reasonable complexity are always in some state of degradation. What most of these storms have in common, though, and what SRE teams tend to identify as their “root cause,” is that, again, many factors conspired to trigger a previously unknown and unexpected limit of some software component. SRE talks such as Laura Nolan’s “What Breaks Our Systems: A Taxonomy of Black Swans” are full of examples of incidents caused by limits: systems running out of file descriptors, connection pools getting exhausted, and so forth. The truth is that all systems are finite, especially if we pretend they are not.
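One way to keep such limits from becoming surprises is to make them explicit in code. Here is a hedged sketch—the pool size and API are illustrative, not taken from any particular library—of a connection pool that fails fast when exhausted, so callers can shed load or degrade gracefully instead of queuing up against a hidden limit:

```python
# Illustrative sketch: a resource limit made explicit and visible,
# rather than discovered the hard way during an incident.
import threading

class BoundedPool:
    """Hypothetical connection pool with an explicit, enforced limit."""

    def __init__(self, size: int):
        # BoundedSemaphore also guards against releasing more than was acquired.
        self._slots = threading.BoundedSemaphore(size)

    def acquire(self, timeout: float = 0.0) -> bool:
        # Returns False when the pool is exhausted instead of blocking
        # indefinitely -- exhaustion becomes a visible, handleable event.
        return self._slots.acquire(timeout=timeout)

    def release(self) -> None:
        self._slots.release()

pool = BoundedPool(2)
print(pool.acquire(), pool.acquire(), pool.acquire())  # True True False
pool.release()
```

Making the limit explicit does not make the system infinite, but it turns a mystery outage into an observable signal.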
These interaction behaviors complicate operations considerably. We can no longer simply provision machines with enough CPU and memory, trust the code we deploy works and feel confident that we’ll hit the same “nines” as our underlying infrastructure promises. The problem is that, in a service landscape, the stability of the service we are responsible for depends on the “good behavior” of many other components, which we may not have control over.
In the past, an operations team would run, say, five Java applications. And if an application didn’t work, the ops team would go in and shut it down or stand it up somewhere else.
However, trying to ensure the stability and uptime of a service in a service landscape by provisioning it with enough capacity and a solid software base is akin to a kindergarten teacher provisioning her class with enough room and then leaving the children to their own devices. And as we learned in Kindergarten Cop: “Kindergarten is like the ocean. You don’t want to turn your back on it!”
But “fixing” the unpredictability of interaction behaviors in code doesn’t work either. Because code is written ahead of time, it can only prevent predictable events. As a result, whatever fix we release in code or configuration today ultimately only adds another limit to trigger a future “black raven.” Also, because interaction behaviors in service landscapes are large-scale and fixes tend to be very local affairs, their effect on the whole will more likely than not be homeopathic.
Lastly, ever more and ever deeper observability is of no use either. It simply takes too long to run full diagnostics on large-scale, non-linear chains of events. Unpredictable interaction behaviors are, by definition, unexpected and require an immediate response—in real time.
As operators, we need to acknowledge that today’s cloud-native architectures are inherently unstable and prone to unpredictable interaction behaviors. These behaviors are not even “black swans.” They are “black ravens,” and there is nothing we can do on the development or the operations side to prevent them. They must be controlled.
The chaos will continue to stifle companies’ ability to innovate until the kindergarten teacher returns.
But it is critical that the kindergarten teacher manages the situation as a whole and does not fixate on individual parts. This is a hard thing for us as engineers to swallow. In a federated landscape of services, the individual service can’t save the federation. There are no heroes in IT. We must control the entire landscape, not just the parts we care most about.
Done right, operational control of interaction behaviors consists of three tasks. Operators need to:
Observe what is going on—at the global level, not the level of individual parts.
Remediate sudden degradations and “black ravens” without falling for a “root cause” analysis.
Actively create predictability by applying proactive policies that help stabilize the landscape.
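As one concrete illustration of such a proactive policy, consider a circuit breaker: when a dependency degrades past a failure threshold, calls to it are shed for a cooldown period instead of being retried into a storm. The sketch below is a minimal, hypothetical version—thresholds, names and the probing strategy are illustrative, not a reference implementation:

```python
# Illustrative sketch of a proactive stabilizing policy: a minimal
# circuit breaker that sheds load when a dependency degrades.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit opened

    def allow(self) -> bool:
        # While the circuit is open, reject calls until the cooldown elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return False
            self.opened_at = None  # half-open: let a probe through
            self.failures = 0
        return True

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker(failure_threshold=2, cooldown=30.0)
breaker.record(False)
breaker.record(False)   # threshold reached: the circuit opens
print(breaker.allow())  # False -- calls are shed, not retried
```

The point of such a policy is predictability: a degraded dependency produces a bounded, known failure mode for its callers rather than an unbounded, rippling one.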
Applying management controls to interaction behaviors in this way allows the business to fully realize the promise of the cloud: to build, deploy and run more services, faster, in more places and with more technologies. It allows operations and SRE teams to manage outages effectively, build resilience proactively and manage the digital experience. This, in turn, unblocks innovation, allowing developers to deploy at lower levels of maturity and directly to production. All the while, costs remain contained.