Illustration of air traffic
Controlling air traffic behaviors (Source: nats.aero)

Simplifying the “Ops Equation”

From Applications to Services

Back in the day, an “application” was a well-structured, comprehensible affair. It was self-contained software deployed on target hardware to implement the desired functionality. That code was an “application” because it—quite literally—allowed the hardware to be applied to a real-world problem.

We’ve lived with this “application as application-of-hardware” model ever since, even after the specifics of hardware ceased to matter and we started to provision infrastructure through APIs. The “ops equation” has always been: capacity plus healthy infrastructure and up-to-date software equals a happy system.

This application model is gone. While creating and running applications used to be difficult, time-consuming and expensive, this is no longer the case in our software-first world. More software is produced today than ever, by more teams, more rapidly, in smaller chunks—and with APIs. As a result, the application layer has become a continually growing and rapidly evolving landscape of connected applications, microservices and shared services that may run anywhere: on premises, in hybrid setups, in the cloud, across multiple clouds and at the edge.

This is great for the business. Creating new software capabilities on top of and next to existing capabilities provides it with the agility it craves and allows it to accelerate innovation.

The Trouble with Cloud-Native Service Landscapes

But these modern, complex and dynamic environments are difficult to operate. Their applications and services are no longer self-contained. They rely on dependencies, many of which are shared with other services, and are subject to an unknown number of consumers. The complexity of these environments also means they are virtually always in some state of degradation. This makes them prone to unpredictable behaviors at the application level, which can spiral out of control and bring the business down.

In fact, all of the 2020 outages we reviewed recently occurred due to unpredictable behaviors at the application level. It is counter-intuitive, but the details of outages are merely distractions. The story arc of every outage is always the same: “Innocuous event meets conspiracy of circumstances and unexpectedly causes an unknown limit to be exceeded. Mayhem ensues.”

“Story of every outage: innocuous event meets conspiracy of circumstances, exceeds unknown limit, drama ensues.”

Of course, not every degradation becomes an outage. But degradations and outages alike result from the same underlying application behaviors. These behaviors are large-scale in nature and affect many services. They are always environmental behaviors, arising outside of any specific thread of execution, and appear as noisy neighbors, thundering herds, retry storms, feedback loops, random ripple effects and cascading failures. They are also highly unpredictable and tend to occur without notice. And because different components react differently to changes in interaction behaviors, they are non-linear: what is a CPU issue in one component becomes a latency issue in another, and so forth.

Photograph of starlings murmuring (source: nationalgeographic.org and unsplash.com/@alanrobertjones)
Application-level behaviors in nature.

Because the health of the application layer no longer depends on the infrastructure layer but instead on the health of its complex system of dependencies, the conventional “ops equation” no longer works. We can no longer assure a healthy environment by merely provisioning capacity and keeping infrastructure and software maintained. For instance, when Coinbase’s frontends were flooded in June, the standard operations response of provisioning more frontends failed because the application behavior was faster than the speed with which frontends could scale out. Similarly, the thundering herd that brought down Robinhood’s DNS in March was not controllable with conventional operational techniques.

The New Normal: Application Behaviors

Application behaviors are the key determining factor for operational success in a modern service landscape. And they are here to stay. We have moved to cloud-native architecture and, ultimately, service landscapes because the business needs the agility they provide. There is no going back to stand-alone, monolithic applications with their long development cycles. And because service landscapes inevitably come with unpredictable application-level behaviors, these behaviors are inescapable, too: they are a fundamental aspect of modern, complex and dynamic service landscapes.

Without proper control, application behaviors can quickly get out of hand, causing severe degradations or even outages. Yet, because they are unpredictable, they can’t be prevented with “better code.” And, as we’ve seen, they can’t be prevented with conventional operational control either.

Bringing these behaviors under control is the defining problem for operators today.

Cloud-Native Traffic Control

But how can they be brought under control? The answer to this question comes in two parts. First, we need to be able to observe behaviors. This means we need to see what’s running. We need to see every endpoint that participates in the service landscape. Since this will likely include external as well as internal endpoints, we can’t rely on agents to make them visible. And once we see what is running, we need to be able to observe how services interact. We need to measure these interactions based on a set of golden signals that are universally meaningful, such as number of requests, latency and bandwidth.
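To make the idea of observing interactions a little more concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each observed interaction is available as a record of source, destination, latency and payload size; the record layout, the `InteractionStats` class and the `summarize` function are hypothetical names of ours, not part of any particular product. The sketch aggregates the three golden signals mentioned above (number of requests, latency and bandwidth) per source-destination pair, regardless of whether an endpoint is internal or external.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class InteractionStats:
    """Golden signals for one source -> destination pair (hypothetical structure)."""
    requests: int = 0
    total_latency_ms: float = 0.0
    bytes_transferred: int = 0

    @property
    def mean_latency_ms(self) -> float:
        # Avoid division by zero for pairs without any observed requests.
        return self.total_latency_ms / self.requests if self.requests else 0.0


def summarize(records):
    """Aggregate (source, destination, latency_ms, size_bytes) records per interaction."""
    stats = defaultdict(InteractionStats)
    for source, destination, latency_ms, size_bytes in records:
        entry = stats[(source, destination)]
        entry.requests += 1
        entry.total_latency_ms += latency_ms
        entry.bytes_transferred += size_bytes
    return dict(stats)


# Made-up sample records; the last destination stands in for an external endpoint.
records = [
    ("frontend", "orders", 12.0, 512),
    ("frontend", "orders", 18.0, 1024),
    ("orders", "payments.example.com", 40.0, 256),
]
summary = summarize(records)
for (source, destination), entry in summary.items():
    print(source, "->", destination, entry.requests, entry.mean_latency_ms, entry.bytes_transferred)
```

The point of keying the statistics by interaction rather than by service is that the same three numbers remain meaningful for every pair of endpoints, whatever the services on either end happen to do.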

Second, we need to be able to control interaction behaviors based on these golden signals so we can prevent disruptions, remediate unpredictable failures, assure the digital experience and manage costs. In short, we need to detect and shape application-level behaviors, actively and in real-time, as they happen.
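As a rough illustration of what shaping an interaction based on a golden signal can look like, the following sketch limits the number of requests a source may send to a destination using a token bucket. The token bucket is a generic, well-known rate-limiting technique that we have swapped in for illustration; it is not meant to describe how any specific control plane works. The `TokenBucket` class, its parameters and the chosen numbers are assumptions, and time is passed in explicitly so the example stays deterministic.

```python
class TokenBucket:
    """Generic token-bucket limiter (illustrative sketch, hypothetical names)."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate_per_s = rate_per_s  # sustained requests per second allowed
        self.burst = burst            # maximum number of requests allowed at once
        self.tokens = float(burst)    # start with a full bucket
        self.last = 0.0               # timestamp of the previous decision, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may pass."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Assumed policy: at most 2 requests per second with bursts of up to 3.
bucket = TokenBucket(rate_per_s=2, burst=3)
spike = [bucket.allow(now=0.0) for _ in range(5)]   # sudden spike of 5 requests
later = [bucket.allow(now=1.0) for _ in range(3)]   # 3 more requests one second later
print(spike, later)
```

In this sketch, the first three requests of the spike pass and the remaining two are held back; one second later the bucket has refilled enough for two more. A policy like this can be applied to an interaction from the outside, without changing the code of the services involved.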

Glasnostic control plane screenshots
Observing a denial-of-service attack (1) and remediating it in real-time (2) using Glasnostic.

Although application behaviors are a fundamental aspect of cloud-native environments, they can’t be controlled adequately within the conventional “ops equation” of resource provisioning, infrastructure health and software maintenance. As a result, it’s best to separate them from traditional operations and control them independently. This allows operators to focus their conventional work on improving resource provisioning, infrastructure health and software maintenance without having to also address application behaviors.

“Remove application behaviors from the ‘ops equation’ and control them independently.”

In many ways, controlling application behaviors is very much like controlling air traffic. Back when aviation was new, and there were few flights, it was enough to leave the flying to individual pilots. Now, with thousands of aircraft in the air at any time, flights need to be managed independently. Air traffic control discovers aircraft by transponder and call sign and then observes flight behavior based on the golden signals of position, altitude, direction and speed. This enables controllers to shape flight behaviors and, thus, to prevent collisions, react to emergencies, assure a safe flight experience and manage cost. In short, air traffic control detects and shapes flight paths, actively and in real-time, just like modern operators detect and shape application behaviors.

Like control of application behaviors—“cloud traffic control,” if you will—air traffic control is powerful because it is an independent concern. Air traffic control is eliminated from the “flight equation,” thus freeing pilots from having to assure the safety of the airspace and allowing them to focus on running the flight as safely and efficiently as possible. This, in turn, allows for more efficient use of the airspace and more flights to be carried out, more flexibly.

Summary

We live in a new world of rapidly growing landscapes of connected, cloud-native applications and services that may run anywhere, across clouds, on premises and at the edge. The benefits such service landscapes bring for the business in terms of agility and innovation make the transition to them inescapable and irreversible. Service landscapes are here to stay. But the interconnected nature and rapid evolution of service landscapes also make them prone to unpredictable behaviors at the application level, some of which can turn into “black swans” and bring the business down. Keeping these behaviors under control is the defining problem for operators today.

Unfortunately, the systemic and unpredictable nature of these behaviors makes it impossible to control them with “better code.” Similarly, because behaviors are not due to issues with the application itself but rather due to its dependencies, we also can’t control them with conventional operations. As a result, we need to separate the control of application behaviors from the conventional “ops equation” and manage them independently. This management requires observability at the interaction level and control based on golden signals, in real-time.

Separating the control of application behaviors results in an environment that can be operated with excellence. Operators are freed from having to address application behaviors with conventional operational means and can therefore focus on further improving operational efficiency. Developers are freed from having to wait on “ops” approval and thus can deploy more services more rapidly, knowing that the operations team is able to actively assure the digital experience, in real-time.