The resilience of a distributed microservice application depends fundamentally on how gracefully it adapts to the environmental degradations and service failures that are certain to occur. Testing how such applications behave under various failure scenarios is therefore not merely good practice but essential.

Because distributing an application across hosts introduces a crucially unreliable factor—the network—most failure testing scenarios justifiably focus only on network-related characteristics such as unexpected latencies or communication errors like timeouts. However, when applications are distributed, communication issues are not the only causes of failure. Capacity and performance mismatches between individual hosts, nodes or containers can wreak significant havoc as well. And, as an architecture progresses from a single, distributed microservice application to a set of applications built on top of shared services, and finally to a continually evolving, growing organic architecture, such mismatches become ubiquitous and the primary sources of failure.

In this post, we explore what chaos engineering is and why it is an essential practice for distributed applications. We then turn to more advanced, organic microservice architectures and look at how their failure management needs can be better served by a cloud traffic controller.

The Trouble with Distributed Applications

Creating a distributed application is hard, particularly if it is designed to execute large-scale transactions and perform consistently at the upper end of what its components can support. Worse, the necessary tight coupling between the components that such applications require is inherently brittle and thus not only expensive to create, but even more costly to maintain.

This brittleness is not the result of component logic that is difficult to test. Functional tests are straightforward to create and automate at the unit and integration testing stages, but the non-functional behaviors of distributed applications are not only out of the architect’s control: they can’t even be tested without accurately simulating the actual application, at relevant scale and with realistic traffic.

Furthermore, setting up a staging environment that is a meaningful twin of the production environment is a massive project at scale, in particular if the production environment is dynamic, that is, if it changes often. And, if microservices are “done properly,” the topology of the application, the dependencies between its services and their interaction behaviors are going to evolve on a daily basis, if not continually. In such environments, the test environment is often no longer a meaningful representation of the production environment by the time the first chaos experiment is run.

As a result, it is vital to test in production.

Of course, the sheer number of tests that need to be written, run, analyzed and maintained to account for new services, previously unknown interdependencies and interactions is daunting—a burden that increases as a service architecture scales and gains complexity.

Making Distributed Applications More Resilient with Chaos Engineering

“Chaos engineering” refers to the methodical design and implementation of experiments to test assumptions about—and discover previously unknown behaviors of—distributed service architectures. The premise is straightforward: since load spikes, errors and slowness are bound to happen, it is best to test our assumptions about how the architecture will behave when these degradations and failures occur. The ultimate goal with chaos engineering is to verify that the system will behave as expected during adverse conditions, and in case it won’t, to learn what steps can be taken to make it more resilient.
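
The premise can be illustrated with a minimal sketch of such an experiment. Everything named here (`with_faults`, `fetch_recommendations`, `homepage`, the fallback value) is hypothetical and chosen for illustration only; it is not taken from any particular tool. The assumption being tested is that a caller degrades to a default response when its dependency fails.

```python
import random
import time


class InjectedFault(Exception):
    """Raised by the fault injector in place of a real dependency error."""


def with_faults(call, error_rate=0.0, added_latency_s=0.0, rng=random.random):
    """Wrap `call` so every invocation is delayed and a fraction of them fail."""
    def wrapped(*args, **kwargs):
        if added_latency_s:
            time.sleep(added_latency_s)
        if rng() < error_rate:
            raise InjectedFault("injected failure")
        return call(*args, **kwargs)
    return wrapped


def fetch_recommendations(user_id):
    """Hypothetical dependency that returns personalized titles."""
    return [f"title-for-{user_id}"]


def homepage(user_id, recommender):
    """Service under test: expected to fall back to a default list on failure."""
    try:
        return recommender(user_id)
    except Exception:
        return ["popular-title"]


def run_experiment(trials=100):
    """Hypothesis: homepage returns a non-empty list even if every call fails."""
    faulty = with_faults(fetch_recommendations, error_rate=1.0)
    return all(homepage(user_id, faulty) for user_id in range(trials))
```

If `run_experiment()` returns `False`, the assumption did not hold and the fallback path needs attention. A real experiment would inject faults at the network or platform level rather than by wrapping a function, but the shape is the same: state an expectation, inject the degradation, then check the expectation.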

The practice of chaos engineering was pioneered by Netflix as a method to increase the stability and resilience of their growing, large-scale microservice application. For instance, to determine whether their services would degrade gracefully in the face of outages, Netflix would run a “Chaos Monkey” program that randomly terminated production instances, the equivalent of randomly “pulling cables” in the network. This allowed Netflix’ engineers to learn whether services timed out as intended, avoided ending up in a spinlock that chewed up system resources and employed a sound retry strategy.
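
As an illustration of what a “sound retry strategy” can look like, here is a minimal sketch of bounded retries with capped exponential backoff and jitter. It is a common general pattern, not a description of how Netflix implements retries. The function name, its parameters and their defaults are assumptions made for this example, and the wrapped call is assumed to enforce its own timeout by raising `TimeoutError`.

```python
import random
import time


def call_with_retries(call, max_attempts=3, base_delay_s=0.1, max_delay_s=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call `call()` up to `max_attempts` times.

    Between attempts, wait a random share of a delay that doubles with each
    attempt and is capped at `max_delay_s`. The last error is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(rng() * cap)  # jitter avoids synchronized retry storms
```

Bounding the number of attempts and spreading them out matters because unbounded or tightly looped retries can amplify the very outage they are reacting to.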

The practice gained wider use when Netflix, in 2011, open-sourced their “Simian Army” collection of failure-detecting, -inducing and -remediating tools. Since then, the practice has grown and a number of products and tools have become available, not least Netflix’ successor to the “Simian Army,” its “chaos automation platform,” ChAP.

While deliberately introducing failure into a production environment may seem counterintuitive at first, it is actually not all that different from other changes that are introduced into production environments on a daily basis, as Netflix’ resilience engineering team points out in a recent paper. In fact, despite the word “chaos,” the controlled nature of chaos engineering experiments makes them less risky than regular operational actions such as redeployments, upgrades or migrations, whose risks are often simply unknown.

Chaos experiments are typically designed to be narrow in scope so that causes and effects can be observed in reasonable isolation. Once the system behavior is understood, the aperture of the experiment can be widened until larger parts of the system can be tested under the specific, adverse conditions. In addition, experiments are often combined with other operational patterns. For instance, limiting an experiment’s “blast radius” helps minimize its potential fallout, and running it in a canary deployment helps contain its risk, in particular for automated experiments and when the canary is accompanied by a parallel baseline deployment that can be compared programmatically. Other ways to limit the risk of experiments include applying them only to a small traffic segment, limiting their duration or confining them to off-peak hours. Keeping experiments contained has the additional benefit that many of them can be run in parallel.
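
The three risk limits mentioned above (a small traffic segment, a bounded duration and an off-peak window) can be sketched as a single scoping check. This is an illustrative sketch only: `ExperimentScope`, `bucket` and `in_scope` are hypothetical names, and the off-peak window is assumed to fall within a single day.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class ExperimentScope:
    name: str
    traffic_fraction: float   # share of requests exposed, 0.0 to 1.0
    start: datetime           # when the experiment begins
    max_duration: timedelta   # hard stop after this much time
    window_start: time        # off-peak window (assumed not to span midnight)
    window_end: time


def bucket(experiment: str, request_key: str) -> float:
    """Map a request key to a stable value in [0, 1) for this experiment."""
    digest = hashlib.sha256(f"{experiment}:{request_key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def in_scope(scope: ExperimentScope, request_key: str, now: datetime) -> bool:
    """Return True only if this request may be subjected to the experiment."""
    if not (scope.start <= now < scope.start + scope.max_duration):
        return False  # outside the experiment's lifetime
    if not (scope.window_start <= now.time() < scope.window_end):
        return False  # outside off-peak hours
    return bucket(scope.name, request_key) < scope.traffic_fraction
```

Hashing the experiment name together with the request key keeps each request’s assignment stable for the duration of an experiment while letting parallel experiments select different traffic segments.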

Of course, there are limits to how much chaos experiments can be minimized. Experiments are only meaningful if they are run at a scale that allows conclusions to be drawn for the full environment. This means that a contained experiment and the actual production environment must be substantially identical in characteristics and behaviors.

The practice of chaos engineering extends beyond the experiments themselves. It also includes the operational aspects of creating and running them. For instance, it is crucial that experiments are actually carried out, not dropped during “busy” times. This requires a suitable level of automation, both for designing and for running experiments, which, given how specific to a service each experiment tends to be, is easier said than done. For example, Netflix developed an in-house service called “Monocle” that introspects service calls and uses the discovered data to generate experiments automatically.
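
To give a rough idea of what generating experiments from discovered call data could involve, here is a small sketch. It is not Monocle’s implementation, which is not described in this post. The `Dependency` and `Experiment` records, their fields and the `margin_ms` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    caller: str
    callee: str
    timeout_ms: int      # configured client timeout for this call
    has_fallback: bool   # whether the caller defines a fallback


@dataclass(frozen=True)
class Experiment:
    target: str
    fault: str           # "latency" or "failure"
    latency_ms: int
    expectation: str


def generate_experiments(deps, margin_ms=50):
    """Derive a small set of experiments for each discovered dependency."""
    experiments = []
    for dep in deps:
        target = f"{dep.caller}->{dep.callee}"
        on_failure = ("fallback served" if dep.has_fallback
                      else "error surfaced to caller")
        # Latency just under the timeout: the call should still succeed.
        experiments.append(Experiment(
            target, "latency", max(dep.timeout_ms - margin_ms, 0), "call succeeds"))
        # Latency just over the timeout: the timeout should fire.
        experiments.append(Experiment(
            target, "latency", dep.timeout_ms + margin_ms, on_failure))
        # Outright failure of the dependency.
        experiments.append(Experiment(target, "failure", 0, on_failure))
    return experiments
```

Deriving experiments from configuration in this way means that a newly discovered dependency is covered without anyone having to write a new experiment by hand.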

From Chaos Engineering to Cloud Traffic Control

Figure: Scale vs. complexity of service architectures (diagram placing applications on a scale vs. complexity grid). Most enterprises graduate from small-scale microservice applications to an organic architecture. Netflix’ microservice application is an outlier with respect to scale, but comparatively simple in nature: a movie streaming app with a recommendation engine.

Chaos engineering is an essential practice to improve the resilience of a distributed microservices application. But what happens when the application not only increases in scale but also in complexity?

When traditional enterprises become agile enterprises, the change has a profound effect on their IT organization. They adopt tools and processes to double down on developer productivity. They organize themselves around parallel development teams to execute more effectively on a microservices strategy and, likely, an in-house cloud services strategy. While this provides the business with great flexibility, it also leaves their operations, SRE and security teams with a sprawling landscape of hyperconnected services that is difficult to bring under control. Moreover, in order to support an agile enterprise, tech teams need to run this service landscape as an equally agile, continually growing and evolving organic architecture.

Organic architecture is a continually evolving service landscape that is operated with the aim to maximize its adaptability to the ever-changing needs of the business.

Such organic architectures are fundamentally different from distributed microservice applications. Unlike distributed microservice applications, which are engineered to implement a defined set of requirements in a way that is as deterministic as possible so they can be reasoned about, organic architectures prioritize adaptability over the ability to reason about them as a whole.

As a result, running such an organic architecture requires powerful stability and security controls. These controls need to be universally applicable, to any and all service interactions, and because organic architectures are by their nature dynamic, they need to operate in real-time. This is what a cloud traffic controller does. A cloud traffic controller enables operations, SRE and security teams to run an organic architecture, in real-time.

This is where a parallel to chaos engineering comes in. The stability and security controls required by an organic architecture can be viewed as “continuous chaos engineering,” except that they are not experiments but activities that are essential to operating an organic architecture. And, instead of carrying out experiments every now and then, organic architecture requires the ability to exert control continually.

As businesses continue to transform digitally, their applications do not merely grow in scale to become the size of Netflix. Instead, the application infrastructure grows in complexity as well as scale. The accompanying pressure to transform into an agile organization means that most enterprises will graduate from individual microservice applications to a rapidly growing and evolving organic architecture that can adapt to their changing business needs. This also means that they will have to graduate from chaos engineering to full cloud traffic control.

Like functional testing, which continues to be useful in an organic architecture—albeit at the limited scale of a developer’s horizon of responsibility—chaos engineering continues to play a role in organic architecture. It is still valuable to test assumptions about how individual service groups interact locally and to conduct experiments to learn about their behaviors. At the global level of the organic architecture, however, the purpose of control is not to discover but to accomplish: to implement operational patterns that are both predictable and effective. In other words, cloud traffic control is a superset of chaos engineering.

The Glasnostic cloud traffic controller.

Glasnostic is a cloud traffic controller that provides businesses with the real-time stability and security controls they need to run an organic architecture. As a control plane for operations, SRE and security teams, it allows enterprises to manage the sprawling and increasingly connected application landscapes that their successful cloud and microservice strategies create. At the same time, Glasnostic helps engineering teams by making chaos engineering tests and experiments much easier to define, run and analyze.

Summary

“Chaos engineering” is the methodical design and implementation of experiments to test assumptions about—and discover previously unknown behaviors of—distributed service architectures. Given that communication-related degradations and failures are frequent and natural characteristics of distributed architectures, particularly at scale, chaos engineering is an essential practice to improve the resilience of a distributed microservices application.

However, as digital transformation and the associated need to transform into an agile organization cause enterprises to adopt an organic architecture, the need to control stability and security predictably and effectively at the global level takes precedence over the desire to discover and learn at the local level. As a result, testing and experimentation in organic architectures are displaced by the continuous, real-time implementation of operational patterns—by cloud traffic control.

The ability to control for stability and security in real-time is what ultimately matters when operating a resilient, organically growing and evolving service architecture. In a future post, we’ll compare how chaos experiments can be run using Kubernetes, service meshes like Istio or Linkerd and a cloud traffic controller like Glasnostic.