Photo of a control instrument at Oak Ridge National Laboratory
Circuit breaking a nuclear reactor in the 1940s.

Preventing Systemic Failure: Circuit Breaking—What it is and How it Works, Part 1

This is the first of a two-part series on circuit breaking. In this post, we cover the pattern and how it is approached differently by developers and operators. In part two, we’ll explore its typical use cases and how it is implemented in modern service middleware.

Service-to-service calls across the network are the essence as well as the bane of microservice architectures. It is not a question of whether such calls will fail, but rather when they will fail and how often. Service dependencies may fail due to logical errors, capacity issues or any other external factor. In fact, failure is one of the better outcomes. The worst case is a call that simply “hangs” until it times out.

Such failures are bad enough in standalone applications, but if they are allowed to ripple and compound through a service architecture, they can quickly morph into a series of catastrophic events. For example, a load spike that at first might choke off only a single data store can quickly lead to widespread slowness in downstream systems. In turn, this can lead to exhausted connection pools, interference by auto-scaling, further downstream failures or unexpected behaviors that can cause the situation to quickly spiral out of control.

Therefore, the critical questions for microservice practitioners are: how do I prevent such failures from cascading and compounding through the architecture and how do I ensure a service’s performance and availability if one or more of its dependencies fail? The answer to these questions is often to protect important service calls with a circuit breaker.

Circuit breaking is a fundamental pattern designed to minimize the impact of failures, to prevent them from cascading and compounding, and to assure end-to-end performance.

Like its analog from the world of electrical circuits, a circuit breaker exists to protect systems from overloading and to prevent failures from, quite literally, burning down the house.

Microservices and The Circuit Breaker Pattern

The circuit breaker pattern was already a fundamental operational pattern when Michael Nygard introduced it to the wider audience of software practitioners in his book “Release It!” It has lived a double life as both an operational and development pattern ever since. As in electrical circuits, the basic pattern involves injecting a switch that can be either closed or open into a line of communication. If the breaker is closed, calls are allowed to pass through. Conversely, if the breaker is open, calls fail immediately (figure 1).

The difference between circuit breaking as an operational pattern and circuit breaking as a development pattern lies in what causes these states to change and who changes them.

Circuit Breaking as a Developer Pattern

Because developers care foremost about their own code, circuit breaking as a developer pattern is in effect a compensation strategy that minimizes the impact that failures in dependencies may have on that code.

Circuit breaking as a developer pattern is a compensation strategy designed to minimize a service’s exposure to a failed dependency.

As a result, developers gravitate towards circuit breakers that they can link against or that otherwise sit close to their service, such as Hystrix. Developers can spend a considerable amount of time tweaking the logic that opens and closes the breaker. Although many variations of such logic exist, the basic mechanism involves the breaker library passing calls through as long as the breaker is closed and failing them immediately when the breaker is open (figure 1). While calls are being passed through, the breaker checks the return status of the responses. Failed calls are counted towards a threshold or limit, while successful calls reset the failure count. If failures exceed the allowed maximum, the library opens the breaker. Because this has the effect of failing all subsequent calls, the library sets a timer at the same time. When the timer fires, it lowers the limit to a small “canary” value and closes the breaker again. This state is also referred to as being “half open”. The first “canary” call that succeeds resets the breaker back to its “fully closed” state, a failed canary opens it again, and the cycle continues.

Circuit breaker logic diagram

Figure 1. Traditional circuit breaker logic. Client calls are passed through as long as the breaker remains closed and are failed immediately if the breaker is open. While it is closed, the breaker observes service responses for failures and “trips” itself as soon as a predefined threshold is reached. After a while, a timer partially closes the breaker again to let a few “canaries” through. If they succeed, the breaker is fully closed again. If not, it opens again. This mechanism helps clients avoid excessive slow-down due to a failed service.
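To make the logic in figure 1 concrete, here is a minimal, single-threaded sketch in Python. It is not how Hystrix or any particular library implements the pattern. The names `CircuitBreaker`, `CircuitOpenError`, `max_failures` and `reset_timeout` are hypothetical, and the sketch assumes that any raised exception counts as a failed call and that the breaker trips once the failure count reaches the limit.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker with closed, open and half-open states."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # failures tolerated before tripping
        self.reset_timeout = reset_timeout  # seconds to stay open before canaries
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None               # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"              # timer has fired: let canaries through
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("breaker is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed canary re-opens at once; otherwise trip at the limit.
            if state == "half-open" or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success resets the failure count and fully closes the breaker.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each protected dependency call, for example `breaker.call(fetch_recommendations, user_id)` with a hypothetical `fetch_recommendations`, and treat `CircuitOpenError` as the signal to fall back. A production breaker would also need locking and a cap on concurrent canary calls, both of which this sketch leaves out.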

There are many variations of this basic circuit breaking mechanism. For instance, breakers may expire accumulated failures after some time or employ failure budgets based on moving averages. However, as these variations become more sophisticated, they also depend increasingly on what the code in question aims to achieve. For example, a call to a recommendation engine may require a circuit breaker combined with a shorter timeout while a call to an authentication service should likely only be failed if it is known to be unresponsive. As a result, none of these variations have so far been able to replace the basic logic of the developer pattern. This is why circuit breaking as a compensating developer pattern is a fundamental, yet fairly limited technique for minimizing a service’s exposure to dependency failures.
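One of the variations mentioned above, expiring accumulated failures after some time and tripping on a failure rate rather than a raw count, might be sketched as follows. The class `FailureWindow` and its parameters `window`, `max_failure_rate` and `min_calls` are assumptions made for illustration only, and a simple sliding window stands in for whatever moving average a real library would use.

```python
import time
from collections import deque


class FailureWindow:
    """Tracks call outcomes over the last `window` seconds (illustrative)."""

    def __init__(self, window=60.0, max_failure_rate=0.5, min_calls=10,
                 clock=time.monotonic):
        self.window = window                      # seconds an outcome stays relevant
        self.max_failure_rate = max_failure_rate  # failure budget as a fraction
        self.min_calls = min_calls                # avoid tripping on tiny samples
        self.clock = clock                        # injectable so tests control time
        self.events = deque()                     # (timestamp, failed), oldest first

    def _expire(self):
        # Drop outcomes that have aged out of the window.
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def record(self, failed):
        self.events.append((self.clock(), bool(failed)))
        self._expire()

    def failure_rate(self):
        self._expire()
        if not self.events:
            return 0.0
        failed = sum(1 for _, was_failure in self.events if was_failure)
        return failed / len(self.events)

    def should_trip(self):
        self._expire()
        enough_calls = len(self.events) >= self.min_calls
        return enough_calls and self.failure_rate() > self.max_failure_rate
```

A breaker could consult `should_trip()` after each recorded outcome instead of comparing a counter against a fixed limit. Sensible values for the window and the budget depend on what the calling code is trying to achieve, which is exactly the limitation described above.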

Circuit Breaking as an Operational Pattern

Because operators are responsible for the health and systemic stability of the overall service landscape, they approach circuit breaking primarily from the perspective of protecting services, not merely isolating clients. As a result, circuit breaking as an operational pattern often resembles load shedding, i.e. the pattern of terminating requests to relieve services of pressure. An operations team may, for instance, notice that the latencies of a storage cluster began to increase by 10ms a few minutes ago and decide to shed load by circuit-breaking non-mission-critical services.

Circuit breaking as an operational pattern aims to manage systemic stability and availability by relieving systems under duress of pressure.

More complex architectures often require operators to refine how circuit breaking is applied. In our previous example of attempting to remediate the storage cluster’s slowness, the operations team will likely get better results (while causing less downstream disruption) by circuit-breaking long-running queries first and then adjusting the breaker as needed. Such refinements are often achieved by combining operational patterns such as circuit breakers with timeouts.
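The storage cluster example can be sketched as a small admission policy that an operator adjusts step by step. This is an illustration under stated assumptions, not a description of any gateway or mesh: the types `Request` and `SheddingPolicy` are hypothetical, as is the idea that each request carries an `expected_ms` estimate of its running time.

```python
from dataclasses import dataclass


@dataclass
class Request:
    client: str
    critical: bool     # is the calling service mission-critical?
    expected_ms: int   # hypothetical estimate of how long the query will run


@dataclass
class SheddingPolicy:
    engaged: bool = False            # the operator switches the breaker on or off
    max_expected_ms: int = 1000      # break queries expected to run longer than this
    shed_non_critical: bool = False  # second step: also break non-critical callers

    def admits(self, req: Request) -> bool:
        if not self.engaged:
            return True              # breaker closed: everything passes
        if req.expected_ms > self.max_expected_ms:
            return False             # long-running queries are broken first
        if self.shed_non_critical and not req.critical:
            return False
        return True


# First response to rising latencies: only break long-running queries.
policy = SheddingPolicy(engaged=True, max_expected_ms=500)

# If pressure persists, the operator tightens the policy further.
policy.shed_non_critical = True
```

Breaking by expected duration first keeps short requests from non-critical services flowing, which is the "less downstream disruption" the example aims for.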

Another big difference between applying circuit breaking as an operational pattern and applying it as a developer pattern relates to how circuit breaking capabilities are made available. For developers, circuit breakers are typically provided in the form of a library that applications can link against. With a few exceptions, operators don’t have this option. Instead, they rely on the circuit-breaking capabilities of separately deployed middleware such as API gateways or of an underlying platform such as a service mesh.

However, with the advent of large-scale, organic architectures, operators need the ability to apply circuit breaking not just between individual services, but between arbitrary and at times extensive sets of services. Implementing circuit breaking between sets of services is excruciatingly difficult to achieve with middleware or platforms that are designed to connect pairs of services in fairly static, small or mid-sized architectures. This is because each proxying instance has to be configured to differentiate between all kinds of clients as well as services and to do so in a manner that is consistent with every other instance of the cluster. Cloud traffic controllers such as Glasnostic are much better suited for this task.

Finally, due to the complex and unpredictable nature of emergent behaviors in large-scale and dynamic architectures, circuit breaking for operational purposes can no longer rely on a static, “set-and-forget” configuration based on timers, failure budgets and reset canaries. Instead, operators need to be able to open or close breakers on demand and to adjust their functionality in real time.
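As a rough illustration of what "on demand" could look like, the sketch below keeps a table of breakers between sets of services that can be opened and closed at runtime. It does not reflect how Glasnostic or any other product works. The class `BreakerRules` is hypothetical, as is the choice to describe sets of services with shell-style name patterns such as `batch-*`.

```python
import threading
from fnmatch import fnmatchcase


class BreakerRules:
    """Operator-controlled breakers between sets of services (illustrative)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._open = set()   # (source pattern, destination pattern) pairs

    def open(self, sources, destinations):
        """Cut off calls from every matching source to every matching destination."""
        with self._lock:
            self._open.add((sources, destinations))

    def close(self, sources, destinations):
        """Restore traffic for a previously opened pair of patterns."""
        with self._lock:
            self._open.discard((sources, destinations))

    def allows(self, source, destination):
        with self._lock:
            rules = list(self._open)   # snapshot so matching runs outside the lock
        return not any(
            fnmatchcase(source, src) and fnmatchcase(destination, dst)
            for src, dst in rules
        )


rules = BreakerRules()
# Relieve the storage tier by breaking all batch jobs, whatever their names.
rules.open("batch-*", "storage-*")
```

Because the rules live in one place and match on sets rather than on individual pairs, there is no per-proxy configuration to keep consistent, and the operator can call `rules.close("batch-*", "storage-*")` as soon as the pressure subsides.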