Cloud native microservice architectures evolve rapidly and tend to become more complex as they do. This makes it important for services to be as resilient as possible. Unsurprisingly, resilience depends in part on how defensively applications are written. However, it is important to focus not only on thread-of-execution resilience but also on aggregate resilience, as we laid out in The Two Axes of Observability. After all, applications don’t exist in a vacuum. They interact with a growing number of dependencies, which in turn have dependencies of their own, and most share their environment with other applications.
One key way to strengthen the reliability of the entire landscape is to ensure that no application or service is ever overloaded. An overloaded system can potentially trigger a cascading failure or, worse, “monsters,” as we talked about in our post on dependencies.
The question developers need to ask themselves is: How can we ensure reliability in production? Backpressure and Bulkheads are two effective operational patterns that can be applied to prevent cascading failures.
Another effective pattern is Load Shedding. LinkedIn Engineering just published a blog post about their Hodor load shedding system, which provides a good example of how load shedding can help strengthen a system’s reliability. Hodor stands for “Holistic Overload Detection and Overload Remediation.” Author Brian Barkley summarizes the motivation behind Hodor:
"We now operate well over 1,000 separate microservices running on the JVM, and each has its own set of operational challenges when it comes to operating at scale. One of the common issues that services face is becoming overloaded to the point where they are unable to serve traffic with reasonable latency."
Let’s dive into it and see how it works.
Hodor consists of three components: an overload detector, a request adapter, and a load shedder. The overload detector runs as an agent within the microservice and monitors a meaningful metric such as Java Virtual Machine (JVM) time. When this metric exceeds a given threshold, the overload detector signals an overload condition.
The request adapter is specific to the REST framework used by a service and hooks into the framework filter chain. It queries the overload detector, passes the result to the load shedder, and then carries out the load shedder’s verdict. As a result, shedding is ultimately done in a platform-specific manner.
The load shedder then decides whether to shed load or not. Load shedding happens at the granularity level of a single request. The load shedder intelligently adjusts the degree of shedding based on several factors, including contextual information about the request that is provided by a platform-specific request adapter.
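To make the division of labor between the three components concrete, here is a minimal sketch of how they could fit together. It is not LinkedIn's code: the class names, the dictionary-based request, the "critical" flag, and the 503 response are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Request = Dict[str, object]   # hypothetical request shape
Response = Tuple[int, str]    # (status code, body)


@dataclass
class OverloadDetector:
    """Signals overload when a sampled metric exceeds a threshold."""
    read_metric: Callable[[], float]  # hypothetical metric source
    threshold: float

    def is_overloaded(self) -> bool:
        return self.read_metric() > self.threshold


class LoadShedder:
    """Decides, per request, whether the request should be dropped."""

    def should_shed(self, overloaded: bool, request: Request) -> bool:
        # Assumption: requests flagged as critical are never shed.
        return overloaded and not request.get("critical", False)


class RequestAdapter:
    """Framework-side filter: query the detector, ask the shedder, enforce the verdict."""

    def __init__(self, detector: OverloadDetector, shedder: LoadShedder,
                 handler: Callable[[Request], Response]) -> None:
        self.detector = detector
        self.shedder = shedder
        self.handler = handler

    def handle(self, request: Request) -> Response:
        overloaded = self.detector.is_overloaded()
        if self.shedder.should_shed(overloaded, request):
            # Assumption: shedding is surfaced to the client as HTTP 503.
            return 503, "overloaded"
        return self.handler(request)
```

In this sketch only the adapter knows about the request and response types, which mirrors why shedding ends up being carried out in a platform-specific manner.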
Hodor supports different load shedding strategies. Most of the time, however, the best strategy is to limit request concurrency. Because service behavior is dynamic, with traffic patterns changing over time through usage changes or intra-application load changes, concurrency limits need to be adaptive. Hodor’s load shedder therefore starts with a known good value, then periodically runs experiments to adjust the limit.
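The idea of an adaptive concurrency limit can be sketched as follows. The post does not spell out how Hodor's experiments work, so this example swaps in a simple additive-increase, multiplicative-decrease rule driven by observed latency. The class name, the latency target, and the 0.8 back-off factor are assumptions, and the sketch is not thread-safe.

```python
class AdaptiveConcurrencyLimiter:
    """Caps in-flight requests and adjusts the cap from latency observations."""

    def __init__(self, initial_limit: int, min_limit: int = 1,
                 max_limit: int = 1000, latency_target_ms: float = 100.0) -> None:
        self.limit = initial_limit            # the "known good" starting value
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.latency_target_ms = latency_target_ms
        self.in_flight = 0

    def try_acquire(self) -> bool:
        """Admit the request if there is room; otherwise it should be shed."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Call when an admitted request completes."""
        self.in_flight = max(0, self.in_flight - 1)

    def run_experiment(self, observed_latency_ms: float) -> None:
        """Probe upward while latency is healthy, back off when it degrades."""
        if observed_latency_ms <= self.latency_target_ms:
            self.limit = min(self.max_limit, self.limit + 1)
        else:
            self.limit = max(self.min_limit, int(self.limit * 0.8))
```

A caller would wrap each request in try_acquire() and release(), and feed run_experiment() a recent latency measurement on a timer.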
Hodor also supports retry handling features, such as instructing clients to retry with other instances during an overload event and using retry budgets to avoid overloading the entire service. Testing is done with dark canaries to avoid detecting false overload situations.
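A retry budget, in its general form, caps retries at a fraction of regular traffic so that retries cannot multiply load during an overload event. The sketch below shows that general idea only. The class name and the ratio-based accounting are assumptions, not a description of Hodor's implementation.

```python
class RetryBudget:
    """Allows retries only up to a fixed fraction of the requests seen so far."""

    def __init__(self, ratio: float = 0.1, min_retries: int = 0) -> None:
        self.ratio = ratio              # e.g. 0.1 allows one retry per ten requests
        self.min_retries = min_retries  # small allowance for low-traffic clients
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count a first-attempt request toward the budget."""
        self.requests += 1

    def try_retry(self) -> bool:
        """Spend one unit of budget if available; otherwise refuse the retry."""
        allowed = self.min_retries + self.ratio * self.requests
        if self.retries < allowed:
            self.retries += 1
            return True
        return False
```

A production version would typically count requests over a sliding time window rather than over the lifetime of the process.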
Introducing a load shedding framework like Hodor into the application landscape has three key benefits.
It avoids capacity issues. Because all software behaves pathologically at the limit, correct capacity planning and instance right-sizing are essential for operating a production environment successfully. Hodor provides a form of insurance against unexpected capacity issues.
It prevents issues from cascading through the landscape. When capacity runs low, unpredictable chains of—often disruptive—events tend to cascade through the system. Detecting overload serves as an early warning system, and shedding load prevents cascades from forming.
It detects overload continuously, based on a meaningful metric. This is significantly better than common health checks based on HTTP status codes.
Hodor works across the entire LinkedIn landscape, with Rest.li, Netty and the Play framework, and the framework-specific request adapter component can be ported to other frameworks. In addition, it can adapt to specific service needs.
Using a load shedding framework like Hodor also has drawbacks, however:
It doesn’t account sufficiently for differences between applications. Different applications behave differently depending on language, framework, architecture, and what they do. Overload looks different enough from one application to the next that it is worth considering basing load shedding decisions on external measures of application performance instead.
It requires careful experimentation to ensure that load shedding is applied only when absolutely necessary. Shedding entire requests is a fairly “brutal” strategy that can have substantial side effects.
It is a crude remedy. The load shedder verdict is binary: It can only respond with a “Drop the entire request” or “Don’t drop it” verdict. There is no in-between. Depending on the nature of the affected requests, a more nuanced response could be needed.
It requires an agent. Hodor only works with the JVM and uses a framework-specific request adapter that works with the request frameworks that LinkedIn uses: Rest.li, Netty, and Play. So, it’s not a strictly universal solution. And, as always, supporting agents does incur substantial costs over time.
You may not be running a giant application landscape like LinkedIn, but if you’re running a cloud native production environment and deploying often, you’ve likely seen cascading issues like the ones that Hodor seeks to address.
Maybe you are redeploying a service with a new feature or security patch. The tests looked great, but response times are suddenly slow in production, putting the service under heavy load. You don’t know what’s causing the issue.
The first response might be to roll back. But this is a Catch-22: Once the deployment is rolled back, you can’t investigate it. You need real production traffic—in fact, precisely the current traffic levels—to diagnose what’s going on and take steps to ensure that the service works when it is released again.
A better way is to measure load levels at runtime across the landscape and then control them enough to prevent the issue from developing. This keeps the deployment live and gives the team time to diagnose and address the issue while the situation is under control.
This is how Glasnostic approaches capacity issues and prevents them from cascading. Similar to Hodor, Glasnostic applies in-band data path controls that calibrate and auto-tune demand patterns and disruptive interaction behaviors.
However, unlike Hodor, Glasnostic is not limited to load shedding alone and can use other methods as needed, such as applying backpressure.
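The difference between the two patterns can be illustrated with a bounded queue from Python's standard library. This is a generic sketch of load shedding versus backpressure, not a description of how Glasnostic or Hodor implement either one, and the function names are hypothetical.

```python
import queue


def submit_with_shedding(q: queue.Queue, item: object) -> bool:
    """Load shedding: if the queue is full, drop the item immediately."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


def submit_with_backpressure(q: queue.Queue, item: object, timeout_s: float) -> bool:
    """Backpressure: make the caller wait for capacity, up to a timeout."""
    try:
        q.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False
```

With shedding, the work is lost and the caller learns about it right away. With backpressure, the caller is slowed down to the pace of the consumer, which pushes the overload signal upstream instead of discarding requests.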
Glasnostic is designed to manage capacity (and other) issues at runtime and as flexibly as possible. This brings several advantages. Specifically, it:
Adapts to the specific behaviors of applications. We measure outward service behaviors, which is a more comprehensive measure than examining, for instance, JVM time. We look at how these measures relate to each other and how they change over time to get a full picture of the service behavior. This unlocks a whole library of patterns, not just load shedding as a singular solution.
Assesses situations based on external behavior rather than internal data analysis. External behavior is what matters to dependencies, not internal metrics. If a service slows down under load, we know it is degrading and can take action, potentially even after running a brief inline experiment. This represents a significantly more general approach than relying on deep analysis of internal metrics.
Aims to apply fine-tuned controls. Practitioners shouldn’t have to decide whether to drop requests or not. Instead, we prefer to fine-tune service behaviors by shaping any or all key metrics, thus significantly reducing policy risk.
Is agentless. Solutions to address overload and similar production issues must be able to intervene anywhere in the landscape. They must be universal. Like air traffic controllers, SRE teams must see everything in the landscape to operate successfully. Agents make this difficult and are an expensive way to achieve this.
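To illustrate what a graduated control based on external behavior could look like, as opposed to a binary drop-or-keep verdict, consider the following sketch. It is purely illustrative and does not represent Glasnostic's implementation. The function name, the latency-ratio rule, and the floor value are assumptions.

```python
def admission_fraction(observed_ms: float, baseline_ms: float,
                       floor: float = 0.1) -> float:
    """Return the fraction of traffic to admit, given externally observed latency.

    At or below the baseline latency, everything is admitted. Above it, the
    admitted fraction shrinks in proportion to the slowdown, but never below
    the floor, so the service always receives some traffic.
    """
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    if observed_ms <= baseline_ms:
        return 1.0
    return max(floor, baseline_ms / observed_ms)
```

Because the input is latency as seen from outside the service, the same rule applies regardless of language or framework, and the output is a dial rather than a switch.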
Hodor is LinkedIn’s response to the need to manage overload conditions at runtime. Load shedding effectively addresses such capacity issues and prevents them from cascading through the landscape. However, in practice, load shedding has issues of its own: It is difficult to adapt to innate differences between service behaviors and requires continuous experimentation to do so effectively. Shedding load by dropping requests is also a relatively crude measure. Finally, Hodor requires agents, which is costly and limits its universality.
In contrast, managing capacity issues with Glasnostic allows for much more nuanced responses. Glasnostic improves on Hodor in four critical ways: It easily adapts to the specific needs of services, assesses situations based on external behavior, unlocks a whole set of control primitives rather than just load shedding, and, finally, is entirely agentless, thus radically simplifying the adoption of a runtime control solution throughout the landscape.