We’ve all been there. We’ve all gone down the rabbit hole, spending hours searching for that root cause, only to end up with a list of circumstances that contributed to our issue but no good way to act on any of them. There may be too many places to tweak, the fix may be unclear, its effect difficult to predict, or it may be simply out of our control.
Sound familiar? These experiences are becoming more common as we compose ever larger architectures. We connect ever more services, using ever more technologies, across ever more diverse environments—multi-cloud, hybrid, edge, on premises. The business loves the speed with which we can spin services up and stitch them together in the cloud. But every service, every technology, and every environment brings its own quirks to the mix. No wonder the architecture that results from this stitching-together—quite literally a landscape of connected services—can behave erratically.
When these behaviors get out of hand and threaten to bring the business down, it is not time to hunt for root causes. When chaos reigns, we must take control. We must “focus on fast detection and response.”
A few years ago, a large retailer experienced a cascading failure that lasted for hours. The company had long adopted a culture of innovation and thus ran a vast and heterogeneous architecture that continued to evolve over time. This architecture included a set of Kubernetes clusters running containerized services and an OpenStack environment that hosted VM-based workloads, including several Kafka clusters.
One day, disaster struck. Systems went down and, as always, the question on everybody’s mind was, “What just happened?” And so the hunt for a root cause began.
8:00 am: Slowness alerts flood in. Teams are dispatched to investigate.
9:00 am: A DevOps team notices that Kubernetes seems to be “thrashing.” Some pods appear to be continually rescheduled to run on different nodes.
9:30 am: The team determines that the thrashing is not tied to any particular pods; eventually, all pods are rescheduled.
10:00 am: Upon deeper inspection, it becomes evident that the behavior only affects some clusters, not others.
11:00 am: Teams agree that the Kubernetes behavior seems only to affect the largest clusters.
12:30 pm: Epiphany: the affected clusters are not just large; they run large nodes and thus many pods per node. Every now and then, some nodes run out of CPU. This starves the node’s Docker daemon, so Kubernetes declares the node “unhealthy” and reschedules its pods onto other nodes. But why is the node running out of CPU?
1:00 pm: It becomes evident that the logging sidecars, which are automatically injected into every pod to standardize logging, create a “thundering herd”: they all try to log at the same time. Usually, logging doesn’t use much CPU at all, but when all sidecars try to log simultaneously, CPU spikes. Why are they all logging at the same time?
1:30 pm: The sidecars log to Kafka clusters that run on OpenStack in a different part of the infrastructure, and those clusters are only intermittently available. There are extended dropouts, and every time Kafka comes back online, every sidecar tries to flush its backlog at once.
2:30 pm: The OpenStack team confirms that it applied a routine upgrade to its Neutron stack. This should’ve been a “30-second” upgrade, but they are still battling networking issues.
This “30-second,” “routine” upgrade snowballed into hours of application outage. It took the operations teams the better part of a day to reach the bottom of the rabbit hole, only to realize that the resolution was entirely out of their control.
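The thundering herd discovered at 1:00 pm is a classic failure mode with a well-known mitigation: clients that reconnect after an outage should wait a randomized, exponentially growing delay rather than retrying in lockstep. A minimal sketch of full-jitter backoff in Python (the function name and parameters are illustrative, not the retailer’s actual sidecar code):

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter exponential backoff: pick a random delay in
    [0, min(cap, base * 2**attempt)). Because every client draws its
    own delay, reconnects spread across the window instead of landing
    on the recovered broker all at once."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

# A thousand sidecars reconnecting on attempt 3 spread their flushes
# across an 8-second window rather than spiking together.
delays = [backoff_delay(attempt=3) for _ in range(1000)]
```

Had the logging sidecars jittered their reconnects this way, a Kafka dropout would have produced a gentle ramp of log traffic instead of a simultaneous CPU spike on every node.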
What if the teams had focused on remediation instead of looking for a root cause? What if they had had the ability to detect interaction behaviors between services and systems quickly and to restore order right away?
Here’s how that alternative timeline might have looked:
8:00 am: Slowness alerts come flooding in. Using their dashboards, operators examine current system interactions and discover request spikes, dropouts, and generally “choppy” interaction behavior between logging sidecars and Kafka clusters. In itself, such choppiness is already a sign of trouble, but it is the recurring pattern of prolonged dropouts followed by large spikes that alerts the team to the thundering herd of loggers that is causing Kubernetes to thrash.
8:01 am: With a few mouse clicks, the team applies backpressure against the loggers to flatten the spikes. (They still want everything logged, but it is critical that the loggers don’t all try to do it at once.) To be safe, they also create a circuit breaker that kills any long-running requests that have no chance of completing anyway.
8:02 am: Everything is running smoothly, and the cascading chain of events is averted.
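The circuit breaker the operators reach for at 8:01 am can be as simple as a failure counter and a timestamp. The sketch below is a generic illustration of the pattern, not the team’s actual tooling; the class and parameter names are invented:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after max_failures consecutive
    failures the breaker opens and rejects calls immediately, giving
    the downstream system (here, Kafka) room to recover. After
    reset_after seconds it lets one trial call through again."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the breaker again
        return result
```

Wrapping each sidecar’s Kafka send in `breaker.call(...)` makes requests fail fast while Kafka is down, instead of piling up as long-running requests that have no chance of completing.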
Focusing on rapid detection and response at the service-interaction level, instead of going down the rabbit hole in search of a root cause, would have saved the company eight hours of outage. The architecture wouldn’t have come crashing down, and the teams could still ferret out any root causes on their own time—or not.
With companies moving more and more towards an agile, organically evolving landscape of services, operators must detect unexpected interaction behaviors rapidly and respond to them in real time, as they happen. Without this ability, outages are bound to get worse as the architecture grows.
As a service landscape grows and evolves, observability is no longer enough. Operations teams must invest in the ability to observe rapidly, at the right level, and to take control in real time.