With today’s increasingly complex and evolving cloud architectures, monitoring and observability attract significant attention but are at times still poorly understood. When should I rely on monitoring and when on observability? What should I monitor, and what should I “observe”? What even is the difference between monitoring and observability?
Typical definitions view monitoring as tracking key metrics, while observability is seen as its more powerful cousin that lets users “slice and dice” metrics and ask arbitrary questions. There is also an awareness that observability involves large amounts of high-dimensional data. But beyond that basic understanding, there is little consensus about what constitutes “best practice” for monitoring and observability. This leads to unsatisfying advice such as “no one size fits all” and “teams need to do what’s right for their organization.”
However, looking at monitoring and observability from a “how deep can I go and how much data can I collect?” perspective obstructs a much more interesting and much more important question: should I look at functional aspects of code execution, or should I look at aggregate behaviors of services?
Which side of this question we come down on depends critically on how complex our environment is and how rapidly it evolves.
Traditionally, application environments were either simple, consisting of few application components, or static, i.e., evolving slowly, if at all.
Startups, for instance, enter their market with only one product and therefore tend to have only one application. This application is architected, designed and coded the way we have always architected, designed and coded applications: from a single, complete architectural blueprint. Legacy environments in enterprises, on the other hand, have matured over the years and tend to change rarely, if ever.
As a result, functional defects in simple (yet potentially dynamic) application environments are best diagnosed based on correlating logs and examining traces, while static (yet potentially complex) application environments tend to have a layer of monitoring agents that are tuned to capture their key performance indicators and can be tracked readily.
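The log-correlation approach for simple environments can be sketched in a few lines. The following is a minimal illustration with hypothetical log entries and an assumed `request_id` correlation field; real systems would use trace or span ids propagated through the logging pipeline:

```python
from collections import defaultdict

# Hypothetical log entries; "request_id" is an assumed correlation field.
logs = [
    {"request_id": "r1", "service": "api",     "msg": "received order"},
    {"request_id": "r2", "service": "api",     "msg": "received order"},
    {"request_id": "r1", "service": "billing", "msg": "charge failed"},
    {"request_id": "r2", "service": "billing", "msg": "charge ok"},
    {"request_id": "r1", "service": "api",     "msg": "returned 500"},
]

def correlate(entries, key="request_id"):
    """Group log entries by correlation id, preserving arrival order."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry[key]].append(entry)
    return dict(grouped)

by_request = correlate(logs)
# The full story of the failed request "r1" can now be read in isolation:
for entry in by_request["r1"]:
    print(entry["service"], "-", entry["msg"])
```

Once entries are grouped this way, the cross-service history of a single failing request can be read end to end, which is exactly the diagnostic workflow that suits a simple environment.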
But today’s business needs and cloud technologies have led to complex and dynamic environments that consist of many interconnected applications and change rapidly if not continually. Operators face an explosion of services that run in ever more locations, on premises, in hybrid topologies, across multiple clouds, at the edge and on ever more diverse technologies.
At the same time, development is now widely parallelized to maximize feature velocity and avoid the complexity of tightly coupled releases. “Two-pizza” teams deploy independently of each other, many times a day and, increasingly, straight to production. The result is a web of applications that changes continuously: an organically evolving service landscape.
Disruptive interaction behaviors in complex and dynamic environments can’t be prevented with “better code.”
Service landscapes are prone to unpredictable and disruptive interaction behaviors between their connected applications and services that can manifest in various ways, such as noisy neighbors, retry storms, “thundering herds,” feedback loops, random ripple effects or cascading failures. These behaviors are large-scale in nature and involve non-linear chains of events that make them highly unpredictable and disruptive.
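A small back-of-the-envelope calculation illustrates why such behaviors are non-linear. Under a naive retry policy, the expected number of requests per logical call grows geometrically with the failure rate, so a partial outage multiplies load onto the very service that is already struggling. The policy and numbers below are illustrative, not drawn from any particular system:

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected requests sent per logical call when every failed
    attempt is retried, up to max_retries times.
    Geometric series: 1 + p + p^2 + ... + p^max_retries."""
    return sum(failure_rate ** k for k in range(max_retries + 1))

# Healthy system: retries add almost no load.
print(expected_attempts(0.01, 3))   # ~1.01

# Partial outage: the same retry policy more than triples traffic
# onto the degraded service -- the seed of a retry storm.
print(expected_attempts(0.9, 3))    # ~3.44
```

The jump from roughly 1x to roughly 3.4x load happens without any code change: the same policy that is harmless in a healthy system amplifies failure in a degraded one.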
The unpredictability of these behaviors also makes them impossible to prevent with “better code”—they must be controlled at runtime, as they occur.
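One widely used runtime control, offered here as an illustrative sketch rather than anything the article prescribes, is a circuit breaker: after repeated failures of a downstream dependency, calls are rejected for a cooldown period instead of amplifying the failure. Class name, thresholds and the injectable clock below are all assumptions:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    reject calls for `cooldown` seconds instead of hammering the
    struggling dependency."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            self.opened_at = None   # half-open: let a probe call through
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()

# Usage with a controllable clock (for illustration):
now = [0.0]
cb = CircuitBreaker(threshold=2, cooldown=10.0, clock=lambda: now[0])
cb.record(False); cb.record(False)   # two failures -> breaker opens
print(cb.allow())                    # False: shedding load
now[0] = 11.0
print(cb.allow())                    # True: cooldown elapsed, probe allowed
```

The point is not this particular implementation but the shape of the solution: the pathological interaction is detected and damped at runtime, in operations, not prevented in code.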
Because simple or static application environments have fundamentally different needs from complex and dynamic ones, it is vital to adjust what gets observed and what does not.
The first axis of observability is the horizontal, or x-axis. This is the traditional axis of observability, where we focus on transactions or, more generally, threads of execution. It is the axis along which requests travel “from left to right and back,” and it is the axis developers are most interested in. As developers, we write functions and compose them into threads of execution and, ultimately, transactions. The x-axis represents the “happy path” that we want to see work correctly in production; this is where tracing, debugging, performance, error codes and stack contexts matter. Consequently, the x-axis is what a large part of the monitoring industry focuses on.
The second axis of observability is the vertical or y-axis. This is where we focus on aggregate behaviors of components and subsystems, not individual threads of execution. It is also where we look for feedback loops between services, pathological behaviors and cascading chains of events.
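A y-axis signal, by contrast, looks across services rather than along one request. The sketch below uses hypothetical per-minute error rates and flags services whose recent error rate jumps well above their own baseline; several services flagged at once hints at a correlated, environmental failure rather than an isolated defect. Function name, window and factor are assumptions:

```python
from statistics import mean

# Hypothetical per-minute error rates per service. The y-axis question
# is not "which request failed?" but "are services degrading together?"
error_rates = {
    "checkout": [0.01, 0.01, 0.02, 0.15, 0.40],
    "billing":  [0.02, 0.01, 0.01, 0.20, 0.55],
    "search":   [0.01, 0.02, 0.01, 0.01, 0.02],
}

def degrading_together(series_by_service, window=2, factor=5.0):
    """Flag services whose recent error rate (last `window` samples)
    exceeds `factor` times their own earlier baseline. Several flags
    at once suggest an aggregate, landscape-level behavior."""
    flagged = []
    for service, series in series_by_service.items():
        baseline = mean(series[:-window]) or 1e-9  # guard against zero
        recent = mean(series[-window:])
        if recent > factor * baseline:
            flagged.append(service)
    return flagged

print(degrading_together(error_rates))   # -> ['checkout', 'billing']
```

Here “checkout” and “billing” degrade in lockstep while “search” stays healthy, which is the kind of cross-service pattern no single trace can reveal.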
Complex and dynamic environments no longer fail because of code. They fail because of interactions.
Failure in complex and dynamic service landscapes no longer occurs primarily because of defects in code. Service landscapes fail predominantly because of unpredictable interaction behaviors between systems. In other words, it is the environmental factors (how all components behave together) that matter, not the individual thread of execution.
This is not to say that thread-of-execution concerns don’t matter. Just as an orchestral score consists of many parts, a service landscape consists of many transactions. And just as an orchestra relies on a conductor to ensure all parts work together in concert, a service landscape requires attention to how all its transactions execute together. It requires “y-axis” observability.
Managing aggregate, “y-axis” behaviors is fundamentally an operational concern. Unfortunately, because the need for “y-axis” tooling typically isn’t felt until a certain amount of architectural complexity is reached, there is little tooling available to aid in detecting and responding to disruptive aggregate behaviors.
Simple or static environments do best with traditional, developer-focused “x-axis” observability as such environments exhibit few if any unpredictable environmental behaviors. As architectural complexity and rate of change increase, however, failures occur increasingly due to environmental factors. Such complex and dynamic environments need operations-focused “y-axis” observability and, crucially, corresponding “y-axis” control.
It is important to point out that traditional “x-axis” observability continues to be relevant in service landscapes at the local level of individual development teams. Insofar as the services an individual team is responsible for represent a unit of development, complete with shared blueprint and code repository, traditional developer observability such as tracing, logs and service-specific performance metrics does matter. Beyond that horizon of responsibility, however, “y-axis” observability is what matters.
The discussion around monitoring and observability too often centers on definitions and on the desire to extract ever more, ever “deeper” data from our systems. Advice on what to focus on when collecting data is much harder to come by.
There are two fundamentally different classes of data to collect: horizontal, “x-axis” data about transactions and threads of execution, which supports developers in runtime debugging, and vertical, “y-axis” data about aggregate service behaviors, which supports operators in detecting and controlling those behaviors. This article argues that which axis to focus on depends crucially on how complex and dynamic an environment is. Simple and static environments are best served by traditional, developer-oriented “x-axis” observability, while complex and dynamic environments require modern, operations-focused “y-axis” observability.