Today’s dynamic languages and powerful frameworks let developers write business logic faster than ever before. Combined with a “microservices-first” strategy, this means code can find its way into production quickly. These new services are often deployed “on top of and next to” existing applications and services, eliminating the need to reinvent existing logic. After all, why rebuild functionality when you can simply call upon it?
This approach to application architecture inevitably leads to a landscape of federated services that keeps on evolving organically. And while it affords a great deal of agility to the business, it can also prove to be very challenging to operate in a stable and secure manner. This is because, with more services coming online all the time, change in the environment is constant, with a corresponding explosion in the number of service-to-service interactions. Each of these new service interactions creates new potential for failures, security breaches or performance bottlenecks.
As a result, operations, SRE and security teams can quickly become overwhelmed and lose control over the service landscape. This is an inherent problem that developers who successfully take advantage of microservices end up pushing onto the operations team. When change is constant and interactions become unpredictable, there is a greater chance of running into previously innocuous or even unknown operating limits. Architecting for resilience and coding with compensation strategies helps avoid some of these limits at the local level, but can’t prevent them from occurring at the global level.
In this post, we’ll explore the hallmark characteristics of such global, complex emergent behaviors in organic architectures, discuss why it is essential to be able to detect and control them, and how a cloud traffic controller can help operations, SRE and security teams to do so.
Emergent behaviors manifest themselves as dynamic and large-scale communication pathologies such as unpredictable slowness, intermittent burstiness or load distributions that are out of balance.
Emergent behaviors are rooted in the dynamically changing interactions among shared components of an organic architecture. When architectures grow organically, changing load patterns can bring corresponding services out of alignment. This has the potential to push some participating services against subtle operating limits such as maximum connection pool sizes or memory thresholds. These pathologies in turn affect connected services in various, often seemingly unrelated ways, ultimately causing grey and real failures to ripple through the architecture in random, non-linear ways.
For example, assume a storage cluster serving a Kubernetes environment experiences intermittent availability failures. These failures in turn cause a large number of dedicated logging containers to consume enough aggregate CPU to cause Kubernetes to mark the affected nodes as unhealthy and move their associated workloads to other nodes, only to move them again as soon as the phenomenon repeats itself there. This scenario results in large-scale thrashing as Kubernetes “flip-flops” random nodes between healthy and unhealthy states. One potential solution to remediate this behavior requires SRE teams to detect the intermittent availability of the storage cluster and apply a temporary timeout or circuit breaker policy against the logging sidecars while the development team fixes their CPU behavior.
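To make the remediation concrete, below is a minimal sketch of the kind of circuit-breaker policy that could be interposed in front of the logging sidecars while the CPU behavior is fixed. The class, thresholds and failure-counting scheme are illustrative assumptions, not Glasnostic’s implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures and rejects calls until `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # fail fast instead of letting callers pile up on a
                # dependency that is known to be intermittently unavailable
                raise RuntimeError("circuit open: call rejected")
            # cool-down elapsed: allow one trial call through ("half-open")
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

The point of the sketch is the shape of the policy, not the numbers: it converts an intermittent dependency failure into a fast, bounded failure instead of letting retries and queued work consume CPU.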
In another example, a service update may have swapped the order of two dependency calls for performance reasons and inadvertently triggered eviction pathologies in intervening cache layers, causing large-scale contention. To remediate this behavior, the change in fan-out characteristics needs to be discovered rapidly, e.g. by observing bandwidth spikes on the affected link, and backpressure needs to be exerted against the issuing service until the development team can resolve the issue.
These behaviors happen because every part of a service architecture has a large number of inherent operating limits. These limits are rarely encountered in traditional, mature application architectures simply because trial and error have caused their components to settle and align themselves over time. And since mature applications rarely evolve, if at all, the underlying components rarely get out of alignment. But because organic architectures evolve continually through organic federated growth, change, not stability, is what is constant. With shifting load patterns becoming the “new normal,” connected services getting out of alignment will happen all the time.
Everything is finite, in particular if regarded as otherwise.
Emergent behaviors exhibit a number of characteristics that make them difficult to detect. Below are some of the tell-tale signs to watch out for.
Emergent behaviors are large-scale in nature and therefore easy to mistake for “normal” occurrences in their early stages. When a whole bank of service instances behaves in essentially the same way, it is easy to dismiss the pattern as baseline behavior rather than recognize it as a shared pathology.
Emergent behaviors are unhealthy, pathological behaviors that cause serious ill-effects such as slowness or burstiness. Even if they don’t bring down parts of the architecture, emergent behaviors will still compound to degrade the service architecture, thus elevating the risk of subsequent emergencies.
If and when an operating limit is hit in a hyperconnected service landscape, how the pathology will ripple through the architecture is difficult to predict and expensive to model. As a result, emergent behaviors are often and for all practical purposes unpredictable.
Because the exact chain of pathologies depends on a multitude of factors, many of which are rooted outside any single thread of execution, such as the noisiness of neighboring services, the state of instance replication or, very directly, the load imposed by other applications on shared services, emergent behaviors tend to be complex and non-linear.
Emergent behaviors are grounded in the interaction between systems and triggered by the changes in the characteristics of interactions. As such, they are an intrinsic, inescapable characteristic of an evolving, organic architecture. Any sufficiently dynamic architecture will trigger operating limits in its constituent services and will thus be susceptible to complex emergent behaviors.
Complex emergent behaviors are not limited only to microservices. They can occur in any compound architecture that exhibits a sufficient amount of dynamism, be it environments consisting of multiple applications with shared services or a managed services cloud that is based on dynamically shared infrastructure.
Given the potential for damage that complex emergent behaviors have, it is of critical importance that operations, SRE and security teams are able to detect and remediate them quickly, with the ultimate goal of controlling them moving forward. Complex emergent behaviors can bring down or degrade the availability of any dynamic architecture. Their large-scale and unpredictable nature, coupled with the not-always-obvious, non-linear relationship between observed symptoms, makes them fiendishly difficult to detect. In addition, engineers often default to a “depth-first” mode of investigative exploration that causes them to miss the “forest for the trees,” in particular if the behavior under exploration involves less dramatic grey failures. As a result, complex emergent behaviors often go misdiagnosed or even undiagnosed.
To detect complex emergent behaviors, it is crucial that operators, SREs and security engineers resist the temptation to go down the rabbit hole of logs and monitoring data, and instead focus on service interaction patterns, which form the basis of these behaviors. And, while looking at these service interaction patterns, engineers need to focus on a small set of uniformly measurable, “golden signals” that can serve as the vital stats indicating the health of an interaction.
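As a sketch of what such golden signals might look like, the snippet below summarizes a window of service-to-service calls into four interaction metrics: request rate, latency, concurrency and bandwidth. The `Call` record and the exact signal definitions are illustrative assumptions, not a description of any particular product’s telemetry:

```python
from dataclasses import dataclass

@dataclass
class Call:
    start: float      # call start time, in seconds
    duration: float   # call duration, in seconds
    bytes_sent: int   # payload size of the call

def golden_signals(calls, window):
    """Summarize a window of service-to-service calls into four
    uniformly measurable interaction signals. `window` is the length
    of the observation window in seconds."""
    n = len(calls)
    requests = n / window                                   # calls per second
    latency = sum(c.duration for c in calls) / n if n else 0.0
    # average number of in-flight calls, via Little's law (L = lambda * W)
    concurrency = sum(c.duration for c in calls) / window
    bandwidth = sum(c.bytes_sent for c in calls) / window   # bytes per second
    return {"requests": requests, "latency": latency,
            "concurrency": concurrency, "bandwidth": bandwidth}
```

The value of signals like these is that they apply uniformly to every interaction, regardless of the technology stack behind either endpoint, which is what makes pattern-level detection possible in the first place.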
Just as important as being able to detect complex emergent behaviors is the ability for operations, SRE and security teams to control them once they are diagnosed. While some engineering teams will view such complex emergent behaviors as something that should be guarded against in code, this is another rabbit hole to be wary of.
For one, because patches are often created and applied under pressure, the risk of a patch causing another, subsequent complex emergent behavior is extraordinarily high. For example, patching a service to time out calls to a dependency deemed unreliable may mitigate that particular vulnerability, but at the expense of adding a new operating limit to the service.
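The timeout trade-off described above can be seen in a few lines. This is a hypothetical wrapper, not any team’s actual patch; note how the timeout value itself becomes a new operating limit, so a dependency that is merely slow now fails outright:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def call_with_timeout(fn, timeout):
    """Wrap a dependency call with a hard timeout.

    Trade-off: the `timeout` value is itself a new operating limit.
    Under load, a dependency that would eventually have answered is
    now treated as failed, which can shift the pathology elsewhere."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            raise RuntimeError("dependency timed out")
```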
More importantly, introducing patches is neither a quick, nor a general solution. In many cases, large parts affected by the complex emergent behavior are not under the immediate control of the teams. These parts may include external dependencies, managed services, applications in other business units or at partner sites. In other words, operators, SREs and security engineers need to be able to control what they don’t own. This means they need to be able to control the interactions that are the basis for complex emergent behaviors directly.
Fortunately, they can do this with the help of a cloud traffic controller such as Glasnostic.
Glasnostic is a control plane for federated, organically evolving service landscapes that lets operations, SRE and security teams control the complex interactions and emergent behaviors among microservice and service architectures at scale. By gaining control over service interactions, teams can control emergent behaviors, prevent cascading failures and avert security breaches.
For instance, consider the following interaction pattern between a set of logging services and log data stores (figure 2): logging requests arrive in high volume (about 1M per second) but generally complete reasonably quickly, in about 100 ms. However, a sudden request spike causes latency to go down instead of up, even though both concurrency and bandwidth remain positively correlated. At the same time, two of the five data store instances are woefully underutilized and have been so for the past 15 minutes.
Although neither the full extent nor the “root cause” of the behavior is understood at this time, the SRE team decides to guard against future negative latency correlations by exerting backpressure against the largest spikes—those of 2 M and more requests per second. The policy is applied within seconds and without complex YAML configuration, all while maintaining a full audit trail of what was done.
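Glasnostic applies such policies through its own control plane, but the underlying mechanism of shedding traffic above a rate threshold can be sketched as a token bucket. The 2M-per-second figure comes from the scenario above; the class and its parameters are illustrative assumptions:

```python
class TokenBucket:
    """Token-bucket rate limiter: admits up to `rate` requests per
    second with bursts up to `burst`; excess requests are shed,
    exerting backpressure against the spiking senders."""

    def __init__(self, rate, burst):
        self.rate = rate       # sustained admission rate, requests/second
        self.burst = burst     # maximum burst size, in requests
        self.tokens = burst
        self.last = 0.0        # timestamp of the last admission check

    def allow(self, now):
        # refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. shed everything above the 2M requests/second spike threshold
spike_guard = TokenBucket(rate=2_000_000, burst=2_000_000)
```

A limiter like this is deliberately blunt: it does not fix the underlying data store imbalance, it merely caps the interaction so the investigation can proceed without the spikes compounding.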
At this point, the SRE team can continue to follow the behavior in a number of directions, layering additional policies as needed, in any order and for any period of time, as they see fit. For instance, they may want to “zoom into” the interaction pattern to determine whether excessive client stickiness may be responsible for the data store imbalance or they may want to “turn around” and examine where the spikes might be coming from and add additional backpressure against those consumers.
They are able to do this because Glasnostic allows them to detect and remediate emergent behaviors based on “golden” interaction signals, in real time and independent of any underlying technology stack.
Emergent behaviors are dynamic and large-scale communication pathologies that are rooted in the dynamically changing interactions among shared components of hyperconnected, organically evolving architectures. Because these behaviors are complex and non-linear in nature, they are difficult to predict. They are pathological interaction patterns with the potential to do significant harm to an architecture.
Emergent behaviors are intrinsic to organically evolving, hyperconnected architectures and need to be detected and controlled in real time if the business is to realize the promise of such architectures.