This post is the second of a two-part series on the benefits of service meshes. In part one we covered what a service mesh is, how it works and the benefits it provides. In this post we’ll explore why, where and when to use one and what lies beyond.
Interest in service meshes is growing as more enterprises embark on the journey towards deploying microservices at scale, and so the question on many people’s minds is, “Should I use one?”
At Glasnostic, our answer is nuanced: service meshes solve real problems at the highly elastic, fast-moving and containerized end of the architecture spectrum that will only grow in importance as that end’s footprint increases in the enterprise. They are less valuable in environments that are characterized by a significant amount of heterogeneity, be it structural, functional or technological in nature. They are also less well suited for specialist environments such as high-performance applications. But more importantly, we believe service meshes will ultimately become a transparent feature of container schedulers such as Kubernetes, making the significant learning curve involved in deploying and mastering the technology today a likely unwise investment.
Astute operations teams can sidestep the complexity of a service mesh, and instead focus on how to enable innovation by composing applications in an agile manner. Since an innovative business will require increasingly connected applications, operations teams should focus on how to run an organically growing federation of applications and how to control the complex emergent behaviors that such application landscapes exhibit.
Components in microservice-based applications can be coupled quite tightly, with many diverse and frequent calls among them. They are also often smaller, stateless workloads that are designed to scale elastically and, on an orchestrator, tend to move around continually. In such applications, service discovery, routing and load balancing quickly turn into non-trivial challenges as service locations, instance utilization, proximity and status information are in constant flux.
These challenges call for the “rules of the network” to be redefined. Instead of resolving service names to network endpoints and then calling them directly, clients should be able to simply call a service by its logical name and leave it to the network to figure out which instance to route the call to, based on proximity, status or capacity criteria. Like objects in a monolith, service instances should be managed by a service linker that ensures that calls are always routed to the optimal instance, no matter where it may be running at the moment. For microservice-based applications with frequent calls between mostly stateless and elastic services, this “SDN for code” role is the fundamental benefit of using a service mesh.
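To make the idea of calling a service by its logical name concrete, here is a minimal sketch of the selection step such a "service linker" might perform. It is not how any particular service mesh implements routing: the `Instance` fields, the `pick_instance` function, the registry contents and the "same zone first, then least busy" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One running copy of a service (all fields are hypothetical)."""
    address: str
    zone: str        # stands in for "proximity"
    healthy: bool    # stands in for "status"
    in_flight: int   # stands in for "capacity": requests currently being handled


def pick_instance(registry, service, caller_zone):
    """Resolve a logical service name to a single instance.

    Unhealthy instances are skipped. Among the rest, instances in the
    caller's zone win, and ties are broken by the fewest in-flight requests.
    """
    candidates = [i for i in registry.get(service, []) if i.healthy]
    if not candidates:
        raise LookupError(f"no healthy instance for {service!r}")
    # False sorts before True, so same-zone instances come first.
    return min(candidates, key=lambda i: (i.zone != caller_zone, i.in_flight))


# Hypothetical registry; in practice this data is in constant flux.
registry = {
    "orders": [
        Instance("10.0.1.5:8080", "zone-a", True, 12),
        Instance("10.0.1.6:8080", "zone-a", False, 0),
        Instance("10.0.2.7:8080", "zone-b", True, 3),
        Instance("10.0.1.8:8080", "zone-a", True, 4),
    ]
}

chosen = pick_instance(registry, "orders", caller_zone="zone-a")
print(chosen.address)
```

The caller only ever names `"orders"`. Which address it ends up talking to is decided per call, which is what lets instances come and go without the client noticing.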
Naturally, once such a layer is in place, it might as well take on additional cross-cutting development concerns such as monitoring, tracing, logging or encryption and this is what service meshes do. The problem with broadening a feature set like this is, of course, that it limits the types of applications that are able to extract the full benefit from it. Let’s look at what characteristics applications need to have to benefit the most from service mesh.
The primary value of service meshes is to establish service-to-service connections reliably and make them resilient in the face of constant change. This linking together of services appeals above all to containerized microservice applications because their lightweight, ephemeral and closely interacting services make it difficult to maintain reliable service-to-service connections at scale.
But microservice-based applications also tend to be self-contained. More often than not, microservices are implemented as custom components for a specific application, not as general business capabilities. In fact, their various and frequent interactions are a direct result of this kinship and co-dependency. This is the reason why so many microservice-based applications resemble more of a distributed application than a service architecture. As a result, self-contained microservice applications have a strong affinity for secondary service mesh functions such as comprehensive logging and mutual authentication.
Correlated with this characteristic of being self-contained is the inclination of microservice-based applications to use complex and transactional logic. Unlike federated service architectures, which build new functionality on top of existing services, microservice-based applications are often designed to function as a whole. It is this undivided ownership over the entire blueprint that tempts developers to resort to complex, multi-step logic, which in turn makes observability features such as distributed monitoring and tracing all the more appealing.
While microservice-based, self-contained applications with somewhat complex logic will likely benefit from a service mesh, these traits are not the only ones that matter. Because the functionality that service meshes encapsulate must be injected in a way that can require significant tooling and automation, they are best used in environments with a substantial degree of uniformity. Such uniformity may be the result of a common deployment model such as containers running on Kubernetes, a common framework such as Spring Cloud, a common language such as Scala, or any combination thereof.
Service meshes also require a fair amount of Layer 7 uniformity. For tracing, intelligent routing and health checks to be most beneficial, it is best if service endpoints follow a similar style. That is, endpoints should have similar weight and granularity, and be similarly multi-tenant. If endpoints are too diverse in these regards, generated metrics are less comparable across the mesh, observability suffers and control functions cannot be applied universally. Uniformity is also required because service meshes thrive on network effects. That is, their value decreases rapidly if not all services of an application participate in it.
In summary, the application environments that benefit most from using a service mesh are self-contained microservice-based applications with complex logic and substantial uniformity of deployment, language, framework and API design. For two reasons, this means that applications that stand to benefit most from service meshes today are likely to be relatively small. On the one hand, this is because very few, if any, larger applications have been written in this style to date. On the other hand, because large environments haven’t been tested extensively yet, the ramifications of using service meshes in complex environments are still unclear.
These are the reasons why, in practice, the sweet spot for service meshes today is a self-contained microservice-based application running on Kubernetes.
By contrast, service meshes are not well suited to most other environments. These include:
- High-performance environments such as those involving scientific, financial or carrier-grade applications,
- Functionally diverse environments combining, e.g. realtime requests, transaction processing, data analytics and streaming services,
- Structurally diverse environments like those serving a multitude of stakeholders and
- Technologically diverse environments that mix different generations or owners of systems.
Even though service meshes are meant to integrate cleanly with existing environments, particularly with microservice-based applications on Kubernetes, applications may in practice require significant changes. For instance, applications that already encrypt conversations will need to remove their encryption to use service mesh authentication. Likewise, subtle assumptions regarding service states such as session affinity are bound to collide with the way service meshes expect statelessness. Finally, applications outside Kubernetes will require a significant investment in deployment tooling and automation.
As a result, it is best to adopt a service mesh for new (“greenfield”) applications that are designed to run on Kubernetes.
Another good opportunity to adopt a service mesh is when applications are scheduled to receive a significant overhaul, for instance when a VM-based application is migrated to Kubernetes or when an already containerized application is refactored for a major version upgrade.
At all other times, retrofitting a service mesh is likely too disruptive and costly to warrant the effort. Incremental retrofits are particularly dangerous due to their inherent complexity and thus near-infinite potential for friction.
Even though service meshes are a new, exciting and at times even necessary technology, there are a number of reasons why you should not adopt one.
Service meshes may be too opinionated for your target environment. If you have a diverse technology stack or otherwise cannot benefit from all features a service mesh has to offer because you want to keep your options open, adopting one may simply not be worth it.
Similarly, if controlling how applications and services talk to each other is strategically important to your organization, using an existing service mesh makes little sense. Adopting a service mesh lets you benefit from a rising tide, but doesn’t allow you to control your destiny.
Adopting a service mesh to trace requests across services is not always as valuable as it first appears. If your target environment combines applications and services from different owners, for instance, teams typically lack the context needed to interpret traces across such ownership boundaries.
Finally, service meshes address tactical developer concerns around service-to-service connections. They do not help control the complex emergent behaviors that organically growing architectures exhibit.
When enterprises set out to achieve large-scale agility at the product level, they organize around parallel, independent teams, each of which owns a separate set of services, potentially running on separate infrastructure, with separate deployment pipelines and separate lifecycles. This style of producing software leads to an organically growing and continually evolving landscape of federated services that, together with existing applications, form a fluid, organic architecture.
Organic architectures are service topologies that compose applications functionally to create new products and services in a fluid, evolutionary way. They are how systems are built in the presence of change.
Value in organic architectures is not created by modifying existing applications, but instead by reusing them as existing business capabilities and then building next to or on top of them. As a result, services become increasingly intertwined and exhibit continually changing traffic patterns.
Organic architectures optimize for product agility, not raw performance. Microservice-based applications are typically based on a static architectural blueprint and implement transactional logic using rigid, multi-tier call trees that require a high degree of reliability and consistency, and are sensitive to latency. Meanwhile, an organic architecture is a decentralized federation of applications that execute in parallel. It takes a relaxed, eventually consistent approach to processing that makes use of compensation logic and event propagation. This results in shallow call trees with moderate demands for reliability and low sensitivity to latency. It is this focus on lightweight composition of parallel executing capabilities that makes an organic architecture fundamentally robust.
Because organic architectures are built on composition and the reuse of existing applications, they don’t lend themselves to tracing and observability. The individual components of a microservice-based application are all part of the same blueprint, which makes tracing requests through the application desirable and observability across components valuable. Services in an organic architecture, on the other hand, are opaque business capabilities that are neither designed to be part of a specific blueprint nor are they owned by their consumers. They are black boxes that make distributed tracing and comprehensive observability pointless.
Organic architectures and microservice-based applications tend to also differ in how they relate to and integrate with security solutions. Because microservice-based applications are held together by their blueprint and likely deployed uniformly, for instance as containers on Kubernetes, they lend themselves to a uniform security model that governs each component equally. This uniform security model will most likely be provided by a service mesh. An organic architecture, on the other hand, is about the fluid composition of existing applications and services, some or all of which may have different security, compliance and governance requirements. It therefore does not force an opinion on the security design and is fully interoperable with existing encryption and authentication implementations as well as tools such as firewalls and DoS detectors. This interoperability is of great benefit to organizations whose environments have diverse security requirements.
In summary, because microservice-based applications tend to be built on complex logic that is monolithic in nature, yet distributed across several application components, they face the non-trivial challenge of executing this logic across the network reliably. Service meshes help solve this challenge to some extent by addressing the tactical concerns of making service-to-service connections reliable, resilient and secure. This is especially true for applications that are built on Kubernetes. In contrast, the strategic challenge of managing an organic architecture is that, when running an ever-evolving landscape of federated services, it requires the ability to quickly discover, and then immediately control the complex emergent behaviors that an agile composition of services invariably brings about.
Imagine you discover that responses from your storage cluster suddenly take 10 ms longer than usual. Worrying about further degradation of availability, you decide to shed some load, starting with a few of the less critical analytics batch processes. This returns latencies closer to their long-running normal. Investigating further, you then discover that a significant part of the heightened load comes from the instances of a single service. This leads you to view the incident not as a performance problem of the storage cluster, but instead as a denial of service attack playing out against it. As a result, you rate limit traffic from the attacking service and restore the analytics batch processes before proceeding to investigate the reason for the attack.
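The "rate limit traffic from the attacking service" step in this scenario can be sketched as a per-source token bucket. This is an illustrative sketch only, not a description of how Glasnostic or any service mesh applies limits: the `TokenBucket` and `SourceLimiter` classes, the service names and the rate and burst values are all hypothetical. Time is passed in explicitly so that the behavior is easy to follow.

```python
class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now):
        # Refill in proportion to the time elapsed, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class SourceLimiter:
    """Applies a limit only to sources that were explicitly singled out."""

    def __init__(self):
        self.buckets = {}

    def limit(self, source, rate, burst, now=0.0):
        self.buckets[source] = TokenBucket(rate, burst, now)

    def allow(self, source, now):
        bucket = self.buckets.get(source)
        # Sources without a limit pass through untouched.
        return True if bucket is None else bucket.allow(now)


limiter = SourceLimiter()
# Hypothetical offending service: 2 requests per second, burst of 2.
limiter.limit("report-service", rate=2.0, burst=2, now=0.0)

# Four requests arriving at the same instant: the burst passes, the rest are rejected.
decisions = [limiter.allow("report-service", now=0.0) for _ in range(4)]
print(decisions)
```

Because the limit is keyed by source, the analytics batch processes that were shed earlier can be restored without being throttled themselves.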
Scenarios like these illustrate how vital it is not to lose sight of the “big picture” when running an organic architecture. This “big picture” is a matter of seeing communication behaviors. Operators need to be able to quickly detect, and react to, the large-scale behaviors that emerge from the interactions of their services. In particular, operators need to stop chasing down mythical root causes. Root causes in organic architectures are more often than not a confluence of many factors that are not only too difficult and expensive to track down but also unlikely to ever occur again in the same form. Operators need to detect and react first before turning to prevention.
In another example, as an operator, you may want to insert a bulkhead between availability zones to guard against mishaps in one zone affecting another. You may also want to exert backpressure against bursty request patterns to level out traffic to protect your infrastructure. Alternatively, you may want to quarantine new workloads with unclear characteristics until you are confident they won’t wreak havoc in your architecture. Similarly, you may want to circuit-break temporarily, change the way you segment your network or establish quality-of-service rules, and so forth. All of these show how essential it is to be able to control communication between arbitrary sets of endpoints.
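Of the controls listed above, temporary circuit breaking is perhaps the easiest to sketch in a few lines. The following is a deliberately simplified illustration under stated assumptions: the `CircuitBreaker` class, its threshold and its timings are hypothetical, and a real implementation would typically keep one breaker per route between sets of endpoints and restrict how many trial calls pass while recovering.

```python
class CircuitBreaker:
    """Stops calls after repeated failures, then retries after a pause."""

    def __init__(self, failure_threshold, reset_after):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before trial calls
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def allow(self, now):
        if self.opened_at is None:
            return True
        # Once the pause has elapsed, let trial calls through (simplified half-open).
        return now - self.opened_at >= self.reset_after

    def record(self, success, now):
        if success:
            # Any success closes the circuit and clears the failure count.
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open the circuit, or re-open it if a trial call failed.
            self.opened_at = now


# Hypothetical policy: open after 3 consecutive failures, retry after 30 seconds.
breaker = CircuitBreaker(failure_threshold=3, reset_after=30.0)
for t in (0.0, 1.0, 2.0):
    breaker.record(success=False, now=t)

print(breaker.allow(now=3.0))   # inside the pause window
print(breaker.allow(now=32.0))  # pause elapsed, trial call permitted
```

A bulkhead or a quarantine could be modeled the same way: a small piece of state attached to a route, consulted before each call is let through.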
Complex behaviors emerge in an organic architecture because there is no single architectural blueprint that covers the entire landscape. An organic architecture consists of numerous applications and services that are built independently and come with their own, independent deployment schedule. While this affords unparalleled speed to market, it also makes load inherently unpredictable, which in turn makes forecasting and capacity planning all but impossible.
As a result, organic architectures need more than the “below-ground,” tactical upgrade to point-to-point connections that service meshes represent. They need a control plane for the “above-ground,” large-scale, dynamic and complex emergent behaviors. This control plane needs to be able to cover the entire service landscape, provide crucial, high-level visibility into systemic issues, and offer effective and immediate control.
Glasnostic is a control plane for organic architectures. After seeing first hand how enterprises struggled with their increasingly federated organic growth, we created a solution that enabled them to allow this growth to evolve organically. By maintaining control over the complex emergent behaviors that their connected applications exhibit, enterprises can grow and evolve their product portfolio in a more rapid and agile manner.
At a more technical level, Glasnostic is a service traffic controller that inserts cleanly into the network data plane without affecting developers, processes or stacks. It uses no agents, sidecars or similar voodoo. It plays nice, works with every platform, orchestrator, service mesh or technology stack, and runs everywhere.
Service meshes are an opinionated and invasive, hence powerful but also expensive, solution for making service-to-service connections more reliable, resilient and secure. This, coupled with their related benefits of comprehensive logging, monitoring and authentication, makes them particularly applicable to microservice-based applications consisting of lightweight, ephemeral and closely interacting services that run on an orchestrator such as Kubernetes. However, the larger, more strategic problem of digital innovation, namely the operational challenge of running an organic architecture successfully, is not solved by tactically addressing local development concerns. An organic architecture delivers the key benefit of dramatically lowered time to market but also exhibits large-scale, dynamic and complex behaviors that the volatile interaction patterns between applications create. To run an organic architecture successfully, these behaviors need to be controlled. Glasnostic provides the perfect solution to this problem.