Welcome to part 2 of our blog series on the benefits and operational limitations of service meshes. In part 1, we saw how developers can benefit from a service mesh’s ability to provide added observability, traffic control and security capabilities to microservices architectures. In this post, we are going to look at these same three dimensions, but instead of focusing on developer concerns, we are going to dive in and explore a service mesh’s limitations from the perspective of the operations team.
"Routing traffic through a series of proxies can get painfully slow as the mesh grows and routing tables balloon in size."— Ryland Degnan (@rjdegnan) April 24, 2019
Time to acknowledge #ServiceMesh for what it is? A series of proxies bound in an overly complex architecture that doesn't scale. I've seen where that goes.
Observability consistently tops the wishlist of distributed systems engineers. It is therefore no surprise that service meshes try their best to cater to this need. However, the observability that engineers desire and that service meshes provide does not aim to support traditional operations activities such as capacity planning: it focuses on the development activity of runtime debugging.
Runtime debugging, of course, requires that metrics be interpretable in the context of a request’s thread of execution. This is at odds with today’s federated, organically evolving service architectures whose metrics are increasingly unpredictable and inconclusive.
Observing a distributed system makes sense if its services make up a single application whose design remains static over an extended period of time. Such systems can be baselined and reasoned about and, as a result, metrics collected from them can be interpreted—in particular if the architecture is predominantly synchronous.
But interpretability goes away with the kind of federated, organically evolving architectures modern enterprises are running today. For one, baselines—and thus the “baseline understanding” of practitioners—have become obsolete in a world of organically evolving architectures. And without baselines, the conclusive interpretation of metrics can prove challenging. Also, the federation of services reduces the area of responsibility of individual development teams. Federated, organically evolving service landscapes are created by parallel, self-managing teams that develop in rapid decision and learning cycles. In other words, development teams are only responsible for a handful of services: they have a limited horizon of control. Because there is no point tracing into a dependency that teams don’t own, observability only makes sense within a development team’s horizon of control. The only global horizon of control in federated, organically evolving architectures is that of the operations team that is responsible for the entire service landscape, not just a set of related services or an application—in other words, the mission control operations team.
Observability also becomes data-heavy at scale. As federated, organically evolving architectures grow, the volume of data collected in telemetry and traces grows exponentially while the importance of individual service instances declines. In other words, observability with the goal of runtime debugging causes practitioners to collect more and more data that is less and less important. As a result, hardly any metric collected is actionable.
As these architectures grow, observability needs to “move up in the stack.” Instead of collecting pet metrics developers can understand, operators need to focus on higher-level KPIs that allow them to detect and react in real-time. These KPIs need to be meaningful globally. This is where the observability provided by service meshes falls short as well. Due to their opinionated nature, service meshes tend to be deployed insularly in the enterprise, typically in environments that run on Kubernetes. Operational observability, on the other hand, requires high-level, golden signal metrics that work across bare metal, virtual machine and container deployments and across multiple regions and clouds.
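To make the idea of “moving up in the stack” a little more concrete, here is a minimal Python sketch that rolls raw request samples up into a handful of coarse, per-environment signals (request rate, error ratio and 95th-percentile latency) instead of keeping per-request detail. The `Sample` record, its `environment` label and the `golden_signals` function are hypothetical names chosen for illustration; they are not part of any service mesh API. Saturation, which is commonly counted among the golden signals, is left out for brevity.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One observed request. All field names are illustrative assumptions."""
    environment: str   # e.g. "k8s", "vm" or "bare-metal"
    latency_ms: float
    is_error: bool


def golden_signals(samples, window_seconds):
    """Aggregate samples into per-environment rate, error ratio and p95 latency."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample.environment].append(sample)

    signals = {}
    for environment, group in grouped.items():
        latencies = sorted(s.latency_ms for s in group)
        # Nearest-rank percentile: the smallest value with at least 95% of
        # the samples at or below it.
        rank = max(1, math.ceil(0.95 * len(latencies)))
        signals[environment] = {
            "requests_per_second": len(group) / window_seconds,
            "error_ratio": sum(s.is_error for s in group) / len(group),
            "p95_latency_ms": latencies[rank - 1],
        }
    return signals
```

Because the roll-up depends only on a label and two measurements per request, the same function could in principle be fed from Kubernetes, virtual machine or bare metal sources alike, which is the property that operational observability calls for.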
In summary, service meshes provide observability for runtime debugging. This is valuable within the developer’s horizon of control but requires metrics that can be interpreted within the context of a request’s thread of execution. However, in today’s federated, organically evolving service landscapes, the lack of baseline metrics and a reduced horizon of control spoil such interpretability.
Observability for runtime debugging is also data-heavy, leading to the collection of ever more data at an ever higher cost, yet ever lower value. To escape this downward value spiral, observability needs to “move up the stack,” collecting higher-level, global golden signals to enable mission control operations teams to detect and react in real-time. The observability provided by service meshes is unsuitable for this goal not just because it aims to support runtime debugging, but also because golden signals need to be global and service meshes are too opinionated and invasive to be deployed everywhere.
Service meshes evolved as a solution to the problem of how to route service calls to the best target instance, i.e. the instance that can serve the request fastest. This is why service meshes are developer- or “routing-oriented”: they serve the perspective of the developer, who is looking to call a service without having to deal with the intricacies of remote service calls. Because of this, service meshes prove to be not well-suited for managing workloads in an architecture that involves dozens, if not hundreds of microservices which communicate with each other across development teams, business units and even corporate firewalls, i.e. federated service architectures with shifting service-to-service interactions and dependencies that evolve organically over time.
For instance, while it is relatively straightforward to express a forward routing policy with a service mesh, expressing policies that control the flow of traffic backwards, against downstream clients to e.g. exert backpressure or implement bulkheads, is much harder, if not impossible to achieve. While it is in theory possible for a service mesh data plane to make traffic decisions based on both source and destination rules, the developer orientation of control planes such as Istio keeps them from providing traffic control over arbitrary sets of service interactions.
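For readers less familiar with these patterns, the following is a minimal Python sketch of a bulkhead that exerts backpressure per calling source: each source gets a fixed number of concurrent slots, and calls beyond that limit are rejected immediately rather than queued. The `SourceBulkhead` class, the `Backpressure` exception and the notion of a string `source` identifier are assumptions made for illustration. This is an in-process toy, not how any service mesh or network-level controller implements the pattern.

```python
import threading
from collections import defaultdict


class Backpressure(Exception):
    """Signals to a caller that it must slow down or shed load."""


class SourceBulkhead:
    """Caps the number of in-flight calls per calling source (hypothetical API)."""

    def __init__(self, max_concurrent_per_source):
        self._limit = max_concurrent_per_source
        self._in_flight = defaultdict(int)
        self._lock = threading.Lock()

    def call(self, source, fn, *args, **kwargs):
        # Reserve a slot for this source, or push back right away.
        with self._lock:
            if self._in_flight[source] >= self._limit:
                raise Backpressure(f"{source} exceeded {self._limit} concurrent calls")
            self._in_flight[source] += 1
        try:
            return fn(*args, **kwargs)
        finally:
            # Always free the slot, even if fn raised.
            with self._lock:
                self._in_flight[source] -= 1
```

The point of the sketch is the direction of control: the limit is keyed on who is calling, not on where the call is routed, so one noisy client runs out of slots without starving the others.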
This lack of ability to apply policy to arbitrary sets of service interactions also makes it fiendishly hard to layer policies. For instance, when a bulkhead is in place between two availability zones, but a critical service needs to be able to fail over when necessary, it is near-impossible to figure out the correct thresholds for service mesh rules, in particular if deployments auto-scale.
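As a rough illustration of what layering would require, the sketch below models policies as selectors over sets of interactions, with an explicit priority that decides which policy governs when several match. Everything here (the `Policy` fields, the zone and service names, the priority rule and the rate limits) is a hypothetical model invented for this example; it is not the policy model of Istio or of any other product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    """A rate limit over a set of interactions. None acts as a wildcard."""
    name: str
    source_zone: Optional[str]
    dest_zone: Optional[str]
    dest_service: Optional[str]
    max_requests_per_second: float
    priority: int  # assumption: the highest-priority matching policy wins

    def matches(self, source_zone, dest_zone, dest_service):
        pairs = (
            (self.source_zone, source_zone),
            (self.dest_zone, dest_zone),
            (self.dest_service, dest_service),
        )
        return all(want is None or want == got for want, got in pairs)


def effective_policy(policies, source_zone, dest_zone, dest_service):
    """Return the governing policy for one interaction, or None if none match."""
    matching = [p for p in policies if p.matches(source_zone, dest_zone, dest_service)]
    if not matching:
        return None
    return max(matching, key=lambda p: p.priority)


# Hypothetical layers: a broad zone bulkhead plus a narrow failover exception.
zone_bulkhead = Policy("zone-bulkhead", "az-1", "az-2", None, 100.0, priority=10)
failover = Policy("checkout-failover", "az-1", "az-2", "checkout", 1000.0, priority=20)
```

With these two layers, traffic from `az-1` to the `checkout` service in `az-2` is governed by the failover exception, while all other cross-zone traffic stays behind the bulkhead. Neither policy has to be rewritten when the other changes.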
Perhaps the most significant problem service meshes present for operators, however, is their limited deployability outside of Kubernetes, a direct result of their “opinionatedness.” Modifying deployments and deployment pipelines to correctly include a data plane sidecar is often impossible, and adding a virtual machine to a service mesh is convoluted at best, yet still does not enable operators to capture inter-VM traffic. Worse, integrating existing, non-Kubernetes workloads into a Kubernetes-based service mesh not only requires operators to adapt application code, it also makes the resulting deployment dependent on the Kubernetes mesh.
Lastly, traffic control of current service mesh implementations is configured via YAML deployment descriptors. Deployment descriptors are an excellent way to store configuration in version control and thus can be used to reconstruct a well-defined initial state, but they are not very well suited for the continual, real-time changes that operations teams need to make during times of distress.
In summary, while traffic control provided by service meshes supports a number of developer-oriented control mechanisms like destination rules and virtual service definitions, it does not support non-routing-oriented operational patterns like backpressure or bulkheads. Service mesh policies are impossible to layer predictably in the face of architectural change and are very difficult to deploy outside of Kubernetes. Service mesh configuration is typically based on deployment descriptors that are bound to get in the way of operations teams when time to remediation is at a premium.
By virtue of proxying service-to-service calls, service meshes are in a great position to provide the core set of developer-oriented application security features such as authentication, authorization, accounting, secure transport and service identity. While providing these features out of the box can be a time-saver for application developers, configuring them using YAML deployment descriptors tends to be difficult and error-prone, which obviously detracts from their goals.
From an operational perspective, these service-call-based security features provide limited security at best and do nothing to mitigate the systemic security issues that operations teams care about, such as impacted availability, denial-of-service attacks, intrusions or segmentation violations.
Due to the opinionated, invasive character of service meshes, their application security features break down in heterogeneous environments that, apart from Kubernetes, also consist of bare metal, virtual machine, PaaS, plain container or serverless deployments. Similarly, service mesh security features break down in Kubernetes environments when not all services have sidecars, as is the case in “server sidecar” deployments, where only the target service has a sidecar injected for performance reasons.
The platform-oriented, opinionated approach of service meshes to application security also has the effect that most meshes don’t integrate well with other security solutions, something that operations teams deeply care about. Istio has the ability to use alternative CA plugins, and external tools could conceivably call kubectl with a YAML deployment descriptor to apply security-relevant policies, but because service meshes don’t support policy layering, it is impossible for external tools to apply such policies correctly and safely.
In summary, service meshes provide a number of application security features that are valuable for developers but contribute little to the more challenging operational security concerns. Because service meshes are opinionated platforms as opposed to being an open tool that collaborates with external security solutions, even the application security provided by them tends to break down quickly in heterogeneous environments.
For development teams building microservice applications, service meshes provide many benefits that abstract away the complexities that distributing services brings about. Some of these benefits such as encryption, “intelligent” routing and runtime observability help with operating such applications, but quickly prove to be too limited as applications grow, services become increasingly connected and the business adopts a federated, organically evolving service landscape.
Operations teams need control over more than just service-to-service calls. They need to be able to apply operational patterns to arbitrary sets of interactions. They also need to be able to layer policies so they can be applied without affecting each other. Operations teams need to be able to control their service landscape in real-time, without having to manage hundreds of YAML descriptors. To do all that, they don’t need opinionated platforms, but instead tools that integrate with existing tools and tools that apply to the entire service landscape, without affecting any deployment.
So, if service meshes are, at their core, a technology for developers creating stand-alone applications with limited complexity on top of Kubernetes, not for operations teams that are responsible for ensuring the correct operation of an entire, heterogeneous and dynamic service landscape, how can we address the necessary operational concerns?
Solution 1: Wait Until Service Meshes Support Operational Concerns. The naïve answer for those of us who see service meshes, in particular Istio, as an all-in-one solution to every distributed problem is to simply wait until service meshes support these concerns. Of course, this is unlikely to happen. Service meshes are designed around developer concerns like service linking and smarter instance routing and would have to change considerably to support operational patterns, which generally can’t be addressed by managing point-to-point connections.
Solution 2: Throw More Engineering at the Problem. The engineer’s answer would be to, well, throw more engineering at the problem. Developers could write a policy engine, glue code to integrate service mesh security with other security tools, data aggregators to collect the high-level metrics that operators need, and so forth. Obviously, this would be quite costly and more than unlikely to work satisfactorily anytime soon.
Solution 3: Adopt a Cloud Traffic Controller. The best alternative is to simply leave service meshes to the development teams and to let operations teams adopt a cloud traffic controller. That way, operations teams can detect complex emergent behaviors, remediate them in real-time and create the automations they need to effectively apply the operational patterns necessary to keep the architecture under control.
Glasnostic is such a cloud traffic controller.
Glasnostic is a control plane for service landscapes that helps operations and security teams control the complex interactions and behaviors among federated microservice applications at scale. This is in contrast to service meshes, which manage the service-to-service connections within an application. Glasnostic is an independent solution, not another platform. It requires no sidecars or agents and integrates cleanly into any existing environment.
By gaining control over service interactions, teams can control emergent behaviors, prevent cascading failures and avert security breaches.
Glasnostic was founded after learning first-hand that successful architectures are allowed to evolve organically as opposed to being rigidly designed upfront. It uses a unique network-based approach to provide operators with the observability and control they need to detect and remediate emergent behaviors in a service landscape.