This post is the first of a two-part series on the benefits of service meshes. In this post, we cover what a service mesh is, how it works and the benefits it provides. In part two we’ll explore why, where and when to use one and what lies beyond.
As we decompose applications into microservices, it becomes apparent rather quickly that calling services over the network is considerably more difficult and less reliable than anticipated. What used to “just work” now needs to be spelled out explicitly, for every client and every service. Clients need to discover service endpoints, ensure they agree on API versions, and pack and unpack messages. Clients need to monitor and manage call executions by catching errors, retrying failed calls and timing out when necessary. They may also need to ensure service identity, log calls and instrument transactions. And finally, the entire application may be required to comply with IAM, encryption or access control requirements.
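To make the burden concrete, here is a minimal sketch of the retry logic every client would otherwise have to reimplement by hand. The function name and backoff scheme are illustrative assumptions, not any particular library's API; production clients would also need timeouts, jitter and error classification.

```python
import time

def call_with_retries(call, max_attempts=3, backoff_s=0.0):
    """Invoke a remote call, retrying on connection failures --
    boilerplate each client must carry when no mesh handles it."""
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ConnectionError as err:
            last_err = err
            time.sleep(backoff_s * attempt)  # simple linear backoff
    raise last_err  # all attempts failed; surface the last error
```

Multiply this by every client, every service and every language in the stack, and the appeal of pushing it into shared infrastructure becomes clear.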
Much of this is not particularly new, of course, and technologies to help with the mechanics of message exchange, such as SOAP, Apache Thrift and gRPC, have been around for a long time. What is new, however, is the proliferation of containers and the corresponding explosion of service calls, as well as the degree of horizontal scale-out and the correspondingly transient nature of service endpoints. This new degree of complexity and volatility drives a desire to encapsulate the complexities of network communication and push them into a new network infrastructure layer. The most popular approach to providing such a layer today is known as the “service mesh.”

A service mesh is not a “mesh of services.” It is a mesh of API proxies that (micro)services can plug into to completely abstract away the network. As William Morgan put it, it is a “dedicated infrastructure layer for making service-to-service communication safe, fast, and reliable”. Service meshes are designed to solve the many challenges developers face when talking to remote endpoints. However, it should be noted that they do not solve large-scale operational issues.
In a typical service mesh, service deployments are modified to include a dedicated “sidecar” proxy. Instead of calling other services directly over the network, services call their local sidecar proxies, which in turn encapsulate the complexities of the service-to-service exchange. The interconnected set of proxies in a service mesh is referred to as its “data plane.”
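A sketch of what this means for the caller: instead of resolving a remote endpoint itself, the application addresses its local sidecar and names the logical service it wants. The port and header convention below are assumptions for illustration; real meshes pick their own ports and often redirect traffic transparently (e.g., via iptables) so the application code does not change at all.

```python
# Hypothetical local sidecar address -- an assumption for this sketch.
SIDECAR_ADDR = "127.0.0.1:15001"

def sidecar_request(service: str, path: str):
    """Build a request aimed at the local sidecar rather than the remote
    service. The Host header tells the proxy which logical service the
    caller wants; discovery, retries and TLS happen inside the sidecar."""
    url = f"http://{SIDECAR_ADDR}{path}"
    headers = {"Host": service}
    return url, headers
```

The application never learns where “reviews” actually lives; the sidecar resolves it on every call, which is what makes transient endpoints tolerable.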
In contrast, the set of APIs and tools used to control proxy behavior across the service mesh is referred to as its “control plane.” The control plane is where users specify policies and configure the data plane as a whole.
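As a rough illustration of a control-plane policy, consider a canary rule that sends 10% of traffic to a new version. The policy shape and function below are invented for this sketch, not any mesh's actual configuration format; each sidecar in the data plane would apply a rule like this on every call.

```python
import random

# Hypothetical policy pushed from the control plane to every sidecar:
# route 90% of "reviews" traffic to v1 and 10% to the v2 canary.
ROUTE_POLICY = {"reviews": [("v1", 90), ("v2", 10)]}

def pick_version(service, policy, rnd=random.random):
    """Weighted random choice of a service version, as a data-plane
    proxy might make when applying a canary routing policy."""
    weights = policy[service]
    total = sum(w for _, w in weights)
    r = rnd() * total
    for version, w in weights:
        r -= w
        if r <= 0:
            return version
    return weights[-1][0]  # guard against floating-point edge cases
```

The key point is the division of labor: operators state the intent (“10% to v2”) once in the control plane, and the data plane enforces it everywhere.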
Both a data plane and a control plane are needed to implement a service mesh.
Although the service mesh space is still evolving, most projects seem to agree on a core set of features that should be supported:
- Service discovery: registry and discovery of services
- Routing: intelligent load balancing and network routing, better health checks, automatic deployment patterns such as blue-green or canary deployments
- Resilience: retries, timeouts and circuit breakers
- Security: TLS-based encryption including key management
- Telemetry: collection of metrics and tracing identifiers
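To give the resilience bullet some substance, here is a minimal circuit breaker of the kind a mesh proxy applies on the caller's behalf: after a run of consecutive failures it rejects calls outright for a cool-down period instead of hammering a struggling service. Thresholds and the class itself are illustrative assumptions, far simpler than a real proxy's implementation.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    reject calls immediately until `reset_after` seconds have passed,
    then allow a single trial call ("half-open")."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open")  # fail fast
            self.opened_at = None  # half-open: permit one trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Failing fast like this keeps one slow dependency from tying up threads and cascading into the rest of the system, which is why meshes bundle it with retries and timeouts.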
Beyond this core set (and sometimes even within it), however, service meshes hold quite different opinions as to what is valuable to developers, architects and operators of microservices architectures. For instance, while Envoy supports WebSockets, Linkerd 2.0 does not (yet). And while both Istio and Consul support different data planes, Linkerd works only with its own. Consul comes with an easy-to-use, built-in data plane that can be swapped for a more powerful one when performance matters. Istio is designed as a separate, central control plane, while both Consul and Linkerd are fully distributed. Finally, of the service meshes discussed here, only Istio supports fault injection. These are just a few of the differences potential adopters must keep in mind.
Despite their apparent popularity and promise of attractive features, service meshes are not as widely used as one would expect. This is undoubtedly due in part to their relative novelty and the fact that the general space is still evolving. But they are also not without criticisms.
Typical concerns regarding service meshes include the net-new complexity they introduce, their comparatively poor performance and certain gaps regarding multi-cluster topologies.
Service meshes are opinionated platforms that require significant investments early on in build and operational tooling. Injecting a sidecar into a container may appear easy enough, but properly handling failures and retries requires substantial engineering effort. Such an investment can be difficult to justify for existing applications, applications with a short lifecycle or rapidly evolving applications.
Also, service meshes can have a considerable impact on application performance when compared to direct calls across the network, and that impact can be difficult to diagnose, let alone remediate. And, since most service meshes target self-contained, microservice-based applications rather than entire landscapes of connected applications, multi-cluster and multi-region topologies tend not to be well supported.
In short, service meshes are no panacea for architects and operators looking to run a growing portfolio of digital services in an agile manner. They are tactical affairs, a “below the ground” upgrade addressing technical issues that are predominantly developer concerns. They are not a game changer for the business.
Service meshes overlap with, but are distinct from, other architectural building blocks such as API and application gateways, load balancers, ingress and egress controllers, and application delivery controllers. The primary purpose of an API gateway is to expose services to the outside world as a single API while providing load balancing, security and basic API management. Ingress and egress controllers translate between unroutable addresses inside a container orchestrator and routable addresses outside of it. Application delivery controllers, finally, are similar to API gateways but specialize in accelerating the delivery of web applications, not just APIs.
Service meshes also differ from other control planes such as Glasnostic. While service meshes address the technical developer concerns regarding service-to-service communication within a microservice-based architecture, Glasnostic manages the global behaviors of arbitrary connected applications and services that operators care about. And because Glasnostic is a tool, not a platform, it is much easier to install, configure and run. Unlike service meshes, it works in any environment, with no agents, sidecars or voodoo required.