This post is the first of a two-part series on the benefits of service meshes. In this post, we cover what a service mesh is, how it works and the benefits it provides. In part two we’ll explore why, where and when to use one and what lies beyond.

As we decompose applications into microservices, it becomes apparent rather quickly that calling services over the network is considerably more difficult and less reliable than anticipated. What used to “just work” now needs to be spelled out explicitly, for every client and every service. Clients need to discover service endpoints, ensure they agree on API versions, and pack and unpack messages. Clients need to monitor and manage call executions by catching errors, retrying failed calls and timing out when necessary. They may also need to ensure service identity, log calls and instrument transactions. And finally, the entire application may be required to comply with IAM, encryption or access control requirements.
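
To make this concrete, the following is a minimal Go sketch of the kind of boilerplate every client otherwise ends up carrying itself: a per-call timeout, a bounded number of retries and crude backoff. The endpoint URL and the retry parameters are illustrative placeholders, not taken from any particular system.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// callWithRetries shows the boilerplate a client needs without a service
// mesh: a per-request timeout, a bounded number of retries and a simple
// backoff between attempts.
func callWithRetries(url string, attempts int) (*http.Response, error) {
	client := &http.Client{Timeout: 2 * time.Second} // give up on slow calls

	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a client error we should not retry
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("server error: %s", resp.Status)
		}
		time.Sleep(time.Duration(i+1) * 200 * time.Millisecond) // crude linear backoff
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	// "http://orders.internal/api/orders" is a placeholder endpoint.
	resp, err := callWithRetries("http://orders.internal/api/orders", 3)
	if err != nil {
		fmt.Println("call failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

Multiply this by every client, every service and every failure mode, and the appeal of moving such logic out of application code becomes clear.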

Much of this is not particularly new, of course, and technologies that help with the mechanics of message exchange, such as SOAP, Apache Thrift and gRPC, have been around for a long time. What is new, however, is the proliferation of containers and the corresponding explosion of service calls, along with the degree of horizontal scale-out and the resulting transient nature of service endpoints. This new degree of complexity and volatility drives the desire to encapsulate the complexities of network communication and push them into a new network infrastructure layer. The most popular approach to providing such a layer today is known as the “service mesh.”

What Does “Service Mesh” Actually Mean?

A service mesh is not a “mesh of services.” It is a mesh of API proxies that (micro)services can plug into to completely abstract away the network. As William Morgan put it, it is a “dedicated infrastructure layer for making service-to-service communication safe, fast, and reliable.” Service meshes are designed to solve the many challenges developers face when talking to remote endpoints. However, it should be noted that they do not solve large-scale operational issues.

How Service Meshes Work

A typical service mesh architecture with data plane proxies deployed as sidecars and a separate control plane.

In a typical service mesh, service deployments are modified to include a dedicated “sidecar” proxy. Instead of calling other services directly over the network, services call their local sidecar proxies, which in turn encapsulate the complexities of the service-to-service exchange. The interconnected set of proxies in a service mesh is referred to as its “data plane.”
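
The change this implies for application code is small. In the hedged Go sketch below, the application sends its request to a proxy listening on localhost and names the logical service it wants to reach; the proxy handles discovery, load balancing, retries, TLS and telemetry. The local port (4140) and the “orders” service name are assumptions for illustration, and many meshes intercept outbound traffic transparently instead, so that application code does not change at all.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Talk to the local sidecar proxy rather than dialing the remote
	// service directly. The proxy resolves the logical "orders" service,
	// picks a healthy endpoint and applies retries, timeouts, mTLS and
	// telemetry on the application's behalf.
	req, err := http.NewRequest("GET", "http://localhost:4140/api/orders", nil)
	if err != nil {
		panic(err)
	}
	req.Host = "orders" // tell the proxy which logical service to route to

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```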

In contrast, the set of APIs and tools used to control proxy behavior across the service mesh is referred to as its “control plane.” The control plane is where users specify policies and configure the data plane as a whole.

Both a data plane and a control plane are needed to implement a service mesh.
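
To give a feel for the division of labor, here is a deliberately simplified, hypothetical policy object in Go, not the API of any actual mesh, of the sort an operator might declare once through the control plane and have pushed down to every proxy in the data plane:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// TrafficPolicy is a hypothetical, simplified stand-in for the kind of
// configuration a control plane distributes to data plane proxies.
type TrafficPolicy struct {
	Service      string        `json:"service"`       // logical service the policy applies to
	Retries      int           `json:"retries"`       // retry failed calls this many times
	Timeout      time.Duration `json:"timeout"`       // per-call timeout (nanoseconds in JSON)
	CanaryWeight int           `json:"canary_weight"` // percent of traffic routed to the canary version
	MutualTLS    bool          `json:"mutual_tls"`    // encrypt and authenticate service-to-service calls
}

func main() {
	// An operator declares intent once, centrally...
	policy := TrafficPolicy{
		Service:      "orders",
		Retries:      3,
		Timeout:      2 * time.Second,
		CanaryWeight: 10,
		MutualTLS:    true,
	}

	// ...and the control plane serializes and pushes it to each sidecar,
	// which enforces it on every call without any application changes.
	payload, err := json.MarshalIndent(policy, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
}
```

The important point is that policy is specified in one place and enforced uniformly by the proxies; application code never sees it.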

Key Players: Envoy, Linkerd, Istio and Consul

Envoy is an open source proxy server developed at Lyft that forms the data plane in many service meshes today, including Istio. Envoy has quickly displaced other proxy servers thanks to its convenient configuration API, which allows control planes to adjust its behavior in real time.

Linkerd is an open source project sponsored by Buoyant and the original service mesh. Initially written in Scala and built on Twitter’s Finagle library, from which it evolved, it has since merged with the lightweight Conduit project and been relaunched as Linkerd 2.0.

Istio is arguably the most popular service mesh today. It was launched jointly by Google, IBM and Lyft and is expected to ultimately join the Cloud Native Computing Foundation (CNCF). Strictly speaking, Istio is a control plane that needs to be paired with a data plane to form a service mesh. It is typically paired with Envoy and runs best on Kubernetes.

Consul is a newer addition to the ecosystem of control planes; it works with multi-datacenter topologies and specializes in service discovery. Consul works with a number of data planes and can be used with or without other control planes such as Istio. It is sponsored by HashiCorp.

Core Benefits and Differences of Opinion

Although the service mesh space is still evolving, most projects seem to agree on a core set of features that should be supported:

  • Service discovery: registry and discovery of services
  • Routing: intelligent load balancing and network routing, better health checks, automatic deployment patterns such as blue-green or canary deployments
  • Resilience: retries, timeouts and circuit breakers
  • Security: TLS-based encryption including key management
  • Telemetry: collection of metrics and tracing identifiers

Beyond these (and sometimes even on these features), however, service meshes can hold quite different opinions as to what might be valuable to developers, architects and operators of microservices architectures. For instance, while Envoy supports WebSockets, Linkerd 2.0 does not (yet). And while both Istio and Consul support different data planes, Linkerd works only with its own. Consul comes with an easy-to-use, built-in data plane that can be swapped for a more powerful one when performance matters. Istio is designed as a separate, central control plane, while both Consul and Linkerd are fully distributed. Finally, of all the service meshes discussed, only Istio supports fault injection. These are just a few of the differences potential adopters must keep in mind.

Criticisms of Service Meshes

Despite their apparent popularity and promise of attractive features, service meshes are not as widely used as one would expect. This is undoubtedly due in part to their relative novelty and the fact that the general space is still evolving. But they are also not without criticisms.

Typical concerns regarding service meshes include the net-new complexity they introduce, their comparatively poor performance and certain gaps regarding multi-cluster topologies.

Service meshes are opinionated platforms that require significant investments early on in build and operational tooling. Injecting a sidecar into a container may appear easy enough, but properly handling failures and retries requires substantial engineering effort. Such an investment can be difficult to justify for existing applications, applications with a short lifecycle or rapidly evolving applications.

Also, service meshes can have a considerable impact on application performance when compared to direct calls across the network, and that impact can be difficult to diagnose, let alone remediate. And, since most service meshes target self-contained, microservice-based applications rather than entire landscapes of connected applications, multi-cluster and multi-region topologies tend not to be well supported.

In short, service meshes are no panacea for architects and operators looking to run a growing portfolio of digital services in an agile manner. They are tactical affairs, a “below the ground” upgrade that addresses technical issues which are predominantly developer concerns. They are not a game changer for the business.

Service meshes overlap with, but are distinct from, other architectural building blocks such as API and application gateways, load balancers, ingress and egress controllers, and application delivery controllers. The primary purpose of an API gateway is to expose services to the outside world as a single API while providing load balancing, security and basic API management. Ingress and egress controllers translate between unroutable addresses within a container orchestrator and routable addresses outside of it. Application delivery controllers, finally, are similar to API gateways but specialize in accelerating the delivery of web applications, not just APIs.

Managing connected applications and services with Glasnostic.

Service meshes also differ from other control planes such as Glasnostic. While service meshes address the technical developer concerns regarding service-to-service communication within a microservice-based architecture, Glasnostic manages the global behaviors of arbitrary connected applications and services that operators care about. And because Glasnostic is a tool, not a platform, it is much easier to install, configure and run. Unlike service meshes, it works in any environment, with no agents, sidecars or voodoo required.


What’s Next?

Control Your Services

The ease with which applications and services can be developed today is of great benefit to businesses. But the proliferation of applications, microservices and serverless functions creates a jungle of services that is impossible to control and results in considerable loss to the business if left unmanaged. Glasnostic is a real-time operations solution that lets digital enterprises control the complex, emergent behaviors their connected applications and services exhibit so they can innovate faster and with confidence.