Today’s application architects have effectively abandoned monolithic designs in favor of cloud native microservices so they can respond more quickly to changing business needs, accelerate developer agility and, of course, take full advantage of the elasticity of the cloud. Naturally, there is a cost associated with adopting microservices. With many more moving pieces than a monolithic application, a microservice architecture requires considerably more management, monitoring and security. Service mesh technologies like Istio and Linkerd have appeared in recent years on the promise that they will make managing, monitoring and securing microservices easier. Aside from their fundamental benefit of managing service-to-service connections, i.e. routing calls from a source service to the optimal destination service instance, service meshes also provide developers with valuable benefits in the three key areas of observability, traffic control and security.
In part 1 of this series, we explore these developer benefits. In part 2, we’ll turn around and examine their limitations from an operational perspective.
A service mesh is a dedicated infrastructure layer that aims to “make service-to-service calls within a microservice architecture reliable, fast and secure.” It is not a “mesh of services” but rather a mesh of proxies that services can plug into to completely abstract the network away. In a typical service mesh, these proxies are injected into each service deployment as a sidecar. Instead of calling services directly over the network, services then call their local sidecar proxy, which in turn manages the request on the service’s behalf, thus encapsulating the complexities of the service-to-service exchange. The interconnected set of sidecar proxies implements what is referred to as the data plane. This is in contrast to the components of a service mesh that are used to configure the proxies and collect metrics, which are referred to as the service mesh control plane.
In a nutshell, service meshes are designed to solve the many challenges developers face when talking to remote endpoints. Service meshes are particularly useful for “greenfield” applications that run on a container orchestrator such as Kubernetes.
Right now, the service mesh with the most developer buzz is the Istio project, originally developed by Google, IBM and Lyft. Although not as popular as Istio, Linkerd by Buoyant is the “original” service mesh and is still widely used today. Several web-scale companies with large microservice deployments have developed their own in-house service meshes based on the Envoy proxy, which, coincidentally, Istio also builds upon. Other service meshes have come onto the scene as well, including AspenMesh by F5, Consul Connect by HashiCorp, Kong and AppMesh by AWS. Microsoft has also started an initiative, dubbed SMI, to standardize the various service mesh interfaces. For an overview of the current state of the microservices ecosystem, check out our post, “2019 Microservices Ecosystem.”
A technology as ambitious as a service mesh, which aims to solve many, if not most, of the problems that result from running a sprawling microservice architecture, is not going to be immune to criticism. Some of the criticism centers on the potential of service meshes to introduce the following undesirable effects:
- Added Complexity: The introduction of proxies, sidecars and other components into an already sophisticated environment dramatically increases the complexity of development and operations.
- Required Expertise: Adding a service mesh such as Istio on top of an orchestrator such as Kubernetes often requires operators to become experts in both technologies.
- Slowness: Service meshes are an invasive and intricate technology that can add significant slowness to an architecture.
- Adoption of a Platform: The invasiveness of service meshes forces both developers and operators to adapt to a highly opinionated platform and conform to its rules.
Despite these limitations, service meshes are clearly not without their benefits in the right environment, especially small decomposed applications running on Kubernetes. While operations teams remain cautious, developers are diving in head first, drawn to the promise of seemingly comprehensive observability, traffic control, and security features. In the upcoming sections, we’ll explore each one of these benefits in detail.
Decomposing an application into a number of microservices doesn’t automatically turn it into a network of independent services. The application still acts as the single, stand-alone application it was before—it has merely become distributed. Its microservices typically share the same code repository and are part of a single architectural blueprint. They are less like services shared across multiple applications than components of their parent application.
Because such microservice applications still act as individual, stand-alone applications, not as a network of independent services, development teams itch to troubleshoot them just as they would with a monolith. Of course, debugging such microservice applications has become harder because the application components are now distributed. This is precisely the reason why engineers very much desire the ability to trace requests across remote services for debugging purposes. The term often associated with such distributed debugging is “observability.”
Because a service mesh is a dedicated infrastructure layer through which all service-to-service communication passes, it is uniquely positioned within the technology stack to provide uniform telemetry metrics at the service call level. This means that, for better or for worse, services are monitored as “black boxes.” Service meshes capture wire data like source, destination, protocol, URL, status codes, latency, duration and the like. This is essentially equivalent to the data that web server logs can provide, but of course, service meshes capture this data for all services, not just the web layer.
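To make the “black box” idea concrete, here is a minimal sketch of the kind of per-call record a sidecar proxy could capture. It is an illustration only, not how any particular mesh implements telemetry: the `CallRecord` type, the `observe_call` function and the `send` callback are hypothetical names introduced for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CallRecord:
    """Wire-level facts about one service-to-service call (hypothetical schema)."""
    source: str
    destination: str
    protocol: str
    url: str
    status_code: int
    latency_ms: float


def observe_call(
    source: str,
    destination: str,
    url: str,
    send: Callable[[str], int],
    sink: List[CallRecord],
    protocol: str = "HTTP/1.1",
) -> int:
    """Forward a call via `send` and append what was observed to `sink`.

    `send` stands in for the real network call and returns a status code.
    Nothing about the service's internals is inspected: only wire data.
    """
    start = time.perf_counter()
    status_code = send(url)
    latency_ms = (time.perf_counter() - start) * 1000.0
    sink.append(
        CallRecord(source, destination, protocol, url, status_code, latency_ms)
    )
    return status_code
```

Because the record is built entirely from what passes over the wire, the same wrapper works for every service, which is what makes the resulting metrics uniform.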
Once captured, metrics and logs are collected by the service mesh’s control plane and passed along to the monitoring tool of choice. For companies that rely heavily on open source technologies, Prometheus and Grafana are popular choices for storage and visualization, respectively.
Aside from metering service-to-service calls, some service meshes also support tracing. With effective tracing, engineers are able to troubleshoot compositional problems like sequencing issues, service call tree abnormalities, and request-specific issues. Tracing is possible with service meshes because of the use of span identifiers and forwarded context headers. Of course, to make tracing work, every service needs to be modified to read tracing headers upon input, pass them along to all the related threads of execution and then add them to every call to other services.
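The header forwarding described above might look roughly like the following sketch. The list of header names is an assumption based on the commonly used B3 and W3C Trace Context conventions. The exact set a given mesh and tracing backend expect may differ, and `propagate_trace_headers` is a hypothetical helper, not part of any mesh's API.

```python
from typing import Dict, Mapping

# Assumed header names (B3 and W3C Trace Context conventions); check the
# documentation of the mesh and tracer in use for the authoritative list.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
)


def propagate_trace_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Return the tracing headers from an incoming request, ready to be
    attached to every outgoing call made while handling that request.

    HTTP header names are case-insensitive, so they are normalized to
    lowercase before matching.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}
```

A service would call this once per incoming request and merge the result into the headers of each downstream call, including calls made from worker threads.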
It bears pointing out that collecting data is just one part of solving the observability problem in microservice applications. Collecting and storing metrics needs to be complemented with capable mechanisms for analyzing the data and then acting on it through alerts or procedures such as auto-scaling routines or the application of an operational pattern like a circuit breaker.
When it comes to meeting service level objectives like latency and uptime, the ability to manage traffic between services is critical. This is because it allows the operations team to implement operational patterns like circuit breaking or backpressure to compensate for poorly behaving services.
Service meshes can provide this type of traffic control. Because their primary function is to manage service-to-service communication, they are able to provide such features rather easily. However, because they are designed to effectively connect a source call to its optimal destination service instance, these traffic control features are destination-oriented. In other words, service meshes are well suited to balance individual calls across a number of destination instances, but rather unsuitable to control traffic from a number of sources to an individual destination or to control traffic across an entire service landscape, for that matter.
Service meshes provide control over service-to-service calls through their control plane, which ultimately configures the proxies that make up their data plane. Although the exact functionality can vary from service mesh to service mesh, most support smart, i.e. latency-aware, load balancing (also called “intelligent routing”) and routing rules based on request properties. And because service mesh control extends from Layer 4 into Layer 5 and above, some also offer development teams the ability to implement resiliency patterns like retries, timeouts and deadlines as well as more advanced patterns like circuit breaking, canary releases, and A/B releases.
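As a rough illustration of latency-aware load balancing, the sketch below keeps an exponentially weighted moving average (EWMA) of observed latency per destination instance and always picks the instance with the lowest average. This is one simple way to express the idea, assumed here for illustration. Production meshes use more elaborate algorithms, and the class name and the `alpha` smoothing factor are hypothetical.

```python
from typing import Dict, Iterable


class LatencyAwareBalancer:
    """Pick the destination instance with the lowest smoothed latency."""

    def __init__(self, instances: Iterable[str], alpha: float = 0.3) -> None:
        # alpha controls how quickly the average reacts to new samples.
        self.alpha = alpha
        # Instances start at 0.0 so that untried instances are preferred.
        self.ewma_ms: Dict[str, float] = {name: 0.0 for name in instances}

    def pick(self) -> str:
        """Return the instance with the lowest EWMA latency."""
        return min(self.ewma_ms, key=self.ewma_ms.get)

    def report(self, instance: str, latency_ms: float) -> None:
        """Fold a newly observed latency into the instance's average."""
        previous = self.ewma_ms[instance]
        self.ewma_ms[instance] = (
            self.alpha * latency_ms + (1.0 - self.alpha) * previous
        )
```

Note that the decision is made per call and per destination instance, which matches the destination-oriented nature of service mesh traffic control described earlier.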
For example, using timeouts, developers can limit the amount of time a microservice will spend waiting for another service to complete a request. If this proves too crude a threshold, timeouts can be used to kick off a circuit breaker instead. When the circuit breaker is “tripped” that way, it will remain “open” for a while until the service mesh deems the service available again. That way, downstream clients are protected from excessive slowness of upstream services and services, in turn, are saved from being overloaded by a backlog of requests. Or, if a specific client’s request behavior threatens the service level of other clients making requests to the same shared service, developers can rate-limit the high-volume client so that others are not drowned out. Finally, service meshes can help with traffic control by enforcing quotas. For example, operators may want to charge clients by the request or limit client requests within a given timeframe.
To some extent, monolithic applications are protected by their single address space. Once a monolith has been broken up into microservices, however, the network becomes a substantial attack surface. More services means more network traffic, which, for hackers, means more opportunities to attack the flow of information. This is the reason why service meshes provide the ability (and infrastructure) to secure network calls.
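The interplay of timeouts and circuit breaking can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions, not the implementation used by any service mesh: the thresholds, the `CircuitBreaker` and `CircuitOpenError` names, and the choice to allow one trial call after the cool-down are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of piling more load on a slow service.
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-down elapsed: allow a trial call. A single further
            # timeout re-opens the circuit immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

While the circuit is open, callers get an immediate error rather than waiting on a timeout, which protects them from the slow service and gives that service room to recover.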
The security-related benefits of service meshes revolve around three core areas:
- The authentication of services.
- The encryption of traffic between services.
- Security-specific policy enforcement.
Istio, for example, provides developers with a certificate authority to manage keys and certificates. With Istio, you are able to generate certificates for each service and to transparently manage their distribution, rotation and revocation. With these capabilities, services can authenticate each other and implement proper access controls. Usually, this takes the form of both white lists and black lists, so a service knows whether or not to accept an incoming request. Regarding encryption, service meshes are able to lock down data plane traffic using mutual Transport Layer Security (mTLS), making service-to-service communication more secure. Finally, some service meshes are able to enforce various security policies that apply to either the pod, certain namespaces or specific services.
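The access-control decision based on white lists and black lists could be sketched like this. The example assumes the caller's identity has already been authenticated, for instance from the certificate presented during the mTLS handshake. The `AccessPolicy` type, the identity strings, and the rule that a black list entry overrides a white list entry are illustrative assumptions, not the policy model of Istio or any other mesh.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class AccessPolicy:
    """Allow/deny decision for one destination service (hypothetical model)."""
    allow: Set[str] = field(default_factory=set)  # the white list
    deny: Set[str] = field(default_factory=set)   # the black list

    def permits(self, caller_identity: str) -> bool:
        """Decide whether an already-authenticated caller may proceed.

        Assumed semantics: deny entries win over allow entries, and any
        caller that appears on neither list is rejected by default.
        """
        if caller_identity in self.deny:
            return False
        return caller_identity in self.allow
```

For example, a policy created with `AccessPolicy(allow={"frontend"}, deny={"batch-job"})` would accept calls from `frontend` and reject calls from `batch-job` as well as from any unlisted service.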
Service meshes are a novel, but beneficial technology for developers itching to solve the many problems that result from running a containerized microservice architecture. Fundamentally, service meshes solve the problem of discovering, and routing calls to, the best service instance. Aside from “linking” services, however, service meshes can also provide developers with valuable observability, traffic control and security benefits.
Service meshes are also limited. For one, they are complex and opinionated pieces of technology, which largely limits their applicability to “greenfield” applications on Kubernetes. They can also be slow, which limits the scale and complexity of applications they can support. As a result, they are best suited for fairly small, containerized microservice applications that run on a container scheduler like Kubernetes.
The value service meshes provide in the observability, traffic control and security areas, however, is limited, too. Observability really only matters in self-contained, distributed applications that are governed by one common blueprint (and hosted in a single git repository). Traffic control is limited by its routing-oriented design, which makes it practically impossible to express policies between arbitrary sets of endpoints. Finally, security is limited by the insular nature of service meshes. Automatic encryption is a big step towards a zero-trust environment, but without large-scale support in heterogeneous architectures, it is ultimately of limited value.
In part 2 of this series, we’ll examine the limitations of these three value areas (observability, traffic control and security) from an operational perspective.