This is the second of a two-part series on canary deployments. In part one, we covered the developer pattern and how it is supported in Kubernetes, Linkerd and Istio. In this post, we explore the operational pattern, how it is implemented in Glasnostic, a comparison of the various implementations and finally the pros and cons of canary deployments.
There are two perspectives on the canary pattern: a narrow view that developers take when they ask themselves “will this update work as expected?” and a wider one that operations teams take when they wonder “will this update cause my service landscape to fail?” As a result, canary deployments as an operational pattern require control of traffic to the production and canary (and potentially baseline) clusters, plus controls to protect the parts of the service landscape that are surrounding the canary cluster. Operators also need the ability to create canary deployments independent of (and not interfering with) other, pre-existing and potentially overlapping policies. In short, operations teams want the ability to layer policies around the canary to protect the surrounding architecture.
Glasnostic is built from the ground up to support operational patterns. Analogous to a sound engineer’s mixing board, it is designed around the concept of grouping service interactions logically in channels, each of which then acts as a point of control for the interactions it applies to. Glasnostic supports the creation of any number of channels, for arbitrary sets of interactions. Once a channel is defined, operations teams may then control its underlying interactions by applying policies and operations. Channels are independent of each other and thus can be layered arbitrarily.
Glasnostic is a control plane for operations teams that controls the complex interactions among microservice applications in order to detect and remediate issues, prevent cascading failures and avert security breaches.
Let’s look at two examples of canary deployments from the operational perspective. First, we’ll look at a simple, undifferentiated deployment without client rules and then at a more involved deployment with client differentiation by source with an additional layered channel governing the canary cluster’s upstream interactions.
To implement a basic canary pattern in Glasnostic, operators can simply create a new channel for any traffic directed at the canary cluster and then apply a rate limit at whatever level is desired. While this does the job, it is often preferable to also create a channel to monitor the existing production cluster for comparison. Figure 1 illustrates this setup.
Creating a channel for the canary cluster allows for straightforward regulation of its traffic. This setup can be easily extended to one that includes a baseline cluster to compare against by creating a third channel around traffic to a subset of the production cluster that is of equal size to the canary cluster.
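To make the mechanics of this basic setup concrete, here is a minimal sketch in Python. It is not Glasnostic's API: the `Channel` class, its fields, the cluster names and the token-bucket limiter are all assumptions chosen for illustration. It models one monitor-only channel for the production cluster and one rate-limited channel for the canary.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Channel:
    """Hypothetical model of a channel: a named group of interactions."""
    name: str
    destination: str                    # cluster this channel covers (illustrative name)
    rate_limit: Optional[float] = None  # requests per second; None means monitor only
    admitted: int = 0
    rejected: int = 0
    _tokens: float = field(default=0.0, init=False)
    _last: Optional[float] = field(default=None, init=False)

    def admit(self, now: float) -> bool:
        """Token bucket: refills at rate_limit tokens/s, capped at one second of burst."""
        if self.rate_limit is None:
            self.admitted += 1          # monitor-only channel: count and pass through
            return True
        if self._last is None:
            self._tokens = self.rate_limit
        else:
            refill = (now - self._last) * self.rate_limit
            self._tokens = min(self.rate_limit, self._tokens + refill)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            self.admitted += 1
            return True
        self.rejected += 1
        return False


production = Channel("production", destination="inventory-prod")
canary = Channel("canary", destination="inventory-canary", rate_limit=5.0)

# 20 requests arrive at each cluster at the same instant.
for _ in range(20):
    production.admit(now=0.0)
    canary.admit(now=0.0)

print(production.admitted)              # 20
print(canary.admitted, canary.rejected) # 5 15
```

The production channel only counts what it sees, which gives us the comparison data, while the canary channel caps what the canary is allowed to receive. A baseline channel would simply be a third `Channel` instance over a canary-sized subset of production.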
Figure 2 shows a more complex canary deployment around an inventory management service within an e-commerce application that makes use of four channels. As before, the first channel is created to monitor the production cluster. This time, however, the canary is set to receive traffic from a different, development environment. As a result, the second channel governs requests from the development environment to the canary. In addition, a third backstop channel limits how much load the canary is allowed to generate towards upstream services. Finally, all these policies can be instituted, adjusted or removed without affecting a blanket segmentation between users and inventory services using a fourth channel.
This example shows how layered policies give operations teams full control over how changes to the service landscape are introduced and how to not only ensure that new deployments work on their own but also to protect the overall architecture from potential fallout from such changes.
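The layering in this example can be sketched as follows. Again, this is a simplified model under our own assumptions, not Glasnostic's implementation: channels are represented as predicates over a (source, destination) pair, each with an optional request budget per window, and an interaction is admitted only if every channel that matches it permits it. The service names and the budget values are hypothetical, and we model the blanket segmentation as a budget of zero.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Channel:
    """Hypothetical channel: a predicate over interactions plus an optional budget."""
    name: str
    matches: Callable[[str, str], bool]  # (source, destination) -> bool
    budget: Optional[int] = None         # max admitted requests per window; None = monitor only
    seen: int = 0                        # requests admitted through this channel so far

    def permits(self) -> bool:
        return self.budget is None or self.seen < self.budget


def admit(channels: List[Channel], source: str, destination: str) -> bool:
    """Admit an interaction only if every matching channel permits it."""
    matching = [c for c in channels if c.matches(source, destination)]
    if not all(c.permits() for c in matching):
        return False
    for c in matching:
        c.seen += 1
    return True


channels = [
    # 1. Monitor the production cluster.
    Channel("production", lambda s, d: d == "inventory-prod"),
    # 2. Govern requests from the development environment to the canary.
    Channel("dev-to-canary", lambda s, d: s == "dev" and d == "inventory-canary", budget=10),
    # 3. Backstop: limit the load the canary may generate towards upstream services.
    Channel("canary-upstream", lambda s, d: s == "inventory-canary", budget=3),
    # 4. Blanket segmentation between users and inventory services (modeled as deny).
    Channel("segmentation", lambda s, d: s == "users" and d.startswith("inventory-"), budget=0),
]

# The canary calls a (hypothetical) upstream service four times in one window.
results = [admit(channels, "inventory-canary", "warehouse-db") for _ in range(4)]
print(results)  # [True, True, True, False]
```

Because each channel is evaluated independently, removing the second or third channel from the list leaves the segmentation channel untouched, which is the property the example in the text relies on.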
There are several key advantages to Glasnostic’s operational approach to canary deployments:
Containment of canaries. Like any change introduced into a complex system, canaries can negatively impact your architecture. Instead of merely focusing on whether a new deployment works in isolation, Glasnostic allows operators to ring-fence canaries, thus protecting the existing architecture from any negative fallout.
Independent, multi-level control. Glasnostic is built around grouping arbitrary classes of traffic into logical channels and controlling them independently of each other. In the context of canary deployments, being able to define channels quickly not only provides a convenient way to establish a baseline cluster to compare a canary to, but also allows operators to further specialize individual traffic classes as needed by applying additional policies or operations. For instance, operators may use a canary deployment’s production cluster channel to backpressure against a sudden influx of bursty traffic or to ensure quality of service for tier one clients, all without affecting the canary pattern.
Unified operations. Because Glasnostic provides the same operational controls for canary deployments as for any other operational pattern, operations teams can work with a unified and cohesive toolset without having to contend with siloed solutions. As a result, operations teams are able to rely on a seamless operational workflow and stay in control of their service landscape.
Among the three projects we compared in part one of this series, Kubernetes has the least robust support for canary deployments. While ingress traffic can be subjected to some routing rules, the routing of intra-cluster (“east-west”) traffic is based on round-robin load balancing only. As a result, the share of traffic hitting a canary can only be influenced by adjusting the number of running production instances relative to canary instances.
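The arithmetic behind this limitation is worth spelling out. The sketch below assumes that requests are spread evenly across all instances behind one service, which is an idealization of round-robin balancing; the function names are ours.

```python
def canary_share(production_replicas: int, canary_replicas: int) -> float:
    """Fraction of traffic reaching the canary under even per-instance balancing."""
    total = production_replicas + canary_replicas
    if total == 0:
        raise ValueError("at least one replica is required")
    return canary_replicas / total


def production_replicas_for(percent: int, canary_replicas: int = 1) -> int:
    """Production replicas needed so the canary receives at most `percent` of traffic."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in 1..100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-(canary_replicas * (100 - percent)) // percent)


print(canary_share(3, 1))           # 0.25
print(canary_share(9, 1))           # 0.1
print(production_replicas_for(10))  # 9
print(production_replicas_for(1))   # 99
```

In other words, sending just 1% of traffic to a single canary instance means running 99 production instances, whether or not the load calls for them.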
Linkerd 1.x was built on top of Finagle and as such brings significantly more flexibility to canary deployments. In particular, it supports fine-grained routing rules based on weights and HTTP headers. Istio adds support for explicit client rules, thus allowing canary deployments to be based on source differentiation.
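Conceptually, weight- and header-based routing boils down to a decision like the one sketched below. This is a generic illustration, not Linkerd or Istio configuration: the `x-canary` header name and the `route` function are assumptions made for the example.

```python
import random
from typing import Mapping


def route(headers: Mapping[str, str], canary_weight: int, rng: random.Random) -> str:
    """Pick a cluster for one request.

    A request carrying the (hypothetical) `x-canary: true` header always goes
    to the canary; all other requests are split by weight, where
    `canary_weight` is the percentage (0..100) sent to the canary.
    """
    if headers.get("x-canary") == "true":
        return "canary"
    return "canary" if rng.randrange(100) < canary_weight else "production"


rng = random.Random(42)  # seeded only to make the demo repeatable

# Header-based rule: opted-in requests reach the canary even at weight 0.
print(route({"x-canary": "true"}, canary_weight=0, rng=rng))  # canary

# Weight-based rule: roughly 10% of ordinary requests reach the canary.
hits = sum(route({}, canary_weight=10, rng=rng) == "canary" for _ in range(10_000))
print(f"canary share: {hits / 10_000:.1%}")
```

Unlike the replica-ratio approach, the canary's share here is independent of how many instances each cluster runs.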
However, none of these projects approach canary deployments from an operational perspective. Round-robin balancing, destination rules, routing based on HTTP headers and client rules are all designed to balance traffic between production and canary clusters, not to protect the surrounding architecture from the deployment. As a result, these projects apply very localized, YAML-based configurations instead of helping operators approximate effective policies by presenting high-level metrics based on golden signals in a UI.
Ultimately, it is this localized application of static configuration that does not lend itself to creating the set of layered policies that an operational approach to canary deployments would require. This is the reason why Glasnostic was designed from the ground up around a UI that allows operations teams to “view and do,” to detect and remediate, with full support for policy layering.
Canary deployments help development and operations teams test each new deployment in production to see how it interacts with the “real world.” This is particularly useful in complex service landscapes with multiple microservice-based applications, where development teams introduce changes independent of each other and according to their own release schedules, or when upstream or third-party services over which the operator has no control are in the mix.
Fundamentally, observing the behavior of a new deployment in production (albeit with fewer users) will always be less risky than the alternative of “let’s just push to prod and see what happens, we can always roll back, right?” As a result, the main advantage of a canary deployment to operators is the ability to incrementally roll out new features and services while confining potential problems not only to a subset of users, but also to a subset of the operating environment, which includes the network, compute and storage infrastructure.
However, canary deployments are not without their challenges. A big one is that without significant upfront investments in reusable automation, monitoring, tooling and rollback mechanisms, canary deployments will require a large amount of manual setup work every time the pattern is put in place. On the monitoring side, canary deployments require some observability into KPIs like HTTP success rates to decide whether to promote the canary or to roll it back. Apart from the work of setting up such monitoring, these KPIs have to be monitored manually unless tools such as Weaveworks’ Flagger or Spinnaker Kayenta are used. Finally, rollbacks can be challenging if incompatibilities between deployment versions and database schema changes are not managed correctly.
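To illustrate what such a KPI-based decision might look like when automated, here is a deliberately simple sketch. It is not how Flagger or Kayenta work; the thresholds, the `Window` type and the three-way verdict are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """HTTP request counts observed for one cluster over an analysis window."""
    requests: int
    errors: int

    @property
    def success_rate(self) -> float:
        return 1 - self.errors / self.requests if self.requests else 0.0


def decide(canary: Window, baseline: Window,
           min_requests: int = 500, max_drop: float = 0.01) -> str:
    """Return 'wait', 'rollback' or 'promote'.

    min_requests: do not judge the canary on too small a sample.
    max_drop: tolerated shortfall in success rate relative to the baseline.
    Both defaults are illustrative, not recommendations.
    """
    if canary.requests < min_requests:
        return "wait"
    if canary.success_rate < baseline.success_rate - max_drop:
        return "rollback"
    return "promote"


baseline = Window(requests=10_000, errors=20)            # 99.8% success
print(decide(Window(requests=100, errors=0), baseline))  # wait
print(decide(Window(requests=1_000, errors=50), baseline))  # rollback
print(decide(Window(requests=1_000, errors=3), baseline))   # promote
```

Comparing against a baseline rather than a fixed threshold is what makes a canary-sized baseline cluster useful: both windows then see comparable traffic volumes.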
Canary deployments are also not a good idea in scenarios where even a small number of end users would not be able to tolerate failures of any kind or where the failures they experience may cause reputational harm. For example, services that could cause bank transfers to fail or services with the potential to fail very visibly are poor candidates for canary deployments if you would rather not have end users complain to your support department or on social media.
In part one of this series, we laid out the basic canary deployment pattern and summarized how some popular open-source projects support it. In this part, we discussed the differences between the developer- and operations-oriented variants of the canary deployment pattern and showed two examples of how canaries can be realized from an operational perspective with Glasnostic. Most importantly, we showed how this operational perspective requires an ability to layer policies. Among the projects discussed in this series, Glasnostic is the only product that supports such policy layering.
Canary deployments are a great first step towards deploying to production but require a not insignificant investment in automation and tooling. They also require a fundamental readiness to “move fast and break things,” which does not lend itself to critical, transactional or highly visible workloads. Nevertheless, the benefits of being able to move fast where companies can afford to do so far outweigh the cost of adopting canary deployments on a large scale.