Aerial photo of highway junction
Source: unsplash.com/@justjohnl

Microservice Routing

Earlier this month, I had the pleasure of talking to Jeffrey Meyerson of the “Software Engineering Daily” podcast about Glasnostic and the importance of bringing traffic control to microservices. In this episode, we touched on a wide range of topics.

Below is the transcript of the episode, edited for clarity and content.

• • •

Jeff: Microservices route requests between each other. As the underlying infrastructure changes, this routing becomes more complex and dynamic, and the interaction patterns across this infrastructure require operators to create rules around traffic management. Tobias Kunze is the founder of Glasnostic, a solution for ensuring the resilience of microservice applications. He joins the show to talk about microservice routing and traffic management and what he has built with Glasnostic.

Tobias, welcome to the show.

Tobias: Thanks for having me, Jeff.

The State of the Art for Traffic Management

Jeff: We’ve got all these different services. They’re running in containers across our infrastructure. How are we controlling the traffic that is being shuttled between these different services?

Tobias: That’s an excellent question! Today, you have only two places where you can control how traffic behaves. Number one is the old school technique of setting timeouts in code. What we used to do in the UNIX days. Number two is configuration files. For instance, you can use a service mesh to configure traffic behaviors or you can use plain Envoy proxies in-between to specify retries, timeouts and these kinds of things. But it’s fairly limited in how flexibly you can apply this over sets of services, and it’s very difficult to change over time.
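To make the first option concrete, here is a minimal sketch of what "setting timeouts in code" tends to look like. The function name `call_with_retries` and its parameters are hypothetical and not taken from any particular library; the point is only that the timeout, retry count and backoff are baked into the caller, which is why they are hard to change across many services at runtime.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[float], T],
    timeout_s: float = 2.0,
    max_attempts: int = 3,
    backoff_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(timeout_s), retrying on TimeoutError with exponential backoff.

    `request` is any callable that accepts a timeout in seconds and raises
    TimeoutError when the dependency is too slow (an assumption of this sketch).
    """
    last_error: Optional[TimeoutError] = None
    for attempt in range(max_attempts):
        try:
            return request(timeout_s)
        except TimeoutError as err:
            last_error = err
            # Back off before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(backoff_s * (2 ** attempt))
    raise TimeoutError(f"gave up after {max_attempts} attempts") from last_error
```

Every caller that embeds values like these has to be redeployed to change them, which is the limitation described above.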

Jeff: Tell me more about the modern state of the art for traffic management.

Tobias: There really isn’t much because the industry is so focused on observability, the control aspects kind of fall by the wayside. It’s all about “how do I get more transactional visibility?”, “how do I get more tracing done?” and, when I trace, “how do I get more data around what I’m tracing?” The traffic management aspects are almost orthogonal to these concerns. There’s a little bit of a disconnect in the industry today where the behavioral aspects of how my systems work together are not at the center of the focus.

And that’s mostly because, as engineers, we’ve grown up to write applications, write functions, think in terms of threads of execution. That’s the “happy path” we aim to implement! Then, when we deployed, we may have deployed 5 or 10 services at most. Most of the time, these were standalone applications, or at least as far as we were concerned as engineers. What happens in the real world, though, is that those 5 or 10 services connect to 20 other systems. The reality of what’s out there is an entire fabric of services, a fabric of systems, that are highly connected.

At that stage, the success of your architecture is almost 100% dependent on how all these other systems work. That’s where traffic management comes in. That’s where you need to be able to act as an air traffic controller. You need to have a way to insulate yourself, to control the pieces that you depend on, even though you don’t own them.

Why, How and When to Use Glasnostic

Jeff: You work on Glasnostic. Explain what Glasnostic does.

Tobias: At the highest level, Glasnostic is an operations solution. We take on those large-scale and continually evolving landscapes of services that we are running these days. Think 20, 50 or 200 teams, all deploying in parallel and all those services are connected, talk to each other. Then think about how to control and manage how these systems work together. It’s a classical management task. Our goal is that you won’t have to think about this in code and that you won’t have to go and put behavioral directives in static configuration. And you won’t have to do that because we provide operators with real-time control over how these services interact.

Jeff: Why is that useful?

Tobias: As I mentioned earlier, systems today are connected. No application is an island. If your application is connected to other applications, you depend on these other applications in some way, shape, or form. That means your fate depends on how these systems perform. Now, if you connect more than three of these systems, you get into a multi-body physics scenario, where things get chaotic very quickly.

And if you look at this a bit closer, application interactions become chaotic because all these applications are composed—they are stitched together—and each individual application is finite in many different and often unknown ways. So, as you stitch these finite systems together, you’re going to hit all kinds of limits.

If you listen to any site reliability engineering talk at an industry conference these days, the number one issue is such limits. Every single behavior of every component is limited, and you are going to hit some limit—whether it’s a replica limit, whether it’s a read count limit, whether it’s some other weird limit. And all of a sudden, your systems behave differently.

You need to be in a position to see this, to detect this very quickly, and then to do something about it. Our standard model of doing something about an outage, a failure, a degradation these days is really to diagnose. Diagnose for five hours, eight hours, and then fix something in maybe five minutes. We change this model. In our model, you can detect issues within 20 seconds and do something about them in 10.

Jeff: You see yourself more as an SRE tool.

Tobias: We are an operations tool or, even more, an operations solution. We change the way you operate these large scale infrastructures, these large scale architectures. I see us as an enabler for operations to run 10x more systems, interconnect them more quickly and evolve them more rapidly.

Jeff: Can you describe the usage of Glasnostic in more detail?

Tobias: Sure. It’s straightforward, really. We deploy into the network, where we then act as a “bump in the wire.” In other words, we neither touch any of your workloads, nor do we install agents or anything like that. We deploy into the network. Then, essentially, we look at who’s talking to who, how much and when. That’s what we then make visible and, more importantly, controllable.

Jeff: Give me an example of deploying a service and how Glasnostic would work with that.

Tobias: Very easily. You have a piece of code running down a CI/CD pipeline. Typically, this will get deployed into a pre-production environment, a staging environment or something like that. That, of course, is not really representative of what’s happening in the real world.

Ultimately, you want to deploy into production because staging is expensive and not really that useful anymore in larger installations. But deploying to production is risky enough even if you tested the deployment thoroughly in staging. If you could just ringfence what you deploy, or if you could just quarantine it to some degree—canary deployments help, obviously, but if you had a more general way of ringfencing things—you could massively mitigate deployment risk.

Glasnostic is a way of controlling, at runtime, how your services interact. Because if you have that ability, you’re not going to have an outage. You may run in a degraded form, you may run at, let’s say, 90% or 95% capacity but the benefit is that you can avoid an outright outage.

Think back to the beginning of COVID, when Robinhood went down, very publicly, for three days, during most of which nobody was able to trade. In other words, when markets tanked by 25%, nobody could trade their shares. The official cause was, “Oh, sorry. Our DNS services got overwhelmed.”

Now, the very obvious question that springs to mind here is: how could this even happen in the first place? How can you expose yourself to that kind of load and you have no way of exerting even classic backpressure? It’s almost negligent, right?

With us, they could have remediated the situation in, like, 20 or 30 seconds. All you need to remediate is to see, well, there’s a lot of traffic hitting our DNS services and then to take action, e.g., to exert classic backpressure, and the situation is taken care of! At least that buys you enough time to think clearly and without time pressure about what you want to do about it instead of picking up the pieces after everything has collapsed and you’ve been down for hours, and your phones are ringing, and your pagers are going haywire.

It is a fundamentally smarter approach to operations—control application behaviors at runtime, when they happen, not ahead of that, at coding time.

Jeff: How do you visualize the traffic management of different services and how is that advantageous?

Tobias: Think about what happens when you have some form of failure in your architecture. Again, keep in mind, we’re talking about service landscapes—continually evolving, pretty complex environments. What happens in these environments is that all your failures are suddenly non-linear. Something may start out as a CPU issue somewhere but becomes a latency issue upstream before it turns into a retry storm on the next system, and then you end up with wide-ranging, asymptomatic slowness that affects maybe 250 systems.

All the while, all these systems throw arbitrary alerts. And, of course, they only throw alerts on the metrics that they see. And now your engineering teams jump on the first alert that speaks to them. This is headless-chicken mode, of course. They run around and try all kinds of things. We got this alert, we’re out of file descriptors here, and nobody knows what’s really going on. Everybody is looking at the trees, nobody is looking at the forest.

Failures and breaches in service landscapes are large-scale, non-linear chains of events that threaten to bring the business down.

Our value here is that we only focus on the big picture. We only look at how everything works together. And by looking at how these services interact with each other and how these interactions change over time, and how these individual golden signals that we capture correlate to each other, we can provide a very good sense of what’s wrong.

And then you can do something about it by managing what’s wrong and controlling how these systems behave, from the outside. As an operator, we allow you to act like an emergency medical team. There’s a victim at the side of the road, you arrive at the scene, you look at the golden signals: pulse, temperature, pupils, whatever you look at when you assess a patient. And then you stabilize the patient.

The first order of business is always to stabilize. It’s not about diagnostics, not about root cause analysis—these kinds of things. As you know, most of the time, when you go to the doctor, root cause doesn’t really matter. It’s too complicated, there are too many factors involved, and the condition is not necessarily coming back, so time spent on diagnostics is largely time wasted. It is more important to treat the patient. That’s what we’re all about, and that’s why what we do is so effective. Rapid, high-level visibility: here is something that’s not quite right, then do something about it. Doing something about it doesn’t have to mean shutting things off. It only means shaping a little bit, adjusting, allowing it to interact a little bit differently.

How to Handle Failure In Complex Environments

Jeff: When a failure occurs in infrastructure, what kind of tools do we want in place? What types of visualizations and SRE tools do we want in place to make that failure easier to deal with?

Tobias: Well, of course, I would say Glasnostic! There’s a place for observability, for Datadog, for New Relic, for AppDynamics, traditional observability tools. But, they address increasingly local concerns. If you think about it, if you’re a “2 pizza”-team of five or six engineers working on four or five, maybe two handfuls of services—that’s your horizon of responsibility. That’s what you’re really interested in. That’s where you’re looking at “how can I trace through these requests?” that’s where you want to have more runtime debugging, and that’s where you may be looking at specific traces through these services. In all those cases, these tools are super helpful.

However, as your services become part of a larger whole, these tools are not that useful anymore. You don’t want to trace into the 40 different dependencies that you have. Those are just that: they’re just dependencies, and you don’t even own the code! You don’t need visibility into how many sockets there are in CLOSE_WAIT. Those are services that you need to be able to depend on, not services you should debug, period.

If these services misbehave, you need a way of insulating yourself as far as you can from these failures. This may involve circuit breaking, or it may involve inserting some kind of bulkhead or even segmentation; any of these control primitives may be helpful in that situation. And that’s precisely what we focus on. Ultimately, in these environments, everything comes down to how systems interact. It no longer matters what these systems do internally, or whether there’s a defect in code. What matters is how they interact with other systems. That’s why we focus exclusively on that aspect.
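For readers unfamiliar with circuit breaking as a control primitive, the following is a minimal in-process sketch. It is not how Glasnostic implements it (the interview describes control applied in the network, from the outside); the class name, thresholds and injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        # The circuit counts as open only until the reset interval has elapsed.
        if self._opened_at is None:
            return False
        return self._clock() - self._opened_at < self.reset_after_s

    def call(self, fn: Callable[[], T]) -> T:
        if self.is_open:
            raise CircuitOpenError("dependency is isolated; failing fast")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Trip (or re-trip after a failed trial call).
                self._opened_at = self._clock()
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Once the reset interval passes, the next call acts as a trial: a success closes the circuit, a failure reopens it immediately.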

Jeff: Tell me more about remediating a failure that might occur.

Tobias: Sure, going back to the backpressure example, the classic situation is: you get overwhelmed. Some services in your environment get overwhelmed. Most of the time, you don’t notice this until it spirals out of control and hits systems that are three or four hops upstream, or other systems you share dependencies with. With us, you have the ability to quickly see that you’re running a little bit out of capacity and that your auto-scaling rules, or whatever you have in place to deal with capacity, are not quick enough to react. If you see that, nothing matters other than the ability to do something about it. Don’t watch it spiral out of control. Do something about it, briefly, for a few minutes, while your other system spins up a replica, chooses a primary, whatever it does to create or restore capacity. Take care of the situation.
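The "classic backpressure" referred to here can be sketched as a cap on in-flight requests: once the cap is reached, new work is rejected quickly instead of piling up. The sketch below is an in-process illustration under that assumption; `ConcurrencyLimiter` and `Overloaded` are hypothetical names, and a network-level control would apply the same idea outside the service.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class Overloaded(RuntimeError):
    """Raised when a request is shed because too many are already in flight."""


class ConcurrencyLimiter:
    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self) -> Iterator[None]:
        # Non-blocking acquire: reject immediately rather than queueing,
        # so callers feel the pressure and can slow down or retry later.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many requests in flight")
        try:
            yield
        finally:
            self._slots.release()
```

A handler would wrap its work in `with limiter.slot(): ...` and translate `Overloaded` into a fast "try again later" response.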

To give an example: you have things replicated in two different availability zones. Of course, you want to separate them out, but at the same time, you want to make sure these business-critical servers can fail over to the other zone. This is where you want to insert an operational bulkhead in-between that takes everything that’s going on in one zone and isolates it to some degree from the other zone, while at the same time, allowing a fail-over for specific systems. Those are the critical capabilities if you run anything at complexity.

So, in summary, you need to be able to manage much like a manager in a large organization. If you have 50 teams, you restructure them to suit your goals for the next quarter. Then you’re not going to lean back and not do anything, right? At that point, your job is day-to-day management to make sure your teams work together and that they’re not blocked on anything. That’s your job. We bring that capability to the operational side of running complex architectures.

Diagram showing Glasnostic integration points
Glasnostic can insert itself in multiple data planes, such as VM-based environments, Kubernetes or Istio.

Glasnostic as a “Bump in the Wire”

Jeff: Tell me about the deployment model for Glasnostic. Isn’t this something that you would generally have a sidecar deployed for?

Tobias: Yes. We eschew sidecars, though, because they’re invasive and tend to become quite expensive over the course of their lifecycle. Also, we eschew them because we don’t need them! As I mentioned earlier, we are a “bump in the wire,” as the networking folks like to say. Although, I like to think of us more as a “brain in the wire,” as we add a little bit of intelligence to the data plane.

By virtue of being a “bump in the wire,” we get to see everything that flows across it. We resolve endpoints, detect pathological interaction patterns between arbitrary sets of endpoints, and then we can do something about them. Now, obviously, there is an environment-dependent part, which is how we insert ourselves. And then there is the environment-independent part, our virtual network function.

We can insert ourselves in a number of different ways. For instance, in a standard VM-based or on-premises model that’s host-based, we will just spin up a new VM that acts as a router in the network. On Kubernetes, we have a number of ways we can plug in, for example as a DaemonSet or as a network filter. And if you’re running Istio, for example, we can plug into the Istio proxies or any Envoy proxy and simply use those as our environment-dependent routing infrastructure. In other words, we just plug our traffic controller (our “virtual network function”) into Envoy.

Each environment is slightly different—plain Kubernetes, Kubernetes with Istio, Kubernetes with other service meshes, VM-based environments, multi-access edge computing environments. Those are very different environments, but the idea is always the same. A small function, a little “brain” that can regulate traffic, act as a traffic cop and regulate how applications interact for stability and security purposes.

How are you different from Service Mesh?

Jeff: I thought this functionality was handled by Envoy or was handled by Istio.

Tobias: To some extent, although it’s not easy to manage this well. Also, it seems as if the policy piece is slowly moving out of Istio. We work in a more general way and thus can control traffic between arbitrary sets of endpoints, which is not really a focus of service mesh at all. We need not only a VirtualDestination—we need a much more general way of dealing with these policies, and a way to manage them much more rapidly and, above all, in “WYSIWYG,” if you will.

We see policies in a service mesh as developer-focused abstractions. As a developer, I live in my IDE and, essentially, I just want to talk to a particular service. I don’t care which instance of the service and am more than happy if someone can take that decision off my shoulders and simply connect me to the best instance there is. So, service mesh essentially fulfills the role of a network “class loader.” And, of course, while it is doing that, you might as well throw in metrics and encryption. That’s really it. Service mesh is about “link me to that other dependency, to the destination service.”

As a result, expressing policies “in the other direction,” from destinations to sources, is difficult. It is very difficult to specify a policy that says, “Well, here I have a federation of systems, and I want to shield them from a denial-of-service onslaught,” or something like that. You would have to go in and create YAML and then debug that YAML and make sure it works. There’s no visual aid for this, and it’s very difficult to maintain. At the same time, it is devilishly easy to forget about these YAMLs in the configuration. And as a result, you are looking at something that is bound to be the cause of a future outage. At Glasnostic, we pull policy management out, centralize it and make it a first-class citizen so we can work with policies on the operations side.

Another advantage we have over service meshes is that you don’t have to keep worrying about how to keep your policies from affecting each other. It’s absolutely crucial that you can have a generic bulkhead somewhere and, at the same time, have a policy that overlaps it without any unintended consequences. It’s actually a fairly complex piece of code and logic that allows operators of large-scale application landscapes to layer these policies in any which way. That’s how we are fundamentally different from service meshes, although our customers think of us sometimes simply as a “better service mesh,” and that’s fine, too.

Adapting Policies with Glasnostic

Jeff: How would I make a policy change in Glasnostic?

Tobias: Everything is UI-driven. Of course, everything you can do in the UI, you can also do via an API, so other tools can integrate with Glasnostic easily. But to an operator, everything is UI-driven, WYSIWYG. So, as an operator, let’s say you’ve put a rate limit between two sets of services. Then you discover that your original limit can be improved upon, so you just go back and change it or pull it out. We show you a full overview of which policies are in effect at any given time. It allows you to work your infrastructure, your architecture, much more like an air traffic controller.
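As a rough illustration of a rate limit that can be adjusted while traffic is flowing, here is a minimal token-bucket sketch. It does not represent Glasnostic's UI or API, which are not described in detail here; `TokenBucket`, `allow` and `set_rate` are hypothetical names, and the token bucket is simply one common way to implement such a limit.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(
        self,
        rate_per_s: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_s = rate_per_s
        self.burst = burst
        self._clock = clock
        self._tokens = float(burst)
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, capped at burst.
        now = self._clock()
        self._tokens = min(self.burst, self._tokens + (now - self._last) * self.rate_per_s)
        self._last = now

    def allow(self) -> bool:
        """Return True if one request may pass, consuming a token."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def set_rate(self, rate_per_s: float) -> None:
        """Change the limit at runtime without resetting accumulated tokens."""
        self._refill()  # settle elapsed time at the old rate first
        self.rate_per_s = rate_per_s
```

Changing the limit is a single call to `set_rate`, which mirrors the idea of going back and adjusting a policy once the original value turns out to be too tight or too loose.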

Air traffic control screen
Runtime control.

Dev and Ops Views of Running Software

There are really two views on what you’re running and how you operate things. On the one hand, if you are a developer, you care about the “happy path,” the thread of execution. As I mentioned earlier, that’s where you want to have full visibility, tracing and all that. You’re much more like the pilot in an airplane, where that flight is the only thing that matters to you. Of course, there you need full visibility. You need to have a cockpit full of gauges, switches and controls. You want to know what the oil pressure is on, like, engine three.

On the other hand, as an operator in a large organization, you have thousands of these flights. You need air traffic control. You can’t let these flight plans take care of themselves, no matter how well planned they are. There’s always unpredictability. You have to deal with these things at runtime. Air traffic control, though, doesn’t go deep. It doesn’t look at what the oil pressure on a particular engine is. It looks at very simple, “golden signals.” It needs a call sign, position, altitude, direction, speed.

Glasnostic operates the same way. We take care of the entire architecture, how everything works in that shared environment, which is a finite resource, and make sure nobody steps on each other’s feet. This requires high-level signals of everything that goes on there. There can’t be a single service we don’t see. Our “call sign” is the service name, and then we need to know: how many requests are coming from a service? How long do they take on average? How many run at the same time, i.e., what’s their concurrency? And finally: how much bandwidth is involved? That’s the kind of visibility we give our operators, and that kind of visibility enables them to work at that “forest” level of visibility instead of at the “tree” level.
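The four signals listed above (requests, average latency, concurrency and bandwidth per service) can be computed from very little data. The sketch below assumes a hypothetical `Request` record with a source service name, start time, duration and byte count; it is an illustration of the signals themselves, not of how Glasnostic collects them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Request:
    source: str        # the "call sign": name of the calling service
    start_s: float     # when the request started
    duration_s: float  # how long it took
    bytes_sent: int    # payload size


def golden_signals(requests: Iterable[Request], window_s: float) -> Dict[str, Dict[str, float]]:
    """Summarize request rate, latency, concurrency and bandwidth per source."""
    by_source: Dict[str, List[Request]] = defaultdict(list)
    for r in requests:
        by_source[r.source].append(r)

    signals: Dict[str, Dict[str, float]] = {}
    for source, reqs in by_source.items():
        # Peak concurrency via a sweep over start (+1) and end (-1) events.
        events = []
        for r in reqs:
            events.append((r.start_s, 1))
            events.append((r.start_s + r.duration_s, -1))
        events.sort()  # at equal timestamps, ends (-1) sort before starts (+1)
        current = peak = 0
        for _, delta in events:
            current += delta
            peak = max(peak, current)

        signals[source] = {
            "requests_per_s": len(reqs) / window_s,
            "avg_latency_s": sum(r.duration_s for r in reqs) / len(reqs),
            "peak_concurrency": peak,
            "bytes_per_s": sum(r.bytes_sent for r in reqs) / window_s,
        }
    return signals
```

None of these values require looking inside a service, which is what keeps the view at the "forest" level.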

Integration with Traditional Monitoring Tools

Jeff: How does Glasnostic integrate with monitoring infrastructure?

Tobias: Remember, monitoring is a local concern. If you take care of one or two handfuls of services, yes, you’ll want to monitor them, no air traffic control required. Each team wants to monitor its services. But we live in a stitch-together world! A lot of systems are getting put together today—literally stitched together, wired up very quickly. At that level, monitoring is pretty useless. There’s simply too much detail, and relying on AI to sift through that detail has proven unconvincing so far. That level of detail makes it very difficult to see the big picture. Obviously, that’s where we come into play. We are all about seeing that big picture.

Monitoring is a local concern.

Now, of course, if we discover some degradation—let’s say latencies are blowing up in some corner of our environment—then we want to hand that over to whichever team is most intimately familiar with the services in question. But first, we want to do something about that latency—maybe we circuit-break tier-3 services to shed some load or whatever the operational pattern is we want to apply. Then we want to hand it off to the team that owns the services in question and say, “Look, this is what we’re seeing. We’re handling the situation, but maybe this is something you want to take a closer look at.” That’s when we’d create an incident ticket, post in Slack or whatever it is we do when we document an issue.

So, at that point, teams would look at their monitoring data and—maybe—find out what happened, why we suddenly saw way more requests than before, and so forth. But, because ops has already contained the situation, this can progress in a very efficient way as opposed to the headless-chicken mode I mentioned earlier. So, the integration between Glasnostic and existing monitoring and observability tools is important, but it’s also important to realize that it’s an integration between the global, big-picture view and the local, tactical one.

Glasnostic and Security

Jeff: What about security? How does Glasnostic work with security problems?

Tobias: That’s an excellent question because we’re used to thinking of security as a space in its own right, and it really is not that at all. I always encourage our customers to think about a governance spectrum. It’s essentially the spectrum of things we depend on but we don’t own. So, because our fate depends on these dependencies, we need to be able to control them.

On one end of the spectrum, there are performance issues. You may be limited by another component of the infrastructure or otherwise suffer a compounding effect. Then come stability issues. A poorly performing system tends to become unstable, be in a degraded state or fail outright. So, some stability issues become availability issues. Next to availability, we have security issues. Is it a DoS? Am I looking at a breach? Am I getting access violations because some backup route in the system is suddenly activated? And, at the other end of the spectrum, there is this specialized form of security, compliance.

All of these are on the spectrum of things we depend on but don’t own. So we need to control them. The interesting thing here is that, when you talk to some security professionals these days, you sometimes hear that they don’t want a security solution anymore. They’ve seen that a “pure-bred” security solution with zero benefits for the operational side is simply not effective. What you increasingly hear from these professionals is that they want essentially an operations solution that is also relevant to security.

Here’s an example: let’s say a system is breached, or let’s say you have a vulnerability scanning tool running. All of a sudden, this tool discovers a machine that hasn’t been patched. There’s a zero-day on it. Typically, that tool cannot connect this insight to a remediating action. There’s somebody who needs to look at it and do something about it. With us, you can simply connect this tool to our API and have the machine in question shut down. Because we are an operations solution that can not only see but also do, we are highly relevant to security—mostly on the access control side of the house but also in other places.

How to Avoid Cascading Failures: Real-time Control

Jeff: A failure that occurs under dire circumstances can create a cascading failure. Can you talk about how to avoid cascading failures and how to alleviate them?

Tobias: This is what we are all about! As you scale an architecture, as you compose more and more systems, these cascading issues quickly become the norm. Your system will always run in a state of degradation. While most of these degradations balance themselves out after a while, some of them end up spiraling out of control and then you have a massive outage—going back to the Robinhood example. Of course, there are hundreds of other examples—just listen to any SRE talk these days. It’s critically important to see these degradations and then be able to do something about them very quickly.

As I said earlier, these environmental factors, the cascading failures, noisy neighbors, ripple effects, feedback loops, etc., are like dominoes, where each system affects the next one in a very particular and highly unpredictable way. That unpredictability is the reason why you need runtime control.

You can no longer run these environments with the old DevOps lifecycle model where you’re going to monitor something, then learn something, then go back to writing code to address the issue and then release that as a patch. That takes too long! Even if you could address the issue you saw with smarter, better code—and that’s a big “if”!—it simply takes too long. Things will spiral out of control before you get a chance to release the fix. That’s why real-time control is so essential. That’s what we’re all about—real-time control.

The Inspiration Behind Glasnostic: Unpredictability in Composite Architectures

Jeff: What was your inspiration for starting Glasnostic?

Tobias: I’m a platform as a service guy. I was a co-founder of the company behind Red Hat OpenShift. I spent a few years exclusively looking at how we can optimally support the building of applications. Essentially, how to let people just add code and then run their applications. What I discovered later on, though, is that nobody writes these applications anymore!

As a developer, you may think you write an application, but once you deploy it, it gets connected to many other systems. And this connectedness and the complexity of it is only accelerated by the cloud. Everything is as-a-Service. You just summon an API, and voilà! You’ve got a bunch of new machines or data, or you’ve just transacted a payment or even started shipping your wares, all these kinds of things.

At the same time, the business needs to move faster! The only way to move faster as a business, write more code and get code to market faster is to parallelize your development. Ten years ago, you still had large development teams. Now they are very independent “2-pizza” teams, all deploying independently of each other, all the time. As a result, you’ve got a thousand deployments in your infrastructure every day. That means not just complexity but also highly dynamic environments. Now, I can deal with complexity if it’s static. I can iterate on it and perfect it over time, so I can be reasonably confident that it will work reliably. Similarly, I can deal with rapidly evolving architectures as long as they are not too complex—if they are simple. As complexity and rapid change come together, however, things get difficult quickly.

Complexity and rapid evolution create unpredictability, which leads to cascading failures.

As it happens, of course, market and technology forces have been conspiring for a while to push everyone up and to the right, into the quadrant where things are complex and dynamic. And complexity and rapid evolution create unpredictability, which in turn gives rise to those cascading failures that you mentioned earlier.

The only way to deal with unpredictability—and it’s really the “unknown unknowns” that we’re talking about here—is to be able to detect and respond to them in real-time. There is no way to prevent these unpredictabilities and the cascading failures they enable with “better code.” That’s the idea behind Glasnostic. All these systems we are stitching together today in the enterprise—with workloads on premises, across clouds and at the edge, and some workloads even migrating between locations—those increasingly diverse and relentlessly growing systems create an unpredictability that must be managed at runtime.

How these workloads interact becomes the key behavior that you need to be able to control. If you can’t do that, it’s like running an organization with 50 or 500 teams without management. Of course, that would be insane. Every team would end up doing whatever they want. It would be total chaos. Yet, in IT, that’s how we do things. We deploy code and expect it to run at any scale and for all eternity. This works up to a point in simple and static architectures. It does not work for complex and dynamic environments.

We are in the business of enabling operations to run the show.

Learnings From PaaS: Operations is the Key Capability

Jeff: Tell me what else you’ve learned in this past journey over the last five or so years.

Tobias: I believe the PaaS to beat is ultimately AWS. Or Azure. Essentially, the idea of PaaS in my mind comes down to this: I am a developer, I want to code up some stuff that my business needs. How many Lego blocks can I use to make this as quick as possible?

Very simply: I’ll use a hosted MySQL service because I have no need to run my own. I’ll use an authentication service because I have no need to run my own. There are 5,000 such blocks on AWS alone! There are thousands more of these Lego blocks on Azure, and there are more blocks on GCP. Essentially, to me, this is PaaS today.

I think we are way past the stage where we’re writing little two-tier applications that would benefit from a “PaaS for little two-tier apps.” There are a few companies that still run hundreds of these little two-tier applications independent of each other, but they’re a dying breed.

Everybody else I see in the marketplace runs essentially a service landscape. What is a service landscape? It’s a decentralized architectural style. It’s not a single microservice application. It consists of many microservice applications that depend on each other. And it’s not just microservice applications—it may include serverless functions, VMs, metal or even mainframes!

That’s what we call a service landscape. A network or a web of applications, if you will. Everything today has an API, so it will become a dependency for something else. That’s where you, as an operator, as a business, deal with the complexities and dynamics of a multi-body physics problem. The only way to disentangle this is by rapidly observing what’s going on, at the highest possible level, at the air traffic control level, and then by doing something about it, again at the air traffic control level.

That gives everybody in the organization the guidance they need to then go down and diagnose deeply if needed. Most of the time, organizations discover that a detailed analysis is just a waste of time. Most of these issues are flukes, they’re random, and all that’s required is simply that you can do something about them. Obviously, some of these issues you want to learn from so you can prevent them from happening the next time around.

So, that’s my learning from my time on the PaaS side: the building of applications is considerably less important than your ability to run them. Think about it as “nature versus nurture.” Years ago, the expensive part of software was the writing of code. Capable developers were expensive and hard to get. There were not enough to go around. Today, though, because our tech stack has grown so tall and because languages, frameworks, libraries and other abstractions have gotten so much better and more numerous, and because today’s code builds on top of so many existing pieces, the actual amount of code represents an increasingly smaller piece of the entire “application.” So, it’s only natural that the importance of code goes down and the importance of how services are composed goes up.

As developers, we still think that the value of everything is in writing the code. That’s actually not true! We think of it like cars: building the car, that’s what creates $20,000, $40,000, $100,000 of value. Yes, there will be gas, insurance, maintenance etc., later on, but that’s negligible compared to the value of the car. That’s how we, as developers, still think of our code.

In reality, though, it is much more like a child. Making it is relatively easy. Raising it is hard! If I can reuse a number of existing building blocks, I can write meaningful code in half a day, and I can run it for a few dollars per day. The problem is: how will it behave next week when, all of a sudden, two other services depend on it? Now, does my scaling still work? Probably not. Once it is deployed, you need to continue to “raise” that code, like a child. The full lifecycle, the runtime behavior of everything you do as a developer, is increasingly the important piece. That’s the motivation behind Glasnostic, and there is nothing like it in the marketplace today.

Prediction 1: Service Landscapes Will be the Default Architectural Style

Jeff: I’d like to now get your predictions on the next 5 to 10 years of what PaaS and microservices look like.

Tobias: My prediction is that the trend to build new functionality by simply stitching more services together is inescapable. In every industry vertical, not just financial services, managed service providers, and so forth. It’s going to be everywhere. Ultimately, composing services in the enterprise, treating them as capabilities to be extended, to be built on, is the only way the business can get to market fast. The business wants to be agile, run an agile operating model. And the slowest part today is getting new code into production. It’s not just a matter of deploying code. It’s a matter of being able to control the chaos that all that composing creates.

We live in a post-distributed-systems world.

One of my fundamental beliefs is that we live in a post-distributed-systems world. As engineers, we tend to marvel at distributed-systems engineering. It’s beautiful and challenging. It’s the “black belt” of software engineering. That’s where our “man versus machine” fantasy is coming into its own. The important piece to understand here, though, is that it is also too slow in today’s world. It is too brittle and it is too expensive. And most of the distributed computing primitives I may need in my logic are already solved. I can just use a Dynamo, a Cassandra, Erlang or whatever else I may need.

Also, if you think about it, there is no distributed system in nature! There is no Paxos in nature. It just doesn’t exist. We like ACID compliance because it makes it easy for us to reason in simple logical terms and because it abstracts us from the underlying implementation complexity. But this gets difficult quickly when we try to do this across 2,000 machines. In my view, these things, as interesting and challenging as they are, are a niche problem.

In the marketplace, in real life, we’re going to see a general adoption of composite architecture, the stitching together I mentioned, the wiring up of systems because whatever new functionality the business wants, I already have 90% running somewhere! I just need to build on top of it, and that’s easy because everything has an API already. It’s inescapable. We’re going to see these service landscapes become the default architectural style. Not even just microservice applications. That’s the developer view of the five or ten services I’m responsible for. These are only one cog in a way more extensive system. And these service landscapes require a better way to be operated.

Prediction 2: Ops Teams Will Manage Dev Teams

The other important prediction I have for the next five years is a cultural one. Our heads are still in a world where the developer is the kingmaker. We’ve been in it for probably ten or more years. Again, we overvalue developers because writing code used to be expensive. I believe that model is going to flip around as more and more systems are composed together. We are entering a world where the operations teams are the “kingmakers.” As a developer, I am too deep in code to know what the big picture looks like at the global scale of the service landscape. That’s what the operations teams know. The business needs to start talking to their operations teams. Today, the operations teams are typically below the development teams from an organizational perspective, which is odd because the business wants results, and the operations teams know how to deliver results!

Operators are the new kingmakers.

As a developer myself, I know I have 100 definitions of “done,” and I know the age-old complaints about developers in that regard. “It runs on my laptop!” “It’s ‘done’ in testing.” “The ticket is done,” whatever. As developers, we don’t know how to deliver things. Operations people do. My prediction is that the old model of the developer being the kingmaker will flip around. We will move to a model where the business will rely on operations for anything that’s delivery-related. The operations teams, of course, know every single developer, and they know the intricacies of development, how something is deployed, redeployed, patched, all these things. That’s really the best way of ensuring an agile organization, in my view. Have the business talk to ops.

Jeff: That sounds like a great place to close off. Anything else you want to add about microservices or Glasnostic?

Tobias: No, I think we touched on everything. It’s a pretty wide field, right?

Jeff: Okay. Well, thanks for coming on the show, Tobias.

Tobias: Absolutely. Thanks for having me.