Photograph of a plane flying into clouds
There are different flight rules in the cloud (Source: unsplash.com/@inesdanselme)

Getting Your Head in the Cloud

I recently had the opportunity to speak with Alexander Ferguson, host of the UpTech Report (video here). We discussed the growing complexity of the cloud and how runtime control is essential for enterprises to create resiliency and assure the customer experience. Below is part one of our conversation, edited for clarity.

• • •

Cloud is Complex

Alexander Ferguson: Welcome to the UpTech Report. Today, I’m excited to be joined by Tobias Kunze, CEO and co-founder of Glasnostic. Welcome Tobias, good to have you on.

Tobias Kunze: Thanks for having me.

Alexander Ferguson: Your platform is focused on making cloud applications resilient. What was the problem that you saw and set out to solve with Glasnostic?

Tobias Kunze: The root problem we address is the increased application complexity as we move to the cloud. Everything gets more complicated. We have more pieces running in more places. It’s not just cloud vs. on-premise. There’s multi-cloud, hybrid cloud, edge computing—and it’s not going to get simpler anytime soon. And frankly, the problem is always that our applications are not what they used to be 20 years ago. Today, the interesting data is always outside of the application. This means you have dependencies. The connectivity explodes. It’s not just a proliferation of application pieces—think microservices—it’s the number of dependencies you converse with. SaaS services, other services, components you depend on, cloud services!

Alexander Ferguson: After observing this, how did you develop that into the product itself?

Tobias Kunze: It became very clear that there is a natural limit to how far we can get with distributed systems engineering. As a developer, it is evident to me that you cannot build a system with 500 or 5,000 moving parts and think all that acts like a big in-memory process space that you own. You are not calling out into some abstract service. Everything you use is shared, everything is multi-tenant. It’s not like you own any of that. You just use it. And in aggregate across the architecture, this becomes very difficult to manage.

Going Beyond the Code

Alexander Ferguson: So the scope of complexity is definitely only increasing. But where Glasnostic plays a role is not prior to launching, right? You come in on “Day 2”, correct? Can you help me understand it?

Tobias Kunze: We used to think that in applications, or in anything we do in software, the value is in writing the code. And it’s true that if you write a small piece of code, that’s where the value is. Once it’s deployed, it’s just going to run. But the more complex applications become, the more of the work actually happens once the code is live. It’s kind of like the old nature-versus-nurture debate. Parents know that making the baby is not where the work ends. Raising it is the work! That’s “Day 2”. And we are in that stage in software today. Twenty years ago, writing the code was the difficult thing to do. Today, it’s all about keeping it up in the face of change and rising complexity.

Alexander Ferguson: Another analogy is that if you have a team of 20 people, the way you manage a team of 20 is different when you start to scale to 50 to 200. When it comes to the software side and managing an application, what are some of the complexities that come in when you’re trying to scale, and what should you be looking for?

Tobias Kunze: I think it really comes down to this natural limit of how much we can engineer deterministically. We can go into distributed systems engineering and try to do what we can to predict and prevent every single corner case—any single thing that could happen to us in production. But of course, there’s a natural complexity and cost limit to that. Beyond that limit, you simply need to deal with it: if you can’t prevent it, you need to manage it! That’s similar to when a company starts with a few people, and there is no team structure. Everybody is heads down. If you have a question, you yell to that guy or gal at the next desk. But as you grow, responsibilities divide, and you need to get some kind of management in place. The only purpose of management here is to make sure nothing’s in the way of the teams doing the work. Management is all about being situational, reacting to the situation. Now, as software engineers, we’ve been brought up in a world where we write code and then we simply let it run, so that idea of having to manage the way code is running is kind of foreign to us. But it’s absolutely essential. Before applications became really complex, that kind of management was done by the ops teams, who made sure nothing got in the way of the applications. But now, things that get in the way of applications are other services and other software components. And that is an exploding problem.

Controlling the Unknown Unknowns

Alexander Ferguson: Let’s get a little bit nitty-gritty on the details of the platform itself. How does it work, and who’s actually using Glasnostic?

Tobias Kunze: How does it work? If you really want to manage any of these unknown unknowns that happen in production, you cannot rely on agents. You can’t have extra code running on a machine or any given workload as you may not own it (and most of the time you don’t own it). So you need to look at how these things behave on the outside. And the way you define behavior is by looking at wire signals. You’re looking at outward behavior—here’s a load balancer, tons of stuff coming in, nothing coming out. That’s behavior! I don’t care what happens in the box. I look at the outward behavior. And that becomes the most important paradigm shift. I like to look at this as air traffic control. You’re no longer managing a single flight from within the cockpit, where you have all that observability and those switches and all that. You have hundreds or thousands of flights in the air—you are managing the airspace! You are no longer trying to ensure the correctness of a transaction; you are solving a management issue. So you need a different set of verbs and nouns. For the air traffic controller, those are the call sign, the position, altitude, direction and speed—maybe a little bit of weather data. For us, it’s very similar: we look at our “call sign”—the service name—and how many requests are running, what’s their latency, their concurrency—how many run in parallel—and their bandwidth usage. Crucially, not even an error rate at this point because that’s an application-specific signal. It’s not “physics.” If I want to manage the airspace of any complex cloud application, I need to see everything. There can’t be anything I don’t see! Therefore, Glasnostic is touchless, agentless and based on “golden” wire signals. We are a “bump in the wire.”
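To make the four wire signals concrete, here is a minimal Python sketch of how they could be summarized per service from requests observed on the wire. It is not Glasnostic’s implementation. The `WireRecord` shape, its field names, and the `golden_signals` function are assumptions made up for illustration, and the code only uses what an outside observer could see: the service name, when a request started, how long it took, and how many bytes moved.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WireRecord:
    """One request as seen from outside the workload (hypothetical shape)."""
    service: str       # the "call sign": name of the service being called
    start_s: float     # when the request started, in seconds
    duration_s: float  # how long the request took, in seconds
    size_bytes: int    # bytes transferred for the request and its response


def golden_signals(records, window_s):
    """Summarize requests, latency, concurrency and bandwidth per service."""
    by_service = defaultdict(list)
    for record in records:
        by_service[record.service].append(record)

    summary = {}
    for service, recs in by_service.items():
        # Peak concurrency: sweep over start (+1) and end (-1) events in time order.
        events = []
        for r in recs:
            events.append((r.start_s, 1))
            events.append((r.start_s + r.duration_s, -1))
        events.sort()  # at equal timestamps, ends (-1) sort before starts (+1)
        running = peak = 0
        for _, delta in events:
            running += delta
            peak = max(peak, running)

        summary[service] = {
            "requests_per_s": len(recs) / window_s,
            "mean_latency_s": sum(r.duration_s for r in recs) / len(recs),
            "peak_concurrency": peak,
            "bytes_per_s": sum(r.size_bytes for r in recs) / window_s,
        }
    return summary
```

Calling `golden_signals(records, window_s=10.0)` returns one small dictionary per service name. Note that nothing in it depends on what the service does internally, and there is deliberately no error rate, in keeping with the point that errors are an application-specific signal.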

Alexander Ferguson: So who is that air traffic controller?

Tobias Kunze: The titles in software are really all over the place. I like to say: Glasnostic is for people who are responsible for deployments. That can be the platform developer, the app developer, the SRE that’s embedded in the app team, the platform team, the operations team, DevOps—it’s all over the place! Whoever is responsible for deployments. Whoever needs to make sure that nothing steps on something else’s toes, that everything is always available, always stable and secure. Now at that level, I’m not really concerned with bugs. Bugs in code are actually really easy to find. You can unit test your code and find them. The problem is if you have 400 or 500 systems that you put together and something doesn’t work that is not a code issue but an environmental issue. That’s what we are after, and that’s what we make controllable for our people—the people that are responsible for deployments.

Remediate First

Alexander Ferguson: It’s so interesting that you’re homing in on the forest rather than the trees, on remediation instead of the root cause. Why do you take that approach with Glasnostic?

Tobias Kunze: Absolutely, you need to start with remediation. How can we remediate the issue? You do not care about the root cause at that point because you care about a quick fix, about how you can keep that customer experience up. You need to end that outage, and you need to keep whoever is sucking up your bandwidth from doing so, or whatever the case may be. Keep in mind, it’s all unknown unknowns most of the time, so it’s really important that you manage the situation before you then decide: Is this something you want to do something about long term? You probably want to look at different ways you might be able to fix it. The problem with root cause is really that once you think about it, there isn’t such a thing. It’s like the source of the Nile: there isn’t one. The river has 1,000 different creeks, and which one counts as the source is purely a matter of definition and convenience. Ultimately, if you have a hammer, the root cause is the nail that you find. That’s really what a root cause is. And the big problem with root causes is: they take a long time to find. It’s a safari. That’s why outages take a day or even multiple days!

Alexander Ferguson: Can you give me a use case that illustrates what Glasnostic does?

Tobias Kunze: Yes. If you just look at the big outages that happen and the post mortems that companies publish all the time, I find it super interesting that 90% of the post mortem is the detail about how heroically they figured out the issue and what they did to contain it. And the real nugget of the story is always the first 10%. That story arc is always the same. It always starts with, “Oh, we’ve been doing something we’ve always been doing, but random circumstances conspired, and a component reached some weird limit that nobody could have ever predicted, and that was completely unforeseeable.” And then mayhem ensues because that limit triggers a whole chain reaction of events and results in a big outage that took a day or more to fix. That’s totally backward. When we discover something, we should automatically do something about it. And frankly, the vast majority of outages can be mitigated, reduced to a mere degradation if you could just apply classic backpressure against whatever causes that limit to be hit.
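One simple form of the backpressure mentioned here is to cap how many requests may be in flight toward a struggling component and to push the excess back to the caller. The Python sketch below illustrates that idea only. It is not how Glasnostic applies backpressure, and the names `ConcurrencyLimit`, `call_with_backpressure`, and the `"rejected"` return value are hypothetical.

```python
import threading


class ConcurrencyLimit:
    """Cap the number of in-flight requests to one component (illustrative only)."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Take a slot if one is free; return False when the cap is reached."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Give a slot back once a request has finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def call_with_backpressure(limit, request, handler):
    """Run handler(request) only if the limit allows it, else push back."""
    if not limit.try_acquire():
        return "rejected"  # the caller is expected to slow down or retry later
    try:
        return handler(request)
    finally:
        limit.release()
```

With a cap in place, a component that hits an unforeseen limit serves fewer callers more slowly instead of collapsing, which is the difference between a degradation and an outage. Choosing the cap and deciding what callers do when they are pushed back are left open in this sketch.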

Alexander Ferguson: You’re saying you can prevent outages just by simply making slight adjustments versus everything goes down until you fix the root cause?

Tobias Kunze: Yes, absolutely. If there is a root cause at all and if you think that it is worth fixing.

It is a Shift in Mindset

Alexander Ferguson: Is everyone already attempting to do this, or is it a shift in mindset?

Tobias Kunze: It’s a paradigm shift. If you leave production to developers, their gut reaction is always to find out what went wrong. “Oh, let’s dig into this! And, by the way, we have all the time in the world. It’s important to get it right.” But it’s not. If you are in a car wreck and first responders come in, they don’t call in the oncologist or an orthopedic surgeon. They stop the bleeding first, and only then will they look at whether there is anything else they should do. And because we’ve been brought up as engineers to think methodically and procedurally, to write code—and code is the epitome of determinism—it is very difficult for us to jump out of that mindset. It’s definitely a shift in mindset.

Alexander Ferguson: Can you tell us more about who your customers are?

Tobias Kunze: Our target audience is people who live and breathe these problems. If I’m touching something in production, say with a new deployment, then one out of three times something breaks somewhere else. It’s a matter of complexity. Those tend to be the platform developers, the platform operators, all the way up to managed service providers: the people who expose Lego blocks to their internal or external customers and run a ton of different workloads whose demand curves are pretty unpredictable. So, our customers are companies where everything is already in a fairly unpredictable place. On the other hand, if you have just five services, a small microservice or web application, you’re not our customer.

Alexander Ferguson: When you get to a scale of enterprise where there are just too many applications, there’s no way that you can dig into all of them. That’s where the real power and the purpose of your platform come into play. You need that bird’s eye view.

Tobias Kunze: Everything’s connected, and everything changes all the time. What are you going to do when something breaks? Are you going to push back and not deploy anymore while you figure it out? You need to invest in a capability to observe and control.

Looking Ahead

Alexander Ferguson: Looking ahead, how does the space change as we go forward?

Tobias Kunze: It’s very clear: applications are not going to become less complex. Our applications will be based on more data, and more data means more connectivity to other services and other sources of data. And because there are practical limitations to how much data we can process in a single step, there will be intermediate stages, a proliferation of services, redundant services, and different clouds and edge locations. It’s only going to explode. We are reaching a limit where most applications I see today are not synchronous anymore, and everything is eventually consistent. Even where that application landscape ends is not really clear because you’re somehow connected to a cloud region that has, say, a peering failure, which is why you are now seeing a retry storm, or a feedback loop, or some other behavior. Most companies today fly totally blind to these behaviors. These degradations happen all the time. At a certain scale, something’s always off. You just don’t know how much time and resources you’re wasting by not seeing it. We need to be able to control these behaviors and create resilience.

Alexander Ferguson: I really appreciate you being able to spend time with us today. For those that want to learn more, visit glasnostic.com to book a demo or sign up for an account!