Photo of different succulents
Resilient applications (Source: unsplash.com/@yendvu)

Resilient Applications: Why They’re Critical

Earlier this month, I had the pleasure of talking with Matt Wolach on his “SaaS-Story in the Making” podcast (the videocast is here). Among other topics, we touched on the challenges of today’s cloud operations for architects, operations and security teams and how to make them resilient through real-time observability and control. Below is the transcript, edited for clarity.

• • •

Matt: I have some questions for you. Do you want to be in control of your cloud architectures, or have you ever wondered how to make your application resilient? This is SaaS-Story in the Making. I am your host, Matt Wolach, and I am thrilled to be joined by my guest Tobias Kunze. How are you doing?

Tobias: I am doing fine. How about yourself?

What have you been up to lately?

Matt: I’m doing great. So, Tobias is the CEO and co-founder of Glasnostic. Glasnostic puts operations teams in control of their complex cloud architectures, so they can actively assure the digital experience. I’m really excited to learn more about what Tobias is doing at Glasnostic. He was also formerly the co-founder and CTO of Makara, which was eventually acquired by Red Hat. So tell me, what have you been up to lately and what’s coming up?

Tobias: We are at the tail end of COVID, and there’s lots of movement in the marketplace. We address this crazy explosion of applications and service architectures that really started with application platforms like Heroku and Red Hat OpenShift—my previous company was a Platform-as-a-Service, where we started out hosting enterprise applications behind the firewall, and that was then airlifted onto Kubernetes if you will. So, starting with that, we had this massive explosion of services and applications that are now increasingly connected. So, not just individual microservice applications. Imagine many of such applications running in the enterprise plus shared services and service reuse, so what we end up with is a lot of decentralized applications that talk to each other. And what’s different with these kinds of applications compared to what we’re used to as technologists—when we write things, architect things and design applications—is that it becomes pretty unpredictable how these systems work in production, in real life. So that’s what we focus on. We make these application behaviors visible and, more importantly, controllable—in real-time.

What does it mean for an application to be resilient?

Matt: That’s fantastic. I know that there are a lot of applications running around, especially in an enterprise. When you look at a tech stack, for all of them to try and talk to each other and work well with each other, it gets to be very difficult. But what I want to understand is—and I saw this on your website—what does it mean for an application to be resilient?

Tobias: That’s an excellent bridge to what we’re doing. You want to be deploying things fast, but once something is deployed, there’s nothing you can do about it, short of rolling back or adjusting the capacity. But those are very crude measures to apply! So what we do is look at applications that interact with each other and other infrastructure and make these interactions visible so you can tune them in real-time. Instead of merely providing sheer capacity and hoping that that will keep everything stable, we apply real-time control.

So, if you compare modern cloud architecture to air traffic, then we are air traffic control. You didn’t need air traffic control back in the olden days when you had ten planes in the air at any given time, and everybody flew by sight. You could take off and know where you would land. But with hundreds and thousands of planes in the air, you need to be able to take control over all the unpredictability that is bound to happen. It’s not just weather; it’s other delays, it’s emergencies that occur—not necessarily life-threatening, but you could be running out of fuel, slots opening and closing at an airport, and so forth—anything can happen.

The equivalent of that is: the fate of a service that you deploy in a service landscape today critically depends on other shared services or a database or is affected by some other service that clobbers the same API. So these environmental factors and effects become much more important than just the correctness of your code.

In technology, we have always been ultra-focused on tracing things and making sure that our code works correctly. And that is really difficult at a distributed systems level, where we build 20 or 50 components that actually need to work together in a very strongly coupled sense. But what’s happening today is that, as developers, we don’t create full-blown applications anymore. We create individual services that are more narrow in focus and that are then composed at runtime. So, as a developer, I provide a new service that implements some functionality, and now anybody else in the enterprise can use it. That’s an entirely different type of architecture than what we’re used to. We call this a service landscape. It’s essentially a landscape of services that everybody is running now. And once you run a service landscape, you need to deal with unpredictability, in real-time, in production, where things happen that are very difficult to track and discover. So that’s precisely what we do. High-level observability—at the air traffic control level, if you will—and real-time control of what’s happening so you can actively shape the customer experience.

Who should be thinking about runtime control?

Matt: I love that air traffic control analogy. I am an airplane buff myself, so it made a lot of sense to me. I think that’s really great that it’s controlling and making sure that everything’s working together. These days with the amount of traffic that’s going on, with so many different systems, I think that’s fantastic. So who in particular should be thinking about solving this problem? Who exactly are the people at these companies that should look into this and say, wow, we need to figure this out?

Tobias: That’s an interesting question because there has been confusion in the industry over the past decade about who is responsible for delivering services. Who is really doing the operations piece? What’s important to realize is that, on a small scale, if you’re a San Francisco startup with one product, you have one application, and you build it as you’ve always built applications. And at that scale, it is often the developers themselves who are responsible for service delivery. They “deliver” the application, and they’re responsible for keeping the lights on.

Larger organizations with 500 applications that are connected over shared infrastructure, shared services, or even directly—large enterprises, in other words—have dedicated teams that are responsible for not just keeping the lights on but also improving how they keep the lights on. So we get a lot of interest from these target audiences because they know they are running, whatever it is, 200, 500 services today. And it’s going to be twice as many next year because the pace of innovation is accelerating pretty rapidly.

Taking control of large-scale cloud operations

Matt: So how did you come up with this idea and decide to solve for this?

Tobias: It’s come out of the experience of doing OpenShift at Red Hat. Remember, back then, when we did Platform-as-a-Service, it was really about how to optimally support the building of applications. Just write your code, and we will run it for you. That works great if you have a two-tier application like a PHP or a Ruby or a Java app talking to a database. Then, what we realized when we ran OpenShift publicly, was that, frankly, nobody writes these applications anymore. That two-tier, PaaS-hosted application is a service now. It has an API. And what happens is that 20 other applications use that service and, as a result, it doesn’t make a lot of sense to treat this service or think of this service as an application anymore. It’s not a stand-alone deployment, it’s a dependency of ten other services, or it itself depends on ten other services, or both.

So we have this decentralized landscape of applications or services now. That’s the key insight, really. That, once you go beyond a single application blueprint, there are so many things that become unpredictable that you can no longer design upfront. When writing an application in the old world, I would ask, “How should I do this?” or “How should I design this?”, “What should the architecture be?”, “What components do I need?” and so forth. Then I would create the components or pull them from somewhere, I would stand them up, I would calibrate them until they work, and I would iterate on their functionality until there is nothing left to iterate on. And that would be it; it would be pretty predictable!

Also, what I’d really want to do in that kind of environment is to get a full runtime view of how transactions execute. I would want to trace them, and I would want to debug them in runtime. I would even stand the application up in a staging environment before deploying it to production.

Once you go beyond a single application blueprint, you can no longer design upfront.

Now, if I have even as few as ten of such applications, and these applications are connected to each other, I have to decentralize development—parallelize it if you will. I have to have ten teams instead of one large one that releases in a coordinated fashion every six months. I’m going to have continuous deployment; I’m going to deploy many times a day. So my entire balance of what’s running in this connected infrastructure and this connected architecture is changing every five minutes, or maybe every 10 seconds! So there’s very little architecture I can do. Basically, if I write an application today that has an API—so, by definition, is a service—I can no longer architect because I don’t know who’s going to call me next week or even five minutes from now. All I can do is ensure I can scale out if needed. Of course, that only works in certain contexts. I can’t scale infinitely because things change fundamentally with scale, but at least to some extent, I can scale.

Why is it critical to control what’s going on?

But beyond that, since I can’t predict what’s going to happen at runtime tomorrow or even five minutes from now, I need to have runtime control. Below a certain scale, all that matters is the correctness of my code and the correctness of my transactions. Just like it matters that your aircraft is airworthy and that you fly it correctly and that you reach your correct destination. As a pilot, you want to know what the oil pressure is on engine 2. But once you have hundreds of planes in the air, it’s no longer just about mechanical airworthiness and piloting the flight. What matters is the environment. How flights “work together,” how they share the airspace. You need to control the environment. You need air traffic control.

What matters is the environment.

So, as far as the whole, the environment, is concerned, it is no longer about the inner workings of code. It is about outward behaviors. How many services are currently hitting this particular database? Is this “within spec”? Does the load cause another instance to spin up? Does it cause sharding to rebalance? Whatever it is that affects the consumers of that database. These behaviors make the service landscape very unpredictable.
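To make the question “how many services are currently hitting this database, and is that within spec?” concrete, here is a minimal sketch. It only illustrates the idea of looking at outward behavior rather than code. The `Interaction` record, the service names, and the `max_callers` threshold are all assumptions made up for this example and do not describe how Glasnostic works.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One observed request between two services (hypothetical record)."""
    source: str
    destination: str


def callers_by_destination(interactions):
    """Map each destination to the set of distinct services calling it."""
    fan_in = defaultdict(set)
    for item in interactions:
        fan_in[item.destination].add(item.source)
    return dict(fan_in)


def out_of_spec(interactions, max_callers):
    """Return destinations that have more distinct callers than max_callers."""
    fan_in = callers_by_destination(interactions)
    return sorted(d for d, callers in fan_in.items() if len(callers) > max_callers)


# Made-up sample of observed traffic in a small service landscape.
observed = [
    Interaction("checkout", "orders-db"),
    Interaction("billing", "orders-db"),
    Interaction("reporting", "orders-db"),
    Interaction("checkout", "inventory"),
]

flagged = out_of_spec(observed, max_callers=2)
print(flagged)
```

Note that nothing here inspects what any service does internally; the check works purely on who talks to whom.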

And yes, sometimes it may be just a matter of scaling up in time. But at other times, I need to be able to, for instance, resist the demand and simply push back. For instance, I may have to be able to exert classic backpressure or to apply some other operational pattern.
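As a rough illustration of what “pushing back” can look like, the sketch below admits a fixed number of in-flight requests and rejects the rest immediately instead of letting them pile up. This is one common form of backpressure, not a description of any particular product. The names `BackpressureGate` and `Overloaded` and the limit of 1 are assumptions chosen for the example.

```python
import threading


class Overloaded(Exception):
    """Raised to tell a caller to back off and retry later."""


class BackpressureGate:
    """Admit at most `limit` in-flight requests; reject the rest immediately."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def call(self, handler, *args):
        # Non-blocking acquire: if no slot is free, push back instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many requests in flight")
        try:
            return handler(*args)
        finally:
            # Free the slot whether the handler succeeded or raised.
            self._slots.release()


gate = BackpressureGate(limit=1)


def nested():
    # While this request holds the only slot, a second request is rejected.
    try:
        gate.call(lambda: "inner")
    except Overloaded:
        return "inner rejected"
    return "inner admitted"


first = gate.call(nested)
second = gate.call(lambda: "ok")  # the slot is free again by now
print(first, second)
```

The design choice is to fail fast: a rejected caller learns right away that it should slow down, rather than waiting on a queue that keeps growing.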

As it turns out, almost nobody has this capability today. We used to be able to do this before cloud, but today it is very difficult to do, so we bring all these operational patterns, these control primitives, if you will, back to the operations teams so they can actually take on workloads in real-time, as many as they want, because now they have runtime control.

What were some of the best things you did to grow the business early-on?

Matt: What I want to know is, you got a great idea that you launched, but how did you go about growing the business in the early days? What were some of the best things you did along your path that really helped you get to where you are now?

Tobias: I think what really helped us—and what would help any startup—is that we built our technology team outside Silicon Valley. It is very difficult for a startup to compete with the salaries that, e.g., Google pays in Silicon Valley. Luckily, there are hungry developers elsewhere! As it happened, my co-founder moved to Taipei to build up an office for his previous employer, and so we built our R&D function there, which has worked out exceptionally well for us. Taiwan has excellent universities, and not everybody wants to work for Foxconn or TSMC. So, working for a software startup is appealing. That was one thing we did that proved to be crucial for us.

Another thing that was great for us is we got very large enterprise contracts early on.

Matt: Okay. So you went after the big ones early!

Tobias: Yes. We thrive on complexity, and that’s what large enterprises have. Remember, if you are a 10-person San Francisco startup, you have one product, you have one application. But things are different in larger organizations. They have a hundred different development teams and different business units and groups, and everything is transforming digitally. Everything turns into digital processes; there is a lot of movement going on. And clearly, the applications that these organizations develop rely on existing services. They don’t build things from scratch. They build on top of existing architecture. As a result, things turn into spaghetti architecture very quickly.

And of course, the cloud accelerates this problem because all of a sudden, I can stand these services up in no time. Think about it, even two years ago, writing a service in something like JavaScript at a reasonable maturity took maybe half a day at most because 80% of the functionality was already there. So building that service on top of existing services is super easy. Plus, I can run it for two dollars a day on some cloud infrastructure or in my company’s own on-premise environment. But the problem is, I’m talking to ten of my company’s tier-one services. So, as an operator, I can’t simply take this on. I need to tell the team, “Guys, great that you wrote this, but I need you to test this first.” And there’s a whole queue of applications to be tested. So you’re going to be deployed in a few months. That’s a real problem in the enterprise.

Solving the operational crisis in the enterprise

In other words, there’s a massive operational crisis going on in the enterprise that isn’t solved by throwing more traceability at it. Tracing and thread-of-execution observability help the developer, because now I can throw stuff over the fence faster than before. I can see what’s happening in production, and I can actually fix it. But the problems that we are talking about are not problems of logic anymore. The problem is that environmental forces determine the success of your service landscape. The problem is that my code can be one hundred percent correct, and it gets deployed, and my application doesn’t work simply because I’m not getting the right service at the right time.

If you look at every single outage in any company that publishes a post-mortem, the story arc of these outages is always: “We did something we always do, but circumstances conspired, and we ran into an unknown limit that nobody could ever conceivably have thought of, and mayhem ensued.” This mayhem is then described in much detail, but, frankly, the details don’t matter. What matters is, if they had been able to just exert classic backpressure against whatever happened, those outages that lasted a whole day or longer could have been turned into mere degradations. And of course, it is much better to run at 95% with a little backpressure against requests than to be out for the day.

You can discover and apply these things from a very high level very quickly. It’s akin to telling everybody en route to an airport that there is a lot of fog, so please hold until we bring you in one-by-one. And, as a result, nobody crashes, and everybody gets to land safely.
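The holding-pattern analogy can be sketched in a few lines: arrivals wait in a bounded queue and are admitted a few at a time, and once the queue is full, new arrivals are told to come back later. This is only an illustration under assumed names and numbers (`HoldingPattern`, `max_waiting`, `admit_per_tick`, the flight IDs); real systems would tie admission to measured capacity.

```python
from collections import deque


class HoldingPattern:
    """Hold arrivals in a bounded queue and admit a few per tick."""

    def __init__(self, max_waiting, admit_per_tick):
        self.max_waiting = max_waiting
        self.admit_per_tick = admit_per_tick
        self._waiting = deque()

    def arrive(self, request_id):
        """Queue a request, or return False if the holding pattern is full."""
        if len(self._waiting) >= self.max_waiting:
            return False
        self._waiting.append(request_id)
        return True

    def tick(self):
        """Admit up to admit_per_tick waiting requests, in arrival order."""
        admitted = []
        while self._waiting and len(admitted) < self.admit_per_tick:
            admitted.append(self._waiting.popleft())
        return admitted


pattern = HoldingPattern(max_waiting=3, admit_per_tick=1)

# Five flights arrive at once; only three fit in the holding pattern.
accepted = [pattern.arrive(f"flight-{n}") for n in range(5)]

# Bring them in one by one; the fourth tick finds nobody waiting.
landed = [pattern.tick() for _ in range(4)]
print(accepted, landed)
```

Service is degraded for the two flights that were turned away, but everything that was accepted is handled in order and nothing collapses.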

The cloud chaos is real, but up to a certain scale, you could conceivably get by with “better engineering.” Above that scale, however, you need runtime control. Enterprises are above that scale.

If you build it, will they come?

Matt: I like the idea of getting to enterprise customers. How did you approach them with no customers already under your belt? What exactly did you do to convince them that Glasnostic was the way to go?

Tobias: That’s the million-dollar question! Are you going to build and then wait for them to come? Or do you work with a customer who will tell you she only needs a bit more tracing or whatever tactical thing it is that scratches her itch? It is a difficult balance to strike, and I am sure it’s different for every company and every product.

But I think my advice to founders would be that we always underestimate our ability to create reality. Yes, it matters what you do, but it matters no less that you do. Just showing up with a product that solves only a small part of what a customer thinks she wants already gives you a good shot at, if not convincing her, then at least learning a great deal about her true problems.

We always underestimate our ability to create reality. That you do matters no less than what you do.

We tend to forget that our products can shape reality, and, of course, the “lean startup” worldview has tried to argue against that, but I think the situation today is different from 10 years ago when you could sell just on the idea of software. I think the problem we face today, particularly in enterprise sales, is that there is already too much software in the world. Today, nobody will even talk to you until you are production-ready, until you have something they can use. So, you have to build first and then they’re going to be looking at it.

This obviously makes it more difficult for startups to sell into large organizations. The rest is just personal network. You know people in organizations; you talk to them, you interview them, you get early access to a project, and so forth. Not everything works out, often because you’re not yet ready, production-ready—to my prior point. Early enterprise sales is a mixture of these things.

But I would say doing something is incredibly powerful, and we tend to underestimate that. It won’t shield you from the risk that you may just “build something, and they won’t come.” But you have to do something. If you create a new category in the marketplace, you can’t simply hope to go risk-free and iterate with early design partners.

Breaking through “software fatigue”

Matt: I agree. There’s a concept I’ve been putting out there: software fatigue. You know, everybody is so overwhelmed with how much software is out there and the hundreds of different applications that enterprise companies are dealing with. It’s just really difficult as a buyer to identify when I should open my ears and eyes and look at a solution that could be the answer. They’ve got a solution. Somebody else has a solution. So the software fatigue is real. How have you guys been able to differentiate and get your buyers or potential buyers to identify that this is a solution that they really need to look out for and consider bringing in?

Tobias: I think, and particularly in the early days, a large factor is luck. You need to stumble across a champion who’s been looking out for you already in some form, right? It’s impossible to convince anybody out of the gate. They need to be looking for something like you already. So it’s a little bit of a game of talking to enough people to increase your chances of finding that and then learning from these early conversions. And my personal sense is that that balance changes every quarter. Not just because you grow as a company, but also because people operate differently. How people and potential customers consume things changes all the time and dramatically so. So that balance is something that needs to be observed very carefully. I don’t have a recipe for that, and I think nobody does.

What’s next?

Matt: I think you have to understand how to balance the priorities within your own software company, what you’re trying to do. As a founder and leader, that’s what you’re trying to do all the time. Sadly, speaking of time, we have run out of it. I really appreciate all of this knowledge because it’s helped me understand a little bit more. How shall our audience learn more about you and Glasnostic?

Tobias: Visit Glasnostic.com. We also have a pretty extensive blog with use cases and case studies and talk a lot about operations and how active runtime control makes your large-scale cloud operations resilient.