Apollo lunar landing module blueprint and in-flight photo
(Source: nasa.gov)

From Development to Mission Control

Recently I was a guest on Alan Shimel’s “Digital Anarchist” show over at TechStrong TV, where we talked about the conflicting dev and ops concerns of DevOps, why architecture is largely obsolete in today’s cloud world and why Mission Control is the future of DevOps. Below is a transcript of the episode, edited for clarity.

• • •

Alan: Hey, everyone, thanks for joining us on this segment of TechStrong TV. I’ve got a new company to introduce to you and their founder, Tobias Kunze. Tobias, welcome to TechStrong TV!

Tobias: Thanks for having me, it’s a pleasure.

Alan: So, the name of the company is Glasnostic, which means “transparency.” What a great name, what a great meaning for a company! Tobias, give our viewers a little bit of background, both on you and on Glasnostic.

Dev vs. Ops is Nature vs. Nurture

Tobias: Absolutely. I was a Platform-as-a-Service guy. I was the tech co-founder of the company that became Red Hat OpenShift and, as such, fully focused on how to optimally support the development of applications. But as we were focusing on that, I quickly discovered that making applications truly successful is less about the engineering side, how you write it, the coding part: it’s about how you run it. The best code is worth nothing if I can’t control how it behaves in production, in the “real world.” It’s kind of a nature-versus-nurture debate, right? As developers, we think the value is in the code. But it’s actually not. Our code needs to go to kindergarten, and it needs to go to school. It has all these stages to go through. That is, it needs to be nurtured. That’s the operational side, and that’s what we focus on at Glasnostic.

Alan: Fair enough! Don’t tell my developer friends that, they’d be crushed! But it’s funny you bring this up: when I first got into DevOps 8 or 9 years ago, there was this tug-of-war about whether DevOps was too dev-focused. Are we too focused on developing and testing, and then, after deployment, developing and testing more? Or was DevOps too ops-focused? Are we too focused on deployment and application performance management and all those traditional ops activities, when, really, DevOps was supposed to join dev and ops, to break down that wall between them? And we’ve made progress, I’m not going to deny it. But one of the things I think we’ve learned, Tobias, is that on both sides of this equation, all around, there’s so much more we could do better! We could do development better, and God knows we could do ops better. And some of the big trends we’ve seen are “AIOps,” the idea of using machine learning to improve ops, and, more recently, “observability.” Why don’t you tell us which ops-side topics you focus on?

The Two Axes of DevOps

Tobias: The framework I like to use when thinking about the DevOps space, because it makes things so very clear, has two axes. The horizontal X-axis is where I care about my thread of execution. That is the “happy path” I think about when I start coding, when I think in terms of functions that later form a thread of execution across systems. That’s where I care about tracing, that’s where I care about compound latencies for my specific requests, that’s where I care about errors, status codes, etc. It’s all about how my individual request is being served. That’s the developer perspective. All developer concerns are horizontal, X-axis concerns. Requests travel all the way from the left to the right, then come back and return my result, which is what I started this request for in the first place.

At the same time, we also need to make sure that everything works together. That’s just a matter of scale and complexity we are operating at today, with everything requiring software and software being increasingly hyper-connected. There isn’t a standalone application anymore. We need to make sure that these hundreds of services we’re running don’t bring each other down. I’m not talking about the 10-person San Francisco startup here. A startup has one product and, thus, only one application. I’m talking about enterprises, large organizations with generations of systems and integrations, where you keep on building new value on top of existing value that you already have. That’s the vertical, the Y-axis. Or, if you prefer the musical analogy, performers care about the X-axis, their individual parts, while the conductor cares about the Y-axis, the full score. Totally different concerns. At Glasnostic, we care about the score.

Alan: Excellent. And those are great analogies. The music one got it for me, but hey, that’s me! So, give us a little background on the company, though. The product is out, and people are using it. How is it offered? Is it a SaaS kind of thing, or is it something that just gets installed on your own stack?

SaaS or On-Prem?

Tobias: We do both. We tend to run pilots with our control plane delivered as SaaS, but enterprise installations are predominantly on-premises. Of course, “on-premises” doesn’t mean “in customers’ buildings.”

Alan: Yeah, not in their server closet, but it’s installed in their environment, it’s not delivered as a traditional SaaS, you know, from a third party where things run in a multi-tenant environment.

Now, you mentioned earlier that Glasnostic is probably overkill if you are a startup with just one app. What is the right scale where Glasnostic starts to make sense for organizations?

Tobias: We run pilots starting with around 25 endpoints. However, we prefer around 50. There needs to be some amount of complexity for us to show value. Again, if you only have three people playing, you don’t need a conductor, right?

Alan: It’s just three people playing!

Architecture Becomes Ops

Tobias: Exactly. They can look at each other. And frankly, if it’s a smaller set of applications, if you have one Git repo, there’s no need for us. You’re going to build this like you’ve always built applications, you own all the code, you build a deterministic application, you may want to trace through it if you are not sure how it works. Where things get dicey is when you have 20 or 200 teams, different business units, joint venture partners—that level of complexity—all of which deploy independently from each other. Now you have a landscape of applications and services. It is a hyper-connected, continually evolving, and continually changing digital landscape. There is no more architecture!

Architecture is a fossil of the waterfall era.

I think the key insight we need to get to as a community is that architecture is an atavism. When you think about it, architecture is the last waterfall activity we do. We live under the illusion that we need architecture because we believe that certain decisions are too expensive to change later on. But this makes zero sense in a world where everything is continuously changing, which is really just a result of the complexity and scale we’re running these days. So, architecture has become this leftover waterfall thing that I’m doing under complete uncertainty. I don’t know what things will look like in production next week! So, why do I spend all this time making plans and decisions (expensive plans and decisions) that I know will be immaterial once things are running?

Glasnostic Mission Control UI
Glasnostic lets operations and security teams control how systems interact, in real time and without YAML.

Mission Control Changes Everything

With Glasnostic, you can make at runtime all those decisions that are expensive and fallible when made up-front, because we give you control over how your systems behave. We give you Mission Control.

And it’s not just architectural decisions. If you have runtime mission control, you don’t need staging. You don’t have to run extensive systemic tests, which are very unlikely to be successful these days anyway. You simply deploy to production, put a quarantine around the deployment (or whatever control primitive you want to use) and slowly release it “into the sea.” Once you have runtime control, the majority of deployment risks simply go away. You can work totally differently because the dreaded “IT Ops Roadblock” disappears.
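(Editor’s sketch: one generic way to “slowly release” a new deployment is to route a growing percentage of requests to it. The Python below is a minimal illustration under stated assumptions, not Glasnostic’s quarantine primitive. The names `rollout_bucket`, `route` and `exposure_percent` are hypothetical and do not come from the interview.)

```python
import hashlib


def rollout_bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in the range 0..99.

    Hashing keeps the assignment deterministic, so the same request ID
    always lands in the same bucket. (Hypothetical helper.)
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(request_id: str, exposure_percent: int) -> str:
    """Return "new" for roughly exposure_percent of request IDs, else "stable".

    Raising exposure_percent only ever moves requests from "stable" to
    "new", never back, because each request's bucket is fixed.
    """
    if not 0 <= exposure_percent <= 100:
        raise ValueError("exposure_percent must be between 0 and 100")
    return "new" if rollout_bucket(request_id) < exposure_percent else "stable"
```

At `exposure_percent=0` the new deployment receives no traffic at all, which is the closest this sketch gets to a quarantine. An operator would then raise the value step by step while watching how the rest of the landscape reacts.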

Without runtime control, everybody struggles to contain the cloud chaos. We run way too many services, we don’t know what to do about all these changes that are coming down our CD pipelines. Or, every 5 minutes, a developer asks us, “Hey, can you deploy this for us, please?” And we go “No, I need to test this first. And, by the way, there’s a 6-month queue for getting tested.” All this happens merely because, once something is deployed, we can no longer control it. Once something is released, there’s nothing we can do. Except for shutting the machine down…

Alan: …and that’s like saying if you want security, just unplug from the internet. Let me know how that works for you!

COVID and Digital Transformation

I want to ask you what we ask most people who come on the show: how have recent events with COVID-19 and so forth affected business? What are you seeing short-term? What do you think we’ll see long-term? You know, a lot of people say, “Look, as terrible as it sounds, it’s been a good thing for the cloud because it’s showing people they got to move, they got to adopt microservices, DevOps, cloud-native, cloud.” I am wondering what you’re seeing?

Tobias: I think, ultimately, it’s been good for the industry. “Good” in the sense of hitting a broken TV and suddenly it works again. A lot of change has happened, particularly in the cloud space. A lot of workloads have been moved from on-premises to managed service providers, to cloud providers. Ultimately I think Satya Nadella said it right: we’ve squeezed two years’ worth of digital transformation into two months (and counting). And that’s the new reality that every executive out there is feeling now. So, the question is no longer: “How do we do digital transformation and what is it in the first place,” right? It used to be a very abstract concept, but it’s very real now. Now the question is: “How can we keep up the momentum?” That plays exactly into what we are doing, and we’ve seen heightened interest in the past six weeks. Operations leaders realize that, even if they can maybe control what they are running right now, they don’t know how they are going to do this 3 months from now. They know they’ll have 5 times as many moving pieces then.

Alan: Yeah, we had this in our own team here at MediaOps. I’ve tried to explain to people that it’s like in running (and I was never a runner, as you can tell from my body shape). There are sprinters, and there are long-distance runners. A sprinter cannot keep up that pace forever, and that’s where the long-distance runners come in, who can do that marathon at an 8-minute-mile clip or whatever a great marathon runner does. I think that’s gonna be an issue here: we can’t keep sprinting in a long-distance race. At some point, you need to transition to a sustainable pace, to a sustainable cadence, to sustainable processes and to the things you are talking about. There was this inflection point (I don’t want to say crisis) where two years of transformation got squeezed into two months, and it’s like forcing it through the eye of a needle. We were able to force that through, but I think we’re gonna come to another reckoning of how do we sustain, how do we normalize it? It’s gonna be an interesting time!

Close-up of air traffic control screen
Taking control of the cloud chaos requires operations and security teams to focus on universal, "golden" signals that capture the essence of the environment.

Operate Like a Flight Director

Tobias: Exactly. It is a massive operational crisis. It’s not a development crisis, it’s not a technological crisis, we have all the bits and pieces we need. It’s about: how do we make sure everything works together, all the time, while it is also changing all the time?

As a result, ops is becoming more and more like air traffic control, where we are not concerned with what movie is playing on a plane but with making sure the airspace works. Think back to the beginning of COVID in the US, when Robinhood went down for three days (or rather, a day and then a couple more outages over the following two days) because their DNS services got overloaded, as they claimed. They had zero ability to push back, to exert at least classic backpressure against their “thundering herd.” How can you run something like this without a basic capability to exert some runtime control?
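(Editor’s sketch: “classic backpressure” can be as simple as refusing new work once too many requests are already in flight, so that a surge is turned away early instead of overloading everything behind it. The Python below is a minimal illustration under stated assumptions. `AdmissionController`, `handle` and the 503-style status tuple are hypothetical names chosen for this example, not anything described in the interview.)

```python
import threading
from typing import Callable, Optional, Tuple


class AdmissionController:
    """Admit a request only while fewer than max_in_flight are running."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Claim a slot if one is free; never block the caller."""
        with self._lock:
            if self._in_flight >= self._max:
                return False  # push back instead of queueing without bound
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Give a slot back once the request has finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(
    controller: AdmissionController, work: Callable[[], str]
) -> Tuple[int, Optional[str]]:
    """Run work() if capacity allows; otherwise answer with a 503-style code."""
    if not controller.try_acquire():
        return (503, None)
    try:
        return (200, work())
    finally:
        controller.release()
```

With a limit of two, a third concurrent request is rejected immediately rather than piling up. Callers that receive the 503-style answer would be expected to back off and retry later, which is what spreads a thundering herd out over time.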

Move Fast and Control Things

These systems are highly unpredictable, and to believe that we can “engineer” our way out of this unpredictability is a grave mistake. We all live in a world now where we essentially stitch a lot of components together. It’s not distributed-systems engineering; that would be way too expensive, slow and brittle. It’s rapid composition of capabilities. And all these components are finite, they are limited in many dimensions, and we are largely oblivious to these limitations. And now, at scale and in production, these limits are being hit all the time, and the system becomes very unpredictable. It’s like in multi-body physics, where behaviors get chaotic very quickly as soon as three bodies are involved. As a result, these systems behave in a highly nonlinear way. What may have started as a request limit over there is now a CPU limit here and becomes a latency issue over there before it causes a thundering herd at some other place, and so forth. And monitoring is not helping at all. Monitoring is a local technology. It’s something where I want to know how many sockets are in CLOSE_WAIT on that particular host, what’s the garbage collection here, the heap size there, and so forth. All of that is totally immaterial for the questions at hand. If we want to really control what’s going on, if we want to keep the digital landscape stable, we need to have the ability to apply operational patterns. We need mission control.

Alan: Because the other thing to remember is they don’t exist in silos, they don’t exist in their own private vacuum tube. They are all interdependent. It’s domino theory: when one messes up, it gets messy across wide areas, and there are repercussions throughout the system. And it becomes very hard to isolate what the issue is.

You know, Tobias, we’re about out of time and what I realized is, we didn’t mention for our viewers what the website is.

Tobias: It’s Glasnostic.com. Like Gorbachev’s “Glasnost and Perestroika” back in the day, with “glasnost” meaning “openness” and “transparency.” We make system behaviors transparent so you can build more, faster and with confidence.

Alan: Fantastic! Thanks for joining us, thanks for introducing the company. I’d like to hear more, maybe we have you back on in a couple weeks? We’ll continue the conversation!

Tobias: Anytime!