Earlier this month, I had the pleasure of talking to Tyler Bird and Jeff Groman of the “Adventures in DevOps” podcast over at DevChat.tv about how we can control today’s rapidly evolving digital landscapes. In this episode, we touched on a wide range of topics, including:
- Service meshes and their appeal for developers
- The Ops Problem
- The death of the “application”
- Organic architecture
- The X and Y Axes of DevOps
- Why it is the environment that matters today, not the code
- The spectrum of dependence
- Why we need software-defined interaction control
- The “ops” and the “sec” roadblocks
- The importance of runtime control
- Why observability is a distraction
- Why composing “Lego blocks” is the way forward
- Our post-distributed-systems world
- How operations supersedes architecture
- The four stages of dealing with cloud chaos
- Robinhood’s outage
- The paradigm shift of Mission Control
- Why organizations should rank operations teams above development
Below is the transcript of the episode, edited for clarity and content.
Tyler: Hello everyone, and welcome to another episode of Adventures in DevOps. I am Tyler Bird, and I am your host today. I am joined here by Jeffrey Groman. How are you doing today, Jeffrey?
Jeffrey: I’m great! I know Tobias was just saying it’s really hot there. For us, we had a really hot day yesterday, and today it’s really humid but, you know, as long as the air conditioner is working and my office is in the basement of the house, that sort of stays cool. But yeah, things are good here!
Tyler: That’s good to hear. So it’ll be Jeffrey and me today. We’re joined by our guest Tobias Kunze, who co-founded the company that became Red Hat OpenShift and is now running another company that you might not have heard of yet, but that you’ll hear of today: Glasnostic. And I was looking at it earlier to prepare, and it seems really interesting. So, I’m super excited about our guest today. Tobias, how are you doing, and how are things going over at Glasnostic?
Tobias: Doing well, thanks! It’s pretty hot outside and also in the marketplace…
Tyler: Good, good! Well, let’s just get into it. What we’re talking about today is “controlling the digital landscape” and what that means in a kind of a post-microarchitecture world. We’ve all decided that breaking things into microservices is a good idea, that everything pretty much is now a service. But what does that mean for operators and large-scale enterprises?
One of the things that I was first interested in was service mesh, so let’s just define for everyone what a service mesh is. I’ll give a possible definition, and I would love to hear your definition as well. Essentially, a service mesh is something that allows different applications to share data, metadata, or other information with each other.
What’s your definition of a service mesh?
Tobias: Yeah, the classical definition is always: it’s a dedicated infrastructure layer that takes care of intelligent routing, metrics, policies, and security in service-to-service communication. All those concerns that you don’t want to reinvent all the time, in every piece of code you write. These concerns should be centralized. They should be pulled out of the code, centralized somewhere, and made available in a separate infrastructure layer. And there’s absolute truth to that. It makes a lot of sense, in particular, from an engineering perspective. So, I would define a service mesh as the attempt to make service-to-service communication better, easier, to centralize certain aspects of that boilerplate code and the things you have to do all the time. Don’t worry about calling mechanics, for every single call, because, with microservices, you have way more network calls. What used to be function calls are now network calls to a large extent.
So, actually, it’s interesting because, when I left Red Hat, I thought I would build something that, today, you’d call a service mesh. I took precisely that perspective: we’re developers, we live in our IDE, so I want to look outwards from my IDE and just call this service because I need that return, that result code, that data back and I don’t care which instance I am calling. So, someone please do it for me, connect me with something, and connect me with the best instance. By the way, while you’re doing this, you might as well meter or encrypt the call, so I don’t have to deal with that, too. Again, these are concerns that I really don’t want to have to deal with all the time as a developer.
So, service mesh makes perfect sense because I get shielded from all that network stuff, which can get difficult and tiresome. Retries, what happened to that call, how long am I supposed to wait, timeouts, etc. All this stuff gets complicated quickly. But when I started building something like a service mesh—that was the time when Finagle came out of Twitter—I realized very soon that the real problem is not how we write the code: it’s how we operate everything. Because, as you can imagine, everything gets disintegrated, everything gets composed more and more. Because we’re not writing applications anymore, we are writing services! At that point, the question is: how can I still ensure that everything remains stable and secure? And that’s not a developer concern. I can’t code in my IDE that this or that call will be stable and secure at all times. There are certain aspects of that, yes, but other aspects depend on how it’s being operated at runtime.
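To make the "retries, timeouts" burden concrete, here is a minimal sketch of the boilerplate a developer would otherwise repeat around every network call, and which a service mesh aims to pull out of the code. The names `call_with_retries` and `CallFailed`, the three attempts, and the exponential backoff are illustrative assumptions, not the API of any particular mesh.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CallFailed(Exception):
    """Raised when every attempt has failed (hypothetical helper type)."""


def call_with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient network errors with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return call()
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Back off before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise CallFailed(f"gave up after {attempts} attempts") from last_error
```

Every caller of every service needs some version of this logic, which is why centralizing it in an infrastructure layer is attractive. Note that it only covers the calling mechanics: nothing in it says whether the landscape as a whole stays stable at runtime.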
We are not writing applications anymore.
So, we are focusing entirely on that runtime side of things. And where the runtime control becomes important or, I would say most important, is when you’re going past the single application. If you’re a San Francisco startup with 10 people, you have one product, you have precisely one application. You’re going to build this precisely the same way we built applications 50 years ago, just with different technology. But, essentially, the logic is all in one repository, and it’s under one ownership. Enterprises don’t work that way. They are much more complicated. You are looking at generations of integrations, and the people who wrote the code that generates their money may not even be alive anymore.
Still, everything builds on top of that, well, “heritage” world. You have layers and layers of code, layers of systems, crazy integrations, no documentation, and everything needs to be continually maintained and updated. It’s like a city you look at from a bird’s eye perspective. Everything changes all the time, there are 500 cranes in New York City building stuff at any given time. That’s how the enterprise works. That’s fundamentally different from the small-application startup.
Tyler: Yeah, and day-to-day, you wouldn’t necessarily see the change, but then you look at a time-lapse, and you realize that things have changed quite a bit over time. And so, part of the feeling here is getting that different perspective of what seems not changing is actually changing a lot.
Tobias: Absolutely. It is constant organic growth. And when I say “organic,” it used to be pejorative, right? When someone said, “Our architecture is ’historically grown’,” it always meant the architecture was riddled with issues. But actually, it can mean a very positive thing! An organically grown architecture can mean an architecture that can readily adapt to whatever the business needs at any given time. And it can mean an operations team that can deal with any issues, as they arise. By the way, that’s the only way we build anything large. Anything we do grows organically. We start with small experiments, we expand them, then we put more things together, we outsource other things—it’s always a complex project. So, organic growth is the key ingredient in today’s infrastructures, in today’s landscapes.
Tyler: I like that! And I come from a PaaS background myself. I started at Engine Yard, which I imagine you’ve heard of.
For those who might not have heard of it, it was essentially a competitor to Heroku, which is a developer’s playground for rapidly prototyping apps. And one of the things you wrote about was that OpenShift had the “soul” of Heroku, but for the enterprise. So, coming from that same platform background, from Engine Yard and Cloud Foundry, I started reading about Glasnostic and was like, “I should have more time to prepare!” Because I’ve just been preparing today, and I probably have a billion questions for you. So maybe we can continue this online or do a second episode. But, yes, I think platforms are underrated in the sense that, just like most operational systems, when they just work, they’re invisible. So what visibility do we get from Glasnostic that makes fully operational things fully visible?
Tobias: Yeah, so that’s another really great and important question that is the key difference between what we do and what almost everybody else in the Cloud Native ecosystem is doing. So, I’m a simple guy, I like to see the world in simple terms. Think of it as having two axes. On the horizontal axis, we have all our developer concerns: we build code, we care about requests calling other services. It’s a thread-of-execution view, and typically that’s where we want to have tracing, that’s where we want to debug things, that’s where we care about how much latency there is, what the cumulative latency is, what the error codes are, and so forth. That is totally important at the local level, as in “writing code.” I want to see how my code fares in production, how it executes, how it is utilized, and these kinds of things. Again, as a developer, I am growing the understanding of my code organically. I’m starting with the “happy path,” then adding more and more conditions, making it more and more robust, and so forth. That’s the natural way of writing code. And then I want to see it execute. That’s the left-to-right-and-all-the-way-back perspective of transactions ricocheting through the architecture. And that’s how many of us think about operating things.
That model, of course, falls flat on its face if I have many teams deploying in parallel and if more and more services are deployed and change at an ever-faster rate. Going back to the New York example: things sprout all over the place and change all over the place. Because none of today’s applications are islands, right, they are all connected! So, we live in a world of constant change, with thousands of deployments across the enterprise over the course of a day (the exact number is not all that important, but there are many, many deployments), and every change introduced has the potential of causing an outage. In that situation, the vertical axis matters. The vertical axis is about how everything affects each other: noisy neighbors, ripple effects, compounding failures, you name it. These are all the environmental factors that threaten to bring the application landscape down.
And those are infinitely more important than fixing bugs in your code. Ultimately, fixing a bug in code is very easy. There are lots of tools. What’s really difficult is control, the management task: how can 50, 100 or 500 applications work together yet remain independent of each other, even though many things are shared. So, anytime we have shared services or shared execution environments, the operational task becomes crazy complex. That’s precisely what we’re tackling: making these dependencies visible and, more importantly, controllable.
Tyler: Interesting. Jeffrey, do you want to jump in here?
Jeffrey: Yeah, I’ve sort of been just listening. My background is purely on the security side and, having worked with a lot of development teams over the years, I think one of the challenges that we have from the pure security perspective is getting in front of a body of code and trying to understand it from a threat modeling perspective. What do I need to be concerned about? Where are the real threats? And from a tooling perspective, if we’re using our standard toolset, like static tooling, our concerns are always: can we find possible security defects? How do we analyze that? How do we get our arms around that? And we typically do this by scanning web applications dynamically.
Now, as you were speaking, Tobias, I was thinking that, clearly, from a security standpoint, this product is the nuts and bolts and something we want to get our arms around. And yet, as you said, it’s continually changing and that certainly begs the question about how do we do this continuously, not just at some point in time? Because, unfortunately, we see it too often that security gets injected at specific milestones, like “we’re doing it annually,” and thus it just becomes a calendar event as opposed to something that’s intelligently scheduled.
Sorry, that was a colossal preamble, but I’m sort of wondering how you envision—whether the security team in this is part of your dev team or whether it’s apart from them, regardless—how are we tooling, instrumenting and trying to detect where those issues could be, even within the product?
Tobias: Yeah. I love that question. I love it because in the field you typically find people concerned with security and people concerned with performance and they don’t understand each other. I see it as a spectrum of concerns that depend on things we don’t own but, precisely because we don’t own them, need to control. We all live in a world today where we need to be able to control things we don’t own. Take, for instance, security. We very clearly don’t own any of the code we are looking at, but we’re responsible for making sure it’s used properly. Towards the other end of the spectrum, as you get away from security, the next step is availability. Then you get to general stability and then to performance. And the common thread here is that, to properly tend to any of these concerns, you need to shield yourself from external influences, from your dependencies. You need an ability to control what you don’t own. For instance, in a decentralized scenario, where my code depends on 20 or 50 other services, my performance is 100 percent dependent on how these other services perform. If I can’t control how these services affect my performance, I am not in control of my destiny.
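A quick back-of-the-envelope calculation shows how strongly a service depends on what it does not own. The sketch below assumes, purely for illustration, that every request needs all dependencies, that they fail independently, and that each one is available 99.9% of the time. None of these figures come from the conversation.

```python
def composite_availability(per_service: float, dependencies: int) -> float:
    """Availability of a service that needs every one of its dependencies.

    Assumes independent failures and that a single unavailable dependency
    fails the whole request (a deliberately pessimistic simplification).
    """
    return per_service ** dependencies


if __name__ == "__main__":
    for n in (1, 20, 50):
        print(f"{n:>2} dependencies: {composite_availability(0.999, n):.4f}")
```

Under these assumptions, 20 dependencies leave roughly 98% availability and 50 leave roughly 95%, even though each individual service looks healthy on its own dashboard.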
So, how do I govern that stability and performance side of the spectrum? You govern it the same way you govern at the security end of the spectrum. Not at the code vulnerability level (security is a vast field), but at the level of the software-defined perimeter that the security community is talking about. In today’s decentralized architectures, the ability to define perimeters in software should be table stakes, but few organizations have that ability. The Cloud Security Alliance just came out with a paper on software-defined perimeters and Zero Trust that contains great best-practice recommendations, and we are working with a governmental security agency on precisely that topic because it is so fundamental. On the stability side of the spectrum, the idea of a software-defined perimeter becomes the idea of software-defined interaction control.
Going back to what you said about security teams being often separated from the deployment or operation teams: that’s, tragically, true. These teams have fundamentally different skill sets, and, too often, there is no understanding between them. So, many enterprises end up with not one, but two roadblocks. In the industry today, we’re mostly working on accelerating development, that is, on the “left side” of the process. And with shift-left testing, everything is sped up on the left side. So, now you have more things traveling ever faster down the deployment pipelines until they hit the IT roadblock. The IT roadblock is when the ops team says “No, you can’t deploy this, we need to test it first and, by the way, there is a whole queue of stuff we have to test that’s six months long, so, unless you have executive sponsorship, go to the back of the line.” And then, when it is finally tested, the security team steps in, creating a second roadblock.
Both of these roadblocks could be eliminated if we could just de-risk the changes to production that these deployments introduce. And the only way to de-risk these changes is by having runtime control. If I can deploy something and I get to see how it behaves within the first couple of seconds, and I can do something about it right away, I can deploy it. If I can’t do this, I need to go back and test it for six months. That’s why it’s so important to not just accelerate development and Day 1 operations with DevOps. For the architectures we are running today, DevOps is way too slow. We can’t afford to plan, build, test, package, deploy, operate and monitor—and then go back to the planning phase with whatever bugs we’ve discovered and now wait for a patch to be deployed that hopefully fixes the issue. We can’t afford to wait on a patch. We need to have the ability to operate in real-time, at runtime. We need a ninth phase in the DevOps lifecycle, which is the phase of real-time Mission Control.
That’s what Glasnostic is all about: Mission Control.
Tyler: Awesome, and I recommend everyone go to Glasnostic and book a demo. That’s what I’ll be doing next to get more into it. The question regarding that is, what can Glasnostic not do? I mean, have you encountered any legacy app that has no API or other interface that you guys can’t control? Because the claim on the website is that you guys are agentless, and we don’t have to install anything. So, is there anything you haven’t been able to add to your “select star all” and get all the results?
Tobias: Yeah. We don’t control how cell phones talk to their tower, that’s the area we haven’t gotten into so far. What we can control, though, is anything from mainframes to serverless functions. We can see and control anything that presents itself as an IP endpoint. And the way we do that is by being a bump in the wire. In other words, we are agentless, we do not need to insert ourselves into the deployment or even into the workload, because, ultimately, we are not a monitoring solution, we don’t need agents.
With today’s digital landscapes, monitoring is an entirely local concern. It concerns the code I write. That’s my ownership, and that’s my horizon of responsibility. Yes, I want to debug, yes, I want to see exactly what the value of this or that variable was at whatever time and probably the stack at that time as well. So, as a developer, I always want in-depth data. But, as operators, and at least since Nagios arrived on the scene, we know that looking at 50 differently colored timelines is meaningless. It tells us nothing. Even worse, it adds to the noise. When was the last time you cared about how many sockets are in CLOSE_WAIT? I don’t even remember! This stuff is completely irrelevant in today’s complex cloud architectures. Google started a similar revolution in the 2000s when they began using commodity hardware and just had a forklift drive around the data center once a week to pull out the dead machines. It is a result of everything slowly moving up the stack.
Monitoring is a local concern.
We live in an automation-rich world. There’s so much tooling, so much technology, so many languages, frameworks, and cloud services. Everything has turned into Lego blocks, which we are mostly just stitching together. So the stack gets higher and higher, and every Lego has its own automation. And we need to remind ourselves that the point of automation is not to not have to do something “manually” anymore. Rather, the point of automation is to enable a new vocabulary. We’re automating our deployments so we can treat them as a unit, as a new noun in our language, which we couldn’t do if the process were manual. We automate so we can work with new verbs and nouns, so we get a new Lego block and so forth, all the way up the stack.
So, to close the circle here, the monitoring piece needs to move up the stack to where the action is, where our new language operates. We need to talk about forests and stop obsessing about the trees, even though trees are what we grew up obsessing about. And because that’s how we grew up building systems, our gut reaction is always, “oh, let’s look at file descriptors!”
Our analogy is always: we care about the airspace. If you have many flights going on—not right now, because of COVID—but if you have many flights going on, no matter how sophisticated your flight planning is, you need air traffic control. And, air traffic control doesn’t care about what kind of food is served on the plane or what movie is playing, or whether some passengers are freaking out, or even how much fuel is on the plane. Air traffic control cares about its golden signals only: position, altitude, direction, speed. Of every single object in its airspace, plus weather data. Their concern is the safety and stability of the airspace, not an individual flight operation.
That’s the level of Mission Control. APM, tracing, and “observability” are not Mission Control.
So, we are building these organically growing, decentralized architectures, these digital landscapes, where we essentially stitch Legos together, because everything the business does requires software. The business comes up with ideas in the morning and asks, rightfully, why can’t I have the software run in the afternoon? I already have 90% of what I need, so I can build on top of it. All I need to do is to sprinkle some pixie dust on top, right? Why can’t I have this? It’s because of the operational roadblock. It’s an operational problem, not a software engineering problem. So, you need to take the global, systemic view, look at golden signals, and then have the ability to apply operational primitives in real-time. That’s the whole recipe for success.
Jeffrey: In a previous session, we were talking about “The Unicorn Project,” the book by Gene Kim, and it’s interesting because the basic premise there is that we’re really looking at how to enable an organization to be agile. Which is sort of moving away and not thinking about the operations perspective, but there is so much more that plays into it when we look at it more holistically, from a cultural perspective, right? How do we get an organization to be a bit more agile? How do we get an organization to be more focused on core versus context, which is a fascinating concept, I think, that’s been brought up in the book. Again, I’m not a platform guy, so I may be totally off-base here, but it almost feels from what you are describing, this way of enabling your organization, at least from the technology and platform side, that you enable organizations to get all their pieces in place and be more agile simply by focusing on what’s more important and getting out of the way of the stuff that’s not.
Tobias: Yes, absolutely. It’s all about accelerating MTTV, mean time to value, and every acceleration you manage to achieve massively benefits your company. Because if everything is difficult, it takes a week to implement a feature. If my code is so complex that I need four hours of maximum concentration a day to make sure I’m not breaking anything, I can’t be agile. And there are knock-on effects: everything slows down now, because “This change request? Oh, it’s going to take me two weeks!” Because now, first, I need to swap all this thinking back in, which takes me a week. Then I need eight hours of uninterrupted work, and so forth. The same thing happens on the operational side. Anything complex kills you.
I am a big fan of the notion that we’re living in a post-distributed-systems world. We grew up thinking that distributed systems are the black belt of software engineering, but they’re not. They’re cancer. There is no need for distributed systems almost anywhere. Yes, some niche applications require distributed-systems primitives, but, frankly, those primitives have been solved for a long time. Just grab something that provides them.
If you run a business and notice your people building a distributed system, all alarm bells should go off. Because it means essentially that you’re paying your teams to build a Ferrari engine where everything is hand-machined for performance. And yes, the engine will probably be awesome once it is built, tested, and refined—but not only have you spent an insane amount of money and time, you now also have to pay a mechanic to rebuild the engine every 500 miles because of the stress on it. It is so optimized for performance, it’s gonna break all the time. That’s a killer for your business.
There are no distributed systems in nature.
So, you need to be much more nimble. And nimble means you need to compose, stitch things together, work with Legos. You need to realize that there are no distributed systems in nature—for a reason! Everything is reactive, “evented.” You need to build compensation strategies in your code, or you are going to die—very soon. You need to start out with a flexible and resilient code base that you can compose organically with other flexible and resilient components to build up your digital landscape. That’s how you gain speed. Only that. So, treat everything like a Lego block, across the entire organization. I think Amazon spearheaded that, Jeff Bezos with his mandate that every meeting requires a memo so others can build on it, that every application has an API so others can build on it. In other words, everything becomes a Lego block.
And, by extension, this also means that I can no longer architect. Providing an API means, by definition, that I don’t know who’s going to call me tomorrow, and that means, again by definition, that I can no longer architect.
Architecture is an atavism. It is the last remaining waterfall activity we do.
We are not recognizing this. We’re not acknowledging that architecture is the last waterfall thing we do and that it should be done not up-front but at runtime. Because, if you can’t predict how your blueprint will look a week from now, you need to deal with it at runtime, when it is happening. So, the notion that we can build a system, plan it, blueprint it, build it, and then it’s going to run for any amount of years and at any scale is, of course, patently wrong.
Tyler: So, agility is super important, and I think we get a lot of agility from the layers that have been built so far, the proverbial shoulders that we stand upon. Do you think Glasnostic could have existed before this post-distributed-systems revolution? Would it even have emerged as a problem?
Tobias: No, I think this market is still emerging. Our customers are at the forefront of this trend, and there are plenty of organizations that still grapple with what I would call cloud complexity or cloud chaos.
There are several stages of dealing with cloud chaos. Not quite as many as there are stages of grief, but still several.
The first stage is denial. This is where you talk to a VP of Engineering and hear, “Yeah, there’s chaos, but it’s just because we have so many bugs. The previous guy didn’t know what he was doing. I’m gonna get my people to not write bugs anymore. I’m the new guy, and I’m gonna win.” Obviously, that guy is out six months later with the company in a worse place than before because, well, you can’t prevent chaos with code.
The second stage is bargaining. It’s another natural reaction, and it’s interesting for the two of us, Tyler, because that’s where platforms come in. That’s when you hear, “Yeah, there’s a lot of chaos, but we’ll fight it with re-platforming and standardization.” Of course, that’s a fallacy. It is “golden master” thinking. Like 20 years ago, when we thought, “Oh, virtual machines! We need golden masters!” So we released golden master images, and a minute later, a new package comes out, and the golden master is obsolete. The same thing happens when we build platforms. Technology moves so fast. As soon as the platform is built, a crucial piece of technology comes out, and of course, it is not in the platform, so immediately the platform is worth only half of what it was worth before, because now I have to install my own software before I can use it, and so forth. So, as promising as platforms and standardization appear, in practice, their value is limited because of the massive burden associated with their upkeep.
The third stage, then, is despair. We realize that re-platforming didn’t help or, if it did, that it helped only for a short amount of time and ultimately failed to solve anything. So what are we going to do now? That’s when we come to our customers. That’s where we pick them up and say, “Yes, there is chaos, but it’s normal, and you can deal with a lot more chaos if you learn how to control and manage it.” We help you control all that chaos, in real-time.
So, that’s the fourth stage, acceptance. And, mind you, chaos is always nonlinear. Something may start out as a CPU problem somewhere, but then quickly turns into a latency problem somewhere else, which then all of a sudden becomes an availability issue some other place before it turns into a retry storm at yet some other place. It’s chains of events, and most of these failures are like icebergs: most of it is under the surface, then, suddenly, a piece comes up and that throws alerts. And then you think that’s the issue, but it really isn’t, it’s a much bigger thing underneath. And that’s the issue. So we try to make these things visible, so you don’t have to stress out about them.
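The retry storm at the end of such a chain is a good example of why these failures grow nonlinearly. As a rough sketch, assume a hypothetical call chain in which every calling layer makes a fixed number of attempts and the deepest service fails every time. The function name and the numbers are illustrative only.

```python
def worst_case_attempts(attempts_per_layer: int, calling_layers: int) -> int:
    """Upper bound on calls hitting the failing service for ONE user request.

    Each calling layer multiplies the traffic below it, because every one of
    its own attempts triggers a full round of attempts in the layer beneath.
    """
    return attempts_per_layer ** calling_layers


if __name__ == "__main__":
    # Three layers, three attempts each: 3 * 3 * 3 calls for a single request.
    print(worst_case_attempts(attempts_per_layer=3, calling_layers=3))
```

With three attempts at each of three layers, one user request can turn into 27 calls against a service that is already struggling, which is how a local slowdown becomes a landscape-wide event.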
A couple of weeks ago, if you think back, right when the COVID-19 lockdown happened, Robinhood went down. The markets crashed, everybody tried to trade, but Robinhood was down. The official claim was that a “thundering herd” brought down their DNS. So, why are we not all wondering how such a thing could ever happen? If your landscape is so complex that you can get thundered, how can you not have something in place that lets you do something about it? Well, they couldn’t do anything about it and, as a result, they were down for a full day the first day, with further outages on the second and third day. They were unable to exert classic backpressure, they had zero ability to do that. If they had had something like Glasnostic in the network, they could have seen the surge of traffic within seconds and could have exerted a little bit of backpressure right away: “One at a time, everyone!” And they would have had no outage. Yes, slower service, but no outage.
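One common way to exert this kind of backpressure is a token bucket that admits requests at a bounded rate and turns the rest away until capacity returns. The sketch below is a generic illustration with hypothetical names and parameters. It is not a description of how Glasnostic or Robinhood implement anything.

```python
class TokenBucket:
    """Admit at most `burst` requests at once, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the last refill

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may proceed."""
        if now > self.last:
            # Refill in proportion to elapsed time, capped at the burst size.
            refill = (now - self.last) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
            self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller has to wait or retry later: slower, but no outage


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=2.0, burst=2)
    print([bucket.allow(now=0.0) for _ in range(5)])  # a surge of five requests
    print(bucket.allow(now=1.0))                      # one second later
```

In this run, the surge of five simultaneous requests gets two through and three deferred, and a request arriving one second later is admitted again. The time is passed in explicitly so the behavior is easy to reason about. A real deployment would read a monotonic clock instead.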
So, a shift in thinking towards runtime control needs to happen as we build out these digital landscapes.
And, mind you, every company will build out these digital landscapes. Because there’s no other way. Distributed systems are way too expensive, they take too long to get right, and they are too difficult to maintain.
Tyler: Eventually, you hit the complexity wall, as you’re saying, and the only way past the wall is through it. The Obstacle Is the Way is a great book that I recommend, and that will be one of my picks for today. It talks about how, when you hit a wall, you just have to go through it. So, yeah, I can really see how Glasnostic helps to do that. Now, when I say I see this, I want to dig in as an architect at an enterprise, and I want to do a proof of concept. Right now, it says “book a demo,” I don’t see a download button. So, what challenges have you had? What strengths or successes have you had from the demo wall versus an installable Helm chart kind of thing? Not getting into proprietary stuff, but giving them enough access to give them that first “aha” moment of what Glasnostic does?
Tobias: Yeah, we are very consumable, but a little bit of a mental shift needs to happen. The problem we’re tackling is not immediately apparent if you’re working in code, because your horizon is comparatively small and you probably look to observability, to tracing, you think that’s going to solve your issues—which, at a local level, it does very much—but it doesn’t help you at all with dependencies you don’t own. You don’t want to trace into that other business unit’s service because you don’t care about it. It is just a service! So there’s a little shift that needs to happen and we do this by training people on the product. That’s why we do not want to have a download at this point. There’s a little bit of training we want to do. It doesn’t cost anything, but we want to make sure our users understand why we do what we do and don’t come back and complain about “hey, where can I see my file descriptors.”
Tyler: A lot of times, companies deal with the sales-y part of it. Meaning, relying on the sales team to move people through the stages and the salesperson is just putting up a human interface to a script that leads to frustration on the customer side because the salesperson doesn’t fully understand the technology and will say “yeah we can do that for you.” The problem with that is that they don’t realize that this is basically saying, “Yeah, the hospital is down the road on the right,” but it’s actually on the left, and it’s essential to know the difference. So, I can see the advantage of what you’re saying, and I think it’s a good thing to give people that training. Given that Glasnostic targets enterprise users, I wonder, when you get people in the door, and they see what Glasnostic can do and how much it might cost, how many of these people are like “Yes, we have to do this” and how many are not? You don’t have to give specific numbers, but how many people just get it right away after they’ve gone through training and had their paradigm adjusted?
Tobias: It is an excellent question because the general space of what we do has a lot of people immediately interested. When these people then reach out to us, we often realize within a few minutes that they probably thought we’d be doing something different, like monitoring or tracing. And then, when they understand what we do, a typical reaction is, “Well, I was thinking about this or that project, but now I am thinking this could well apply to the entire company.” Of course, we don’t sell to the entire company right away. We want to start small and ensure that initial deployments are successful before we expand. Still, there is often the insight that other groups should use us, too.
Regarding downloadable software, in my previous company, we had thought about open-sourcing for a long time. And when we became Red Hat OpenShift, of course, with Red Hat being an open-source company, open-sourcing was always on our minds: when should we open-source? When can we possibly open-source? There was so much noise around the open-sourcing part, it impacted our development velocity. It’s similar for us now. We actually don’t have a license manager monitoring our binaries—we price by volume as SaaS or by environment size on-premises—and that model has worked well for us. We haven’t had any request for downloadable open-source software yet, frankly. My guess is that’s because the nature of open source has changed. Customers today typically don’t want to work under the hood anymore. They want to consume a solution, not gain a new hobby. Besides, it’s pretty advanced underneath, and I certainly don’t want to get out a wrench and do something there. So it’s kind of like “Take the images and if you need something done, talk to us.” That’s a pretty unobtrusive story.
Jeff: Tobias, I’m curious about what you’re describing, and this foreshadows my pick, Seat at the Table, which got me thinking: in your case, who is picking up the phone and calling you? Is it the classic case of the CIO/CTO who’s trying to make this decision for the company—my guess is that it’s probably not, but I’d really like to hear your point of view—or is it more of a Dev manager or even an Ops manager who is thinking, “Wow, the complexity is killing us,” or “The time to create a build is killing us,” or whatever it is that is killing them and “I need this for my little four corners. Maybe other teams can use this as well, but I know I need it for my project”?
Tobias: It’s a really interesting question because I would say that all our customers have one thing in common and that is that the technical side of the company is, to a large extent, headed up by leaders from the operations group. Yes, there are the CIOs, who immediately see the value we provide, because they think about these issues every day: MTTR and MTTV, “How are we going to run this explosion of services 3 months from now?” Sometimes it’s the CTO, sometimes the CIO, the role changes from company to company, but all our customers have a forward-thinking operations or security person right underneath the CIO or CTO. And they have learned that, once the business starts talking to its technology organization through the operations group instead of through development, which is what we all are used to, that’s when progress happens.
And by the nature of things—I am a developer, I don’t want to talk down the importance of development at all—as a developer, you’re a couple of miles below the ground, in a mine, so there is no easy way for you to get the full business context. You don’t know what version of “done” the business needs at which time and how your code is being exercised in production. You don’t even know who’ll be using it today or next week! So, our customers talk to their operations people. The operations people, in turn, know every developer by name, they know how each team codes, and they can give you an excellent estimate of when it’s going to be live and what needs to be done to keep it alive. That’s the interesting part: because running a decentralized digital landscape means that ever more development work is done at runtime, our users are putting operations above development.
Engineering management for digital landscapes is operations management.
So, I would say, if you run a digital landscape—and I believe everyone will run one in the future because it is the only way to support an agile business—the key step is to layer operations above development in the organization.
This organizational change is just a function of scale. Again, if you’re that small San Francisco startup I mentioned before, you’re creating your initial technology, and, of course, the technical founder is running everything. But as you scale the organization, the distance between idea and code execution gets too large. That’s when no single person understands how everything works anymore, so you need to devolve the code work a little bit further down. The more you break this up and deploy incrementally—when you run hundreds of deployments a day—the operations piece becomes the delivery piece. That’s why we see Delivery Excellence roles: they are called upon to engineer how everything works together.
Tyler: That’s really interesting, and I am thinking we need an extra episode because I have more questions than before, and we’re running out of time! I like what you’re saying, that the people who get it instantly are CIOs. Now, I instantly got it, but I’m not a CIO yet, so there must be some broken corollary there I need to work on! But, I like the idea of those forward-thinking people a layer or two down in the operations organization.
Jeff: Do you see your future, Tyler?
Tyler: Yes, I do see a lot of futures! I’ve been predicting this for a while now with some other people because I feel like a lot of times people try to install a pane of glass, but it’s not just about a pane of glass and being able to unify federated platforms that are all in different places. It’s about what you do with that information once you have that pane of glass. And that’s what excites me the most about Glasnostic, that it is not just a pane of glass, it lets you take control right there. I think we should go into greater detail in a second episode, so stay tuned for that.
Tyler: So, let’s transition into our “picks of the week.” What we usually do is we ask everyone what they’ve been working on or reading lately, books, software, that kind of thing. And for me, my pick this week is a book that I enjoyed called The Obstacle Is the Way, which talks about stoicism and just kind of surrendering into the obstacle that’s in front of you, the thing that you have to do, that you have to overcome. And I personally have the belief that, if I don’t feel resistance at a job, then I’m not growing. You have to have that resistance, and when you don’t feel that resistance, that’s when you know you need to find a new job. So it actually excites me, being able to modify me. If I get a little frustrated with what’s going on, that signals to me: “Okay, you have the proper amount of resistance right now.” And it’s actually made me more of a fan of security people myself, so The Obstacle Is the Way is my pick this week. What about you, Jeffrey?
Jeff: Yeah, as I mentioned earlier, A Seat at the Table is something I started reading. And it came about because we went through The Unicorn Project and what I found so valuable about the book, in general, is all the references. And even more so when you go to Gene Kim’s website where he blogs about just the references, the resources that drove him as he was writing the book. And A Seat at the Table is just one of probably a dozen or so references that he talks about, but I just think it’s so interesting getting Mark Schwartz’ take on having been a CIO and talking about what is the future. It struck me because we were talking about enterprise IT today, and so often I’ve looked at a CIO or CTO and wondered, what is their role once they’ve set up their system? And let’s say you know you’re going forward with some form of agile development process and hopefully you’re doing some kind of DevOps or trying to automate. But, now what? So, if the whole idea is to be agile—and Tobias was describing it earlier—if you have autonomous teams and they’re working directly with the business, so now what? What is IT leadership, and what is their role? I’ve just gotten started, so there are no spoiler alerts here, but I think the premise is so interesting. There is so much legacy thinking that goes on at the higher levels of IT, so much technical debt, cultural debt, so I’m excited to see where this goes and what ideas come out of it.
Tobias: It’s a great book, and so is Gene Kim’s DevOps Enterprise Summit series of conferences, terrific talks, and great thinking there. I love it. On my end, I run a startup, so there’s not a lot of time for reading, but The Obstacle Is the Way definitely resonates with me as a long-time reader of Marcus Aurelius’ Meditations.
Two other great books, and very much pertaining to this discussion, are Tracy Kidder’s The Soul of a New Machine, which recounts how the team at Data General stretched themselves every single day, how nothing worked initially, how they were fiddling with CPU boards, hooking up oscilloscopes to debug the boards and these kinds of things—everything they did they did for the first time. An excellent and truly inspiring account.
The second great book that I think everyone in engineering should devour because the mindset behind it is so inspiring and so pertinent to what we talked about is Gene Kranz’ Failure Is Not an Option. Gene was the flight director of the Apollo program, and it is just amazing what these guys were able to do with just a couple of amperes and just a little bit less than what would be a Texas Instruments calculator today. Ring binders and cigarettes were everywhere. It is mind-boggling to me how they even dared to attempt what they did, their respect for details, their respect for project management, and their respect for dealing with the unpredictable.
And again, it always comes back to what I think we need to apply to systems as well. If you think of Apollo 13, the crew made it back to Earth, not because the mission was well-engineered. It returned because it was operated well. That’s what makes the difference. That’s why all successful projects are ultimately run, managed. If you build something bigger than you, you need not just planning, you need to manage. Likewise, when you build large, complex systems, where everything is essentially bric-a-brac, you need to manage. And bric-a-brac is a good thing! If you want to build something that is supposed to work under any conditions, then you’re back to a really slow velocity because everything needs to be tested—you’re essentially pouring concrete. Alternatively, if you need results fast, you do bric-a-brac, you stitch things together as well as you can and manage the unpredictable. That’s the continuum that’s available to you, and you as a business need to decide where you want to be: do you want to work in the Stone Age, pouring concrete, or do you want to be nimble and agile?
Tyler: Well, that’s awesome, and I think that’s a good closing for this episode of Adventures in DevOps. I want to thank our guest Tobias Kunze and our fellow panelist Jeffrey Groman.
Jeff: It was a pleasure!
Tobias: Any time. Thanks for having me!