Photo of Jack-o’lanterns at night
Source: unsplash.com/@rooney

Operational Monsters in the Cloud

Just in time for Halloween, Mike Valladao and I talked on Gigamon’s Navigating the Cloud podcast about a growing problem in cloud operations: monsters, those Large-Scale Events (LSEs) that tend to hit us when we least expect them. We touch on a number of topics around operational monsters.

Below is the transcript of the episode, edited for clarity.

• • •

Mike: Today, we begin episode four with a quote from Pirates Of The Caribbean, the movie, which reads: “You’re off the edge of the map, mate. Here there be monsters.” Why a quote like this? Because today we’re going to focus on some of the operational monsters that lie in wait for us in the cloud.

My guest today is Tobias Kunze. Tobias is a serial cloud entrepreneur. His first startup company was an early innovator with Platform-as-a-Service (PaaS). In 2010, that company was acquired by Red Hat, which has since rebranded the portfolio into the OpenShift Platform. So, for those of you running Docker containers managed by Kubernetes, you may already be using some of this guy’s earlier work.

Today, Tobias is the CEO and founder of Glasnostic based in Menlo Park, California. His latest company deals with cloud operations, services, and application control. But before I start talking about how all this ties with operational monsters, I want to welcome Tobias to our program today. Glad to have you on the show!

Tobias: My pleasure. Thanks, Mike.

Mike: Great. I’m really delighted to have you here. I think you’re working on some interesting technologies, and that’s what we’re here to talk about. It’s the technologies that run the whole cloud operations.

In the Cloud, it’s always Day-2

Mike: And I did mention your company deals with CloudOps. That sounds distinctly like Day-2 operations. What does that mean?

Tobias: That is absolutely right. We simplify Day-2 operations. And what “Day-2” means is: whatever happens once software is running. You may recall Day-1 is really the configuration, the deployment, the initial setup of software. And then, once it’s running, the only thing we really do today is to monitor. Maybe we trace things, too. But if something goes awry in whatever way, if we catch it at all, the only thing we can really do today is to roll back.

Mike: How long does Day-2 last? It depends, I guess. How long does it last in a normal situation?

Tobias: In a small environment, Day-2 starts on deployment and ends when that deployment is torn down. But if you think about it, if you have 10 or 20 or even 400 teams deploying in parallel, Day-2 is really always. You deploy something new, and it’s the same Day-2 because that new deployment is connected to everything else that’s running. There’s not really a single application anymore; it is a landscape of applications and services.

Mike: Okay. So just to say what’s going on here: Day-0 is when you’re setting up what you’re going to build, and you build the stuff that you’re going to build. Day-1 is when you release it and deploy it into production. And then Day-2 is everything that happens until that application is redeployed or gotten rid of.

Tobias: Exactly. Day-0, Day-1 and Day-2 mean build, deploy, run.

What is “Cloud Application Control?”

Mike: You focused this company on application control. What brought you in this direction?

Tobias: A very simple thought, really. We used to think that operational success means providing resources and fixing software. But our tech stack grows taller by the day. So, the resource level is pretty far removed today from what applications do at the top! You need to control what’s going on there. Think about it. You are running many pieces in many places today. They are all connected directly or indirectly…

Mike: …But haven’t we always done that? I mean, we’ve always had three-tier applications, right?

Tobias: Yes. But these three-tier applications are very small. You know who they are. You know what the tiers are. You build these just like you’ve always built applications: you create a blueprint, you develop it, you release it, and you know exactly how you’re going to iterate on it.

If you have a landscape that consists of many of these applications that are connected, however—because, guess what, the interesting bits are always outside of your application, so you have to call into dependencies—if you have such an application landscape, the situation is very different because now you’re running in different clouds, you’re running across different VPCs, you’re running on premises, in hybrid scenarios, maybe on edge—and all these pieces need to work together in a predictable fashion. And that is where the application level really matters. Here we can’t really try to fiddle with things on the resource or code level anymore.

Mike: Why not? Why can’t you just fiddle with it to make it work? And besides, you can always change the code, can’t you?

The New (Fast) Way of Developing and Updating Apps in the Cloud

Tobias: That’s one of the things that engineers always like to think, right? Something happens, so you go, “Oh, let me look at it, let me spend a couple of hours and figure out what’s going on, and then I’m going to release a patch.” So, my fix will go into production hours later at best. That, of course, doesn’t work anymore if you have hundreds of these things running, potentially hundreds of fixes being deployed all the time. And keep in mind, this is a very dynamic environment: it is rapidly evolving! We’re not merely scaling out; we are releasing new versions of something, we are putting new pieces next to existing pieces, then turning the old ones off, and so forth. All these cloudy processes really create and amplify the dynamism of the landscape.

Mike: I like “cloudy” here because that’s exactly what it is. The infrastructure has gone to such an “n-tier-plus” architecture that there are so many other things taking place. And because of that, it is more difficult to figure out where things are going in these applications.

Meet the Cloud Monsters

Mike: So where do the monsters come into play? What monsters are out there? Help us understand.

Tobias: You’ve heard about the “Noisy Neighbors.” There are things we don’t see. The old Unix “top” utility even shows you “Steal Time”—time you didn’t get from the hypervisor. Our code runs in a virtual world, and outside of that, there are all kinds of things happening.
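For readers who want to see the “Steal Time” figure Tobias mentions outside of top: on Linux, the same counter is exposed as the eighth value on the aggregate cpu line of /proc/stat. The sketch below compares two samples of that line and returns the share of CPU time that was stolen in between. The function names are hypothetical, and the reader function assumes a Linux host.

```python
def _cpu_fields(line: str) -> list[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu" or len(parts) < 9:
        raise ValueError("expected an aggregate 'cpu' line with a steal field")
    return [int(value) for value in parts[1:]]


def steal_fraction(before: str, after: str) -> float:
    """Share of CPU time stolen by the hypervisor between two samples.

    Field order on Linux: user nice system idle iowait irq softirq steal
    guest guest_nice. Guest time is already counted inside user and nice,
    so only the first eight fields are summed for the total.
    """
    b, a = _cpu_fields(before), _cpu_fields(after)
    total = sum(a[:8]) - sum(b[:8])
    stolen = a[7] - b[7]
    return stolen / total if total > 0 else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the aggregate cpu line (Linux only; it is the first line of the file)."""
    with open(path) as handle:
        return handle.readline()
```

Taking two readings a few seconds apart with read_cpu_line() and passing them to steal_fraction() gives a rough view of how noisy the neighbors on the same host currently are.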

Mike: What is a “Noisy Neighbor?”

Tobias: Oh, very simple. I’m calling a database, and somebody else is hammering it. That’s a Noisy Neighbor. I don’t get the quality of service I expect because somebody else hogs the resource.

Mike: What can you do about it?

Tobias: Well, as a developer, you really can’t do much about it, but operators, of course, could balance the resource allocation properly.

Mike: Okay. So, before we get to what you do with some of these, let’s go through some of the others then—the other monsters here. You mentioned Noisy Neighbors. What other things are out there?

Tobias: Well, you’ve heard about “Thundering Herds,” right? Mass effects that happen when a lot of processes do the same thing at the same time. One of those brought Robinhood down a year ago. Then there are other monsters with flowery names around, like “Retry Storms,” which everybody has probably seen more often than they would like.

Mike: Let’s talk about Robinhood. What happened in that case? Because you’re right. Robinhood, as everyone knows, is a huge fintech company that allows you to buy and sell stocks, among other things. What happened to them?

Tobias: At a very high level—and I’m sure we probably don’t know the entire picture here, but that was the public message—their DNS services were overrun. As in “too many services trying to access them at the same time.” So, DNS went down, and it took a long time to get them back up. That’s always the story, something like that.

Clearly, if they had had the ability to exert backpressure at this point, i.e., if they could have just slowed things down a little bit—not drastically, just a little bit—they could have recovered their service in no time.

Mike: Isn’t that counter-intuitive? Don’t we always want things to go faster? Why apply backpressure?

Tobias: More important than speed is control. There is no point in stepping on the gas pedal and ending up in the wall. What you really want is speed and control: services that communicate with each other speedily but in a controlled way, without killing each other.

And that control is really a top-down capability. It is a management perspective, where we say, “Well, we are managing to a certain Service Level Objective. We want to make sure the services over there have enough headroom, and there’s enough capacity over here,” and so forth.

We also need to be able to control feedback loops. Developers are familiar with the term “Circuit Breaking.” That same concept happens on the operational side. Circuit breaking is a form of control, where I disengage a direct coupling that is about to bring the business down. These are the things that lead to feedback loops, cascading effects, compounding ripple effects, and so forth.
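As a rough illustration of the circuit-breaking idea mentioned here, the sketch below shows the developer-side version of the pattern in Python. It is a minimal sketch under stated assumptions, not Glasnostic’s implementation: the class name, the failure threshold, and the cool-down period are all hypothetical, and the clock is injectable only so the behavior can be checked without waiting.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Disengage a failing dependency after repeated errors (hypothetical sketch)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down has elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling more load onto a dependency that is already struggling, which is the decoupling Tobias describes.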

Mike: Those are starting to sound the same. It sounds like they’re all convoluted here. Is that true?

Tobias: Well, if you look at the tip of the iceberg, you’ve got a couple of names. Below that, though, there aren’t a lot of names because, frankly, these events get complex really fast. If you analyzed them, you’d discover that they tend to be highly non-linear chains of events. It’s always something that starts out, say, as a CPU issue here, then becomes a latency issue over there, then a retry storm somewhere downstream from there and then finally turns into general slowness somewhere else. Meanwhile, of course, your systems throw thousands of alerts…

Cloud Operations Is Like Air Traffic Control

Mike: I think you said something here that’s very important because, as you shine a light on it, you may find that it may not be the monster you think it is. It may be something cute and furry under your bed, but the fact is the monster may be a little bit different or in a different place. And doesn’t that all come down to having the proper visibility, knowing what’s out there and being able to see what’s there?

Tobias: Exactly. What happens in Day-2 is largely unpredictable. So, the instrument cluster in your cockpit, like our tracing and debugging tools, doesn’t really help us here. You’re only looking at one plane’s cockpit. You need a radar screen. You need to get the full, global visibility of who is affecting whom, who is talking to whom. You need to answer critical questions like, “Should they be talking at all? Is what we are seeing here what we expect to see? Is it plausible?” These kinds of questions need to be answered to guide us to where we actually want to exert control.

Mike: Okay. Let’s take this analogy that you’re using here because I am a pilot. And so, if I’m up flying from one place to another, traffic control really doesn’t care why I’m flying from one place to another. They just want to separate me from the other airplanes. Is it the same thing that you’re talking about here?

Tobias: Exactly. As a pilot, your responsibility is to get the plane from the ground at point A to the ground at point B. And that’s your top and only priority. Air traffic control, on the other hand, is not responsible for that at all. Their responsibility is the airspace. Glasnostic works just like that. We are not looking at what a single application or a single thread of execution does, because our responsibility is the entire application landscape, the monsters outside your thread of execution. We are looking at the cloud operation as a whole. We have a holistic view and need to make sure that all the services running there don’t step on each other, don’t fall over each other.

Mike: Now, again, we do have visibility into a lot of this stuff. But you just said “services.” That’s a little bit different because we have packet visibility; we have information about what’s going on—that could be observability, log information, and so on. We’ve got all that at our fingertips, and that helps us. But what you’re talking about, the interaction is at the service level. Is that correct?

Tobias: Yes. I say “services” because, increasingly, applications are really just a handful of services. They may loosely “belong” to each other and maybe share a Git repository. But I am avoiding the term “application” here because the key difference today is that these services are connected with many other services. In the old world, when we had our two- or three-tier applications and needed to update their data, we would have an ETL process on the back end. But in today’s applications, data no longer comes from a database. It comes from five or maybe even 500 other applications. And those applications have different technologies, different life cycles, belong to different departments, come from joint ventures, or could be at external partners. They’re all over the place.

Visibility Into CloudOps

Mike: So, I have to ask the question: how do you see this kind of information? What technologies are you using? What technologies are available? It sounds like there must be a huge, huge database if you’re going to interconnect every service? It’s like taking everything that’s in every airplane. Is that the way you’re doing it?

Tobias: Back in the day, if you wanted to understand how a complex J2EE application worked, you’d look at how classes interact—you’d instrument the class loader. Because our applications today are connected, that class loader is the network. That’s what ties everything together, and the lingua franca is the wire data, the wire exchange. So, we are looking at the network, but we’re not looking at the network in a ThousandEyes or Kentik sense. That is, we don’t look at network peering, routing and so forth. We look at how services interact at the highest level of the SDN—as close as possible to the actual service endpoint. That’s where we measure component behaviors.

Mike: It sounds more like policy-based routing or something like that?

Tobias: Absolutely. At the service level, though. If we detect something that’s pathological, we may do policy-based routing in the sense that we backpressure against that service.

How Does Backpressure in Cloud Environments Work?

Mike: Okay, backpressure. Talk about backpressure.

Tobias: Sure. Backpressure is the Swiss Army knife for reliability engineers. It is always some form of reduction or shaping of capacity. But with backpressure, you have your eyes set on the downstream systems you want to backpressure against—typically to relieve pressure from some upstream system or to calibrate resource utilization between downstream services.

Backpressure is to cloud architecture like a resistor is to electric circuits. Once you avail yourself of the ability to exert it, you can’t live without it. That’s why it is often the first step we see customers make towards runtime control. And the basic truth is: If you leave workloads unrestrained in your environment, they can do whatever they want. They can completely kill your performance because all of a sudden, they may go into a tailspin or start spewing tracing logs, or, you know, killing your bandwidth. These things happen all the time. So backpressure in this context means to surgically limit what a set of endpoints does. You literally push against them. You rate-limit what can happen. That’s a stabilization technique, and that’s what makes it the Swiss Army knife. In fact, I make it a habit to read every public postmortem on outages I can find, and virtually all of them could have been solved immediately with just backpressure.
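One common way to “rate-limit what can happen” for a set of endpoints is a token bucket. The sketch below is only an illustration of that general technique, assuming a single process and hypothetical names and parameters (TokenBucket, rate, burst); the interview does not say which mechanism Glasnostic uses.

```python
import time


class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum tokens the bucket can hold
        self.clock = clock           # injectable so behavior can be tested
        self.tokens = float(burst)   # start full: an initial burst is allowed
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if the request may proceed now, False if it must back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller that gets False back is expected to wait or shed the request; that refusal is the pressure pushed back against the workload.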

Mike: Really? That’s a big statement. So, in the banking industry, when everybody comes in to look at their paychecks, sometimes what happens is the bank thinks they are in a Denial-of-Service Attack because everybody’s hitting the bank at the same time. And in your case, what you’re saying here is things can be done differently?

Tobias: Absolutely! A DoS attack means you get swamped with so many malicious requests, you can’t serve the legitimate ones. Now, of course, you could detect where the bad requests come from and cut them off. But that’s a game of whack-a-mole. Much better to shape those requests down to a minimum, so the attacker thinks they are DoS’ing themselves.

If it’s not a DoS attack, you should still exert backpressure to shape the demand, because it is much better to serve everybody a bit slower than not to serve some. And while you are shaping the demand, you can take whatever steps are necessary to scale up. That way, your scaling actions don’t create additional ripple effects.
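The difference between shaping demand and rejecting it can be shown with a few lines of Python. In this sketch, which assumes a single queue and uses a hypothetical function name, every request is eventually served; requests that arrive faster than the target rate are simply delayed so that they leave evenly spaced.

```python
def shape_arrivals(arrivals, rate):
    """Pace requests to at most `rate` per second by delaying, never dropping.

    `arrivals` is a list of arrival times in seconds, in ascending order.
    Returns a list of (start_time, delay) pairs, one per request.
    """
    interval = 1.0 / rate
    next_free = 0.0
    schedule = []
    for arrival in arrivals:
        # A request starts when it arrives or when the previous slot ends,
        # whichever is later.
        start = max(arrival, next_free)
        schedule.append((start, start - arrival))
        next_free = start + interval
    return schedule
```

With a burst of four simultaneous requests and a rate of two per second, the last request waits 1.5 seconds but still gets served, which is the “serve everybody a bit slower” trade-off described above.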

Mike: So, you ease into the situation.

Tobias: You ease into it by shaping your destiny. As soon as you have runtime control, a lot of these monsters are not really a problem anymore.

Mike: But to have runtime control, I think the big part of that is the C part, the control. If you’re going to do this, you’re going to have to do it actively, correct? So, does that mean that you’re all inline?

Tobias: Yes. Absolutely.

Insert Bulkheads to Prevent Drift

Mike: What else can be done other than backpressure? What do you do?

Tobias: You can—and probably should—insert bulkheads into your cloud architecture. As I mentioned earlier, application landscapes can be unpredictable. And by inserting well-defined bulkheads…

Mike: …Explain these terms because not everyone listening to this podcast will be familiar with them. What do you mean by “bulkheads?”

Tobias: Sure. The term “bulkhead” comes from shipbuilding, where you don’t want your ship to sink when something punctures your hull. To prevent the ship from sinking, you partition the hull by adding walls so if the hull is punctured, only that compartment is flooded, but the ship stays afloat. It is a very simple yet very effective preventive technique.

Mike: And how do we apply that to the cloud?

Tobias: We can do the same thing in the cloud. For instance, there is configuration drift all over the place, and most of the time, you don’t even realize it. You run something over there in this zone, you run something in that zone, and they shouldn’t be talking to each other. But of course, you don’t segment zones because critical services should still be able to fail over. And then configuration drift happens, and some services start talking cross-zone. In order to prevent such issues, you may want to insert a bulkhead between them that only allows a certain number of interactions between these zones. Everything above that limit would then be automatically slowed down so much that the drift would be immediately apparent. So, this is a symmetric way of protecting the zones from each other. Because if you can’t do that, what typically happens is you notice the cross-zone interactions when you get the bill at the end of the month, and that’s too late. Meanwhile, your engineers will have spent a lot of time trying to figure out why things are so slow.
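The cross-zone bulkhead Tobias describes can be sketched as a per-window allowance with a deliberate slowdown above it. Everything in the sketch below is an assumption made for illustration: the class name, the fixed 60-second window, and the 5-second penalty are hypothetical, and a real system would enforce the delay in the network path rather than return it to the caller.

```python
class ZoneBulkhead:
    """Allow a limited number of cross-zone calls per window; slow down the rest."""

    def __init__(self, allowance, window=60.0, penalty=5.0):
        self.allowance = allowance  # cross-zone calls per window at full speed
        self.window = window        # window length in seconds
        self.penalty = penalty      # delay in seconds for calls above the allowance
        self._windows = {}          # zone pair -> [window_start, call_count]

    def delay_for(self, src_zone, dst_zone, now):
        """Return how long (in seconds) this call should be delayed."""
        if src_zone == dst_zone:
            return 0.0  # traffic inside a zone is not affected
        # One counter per unordered pair, so the limit is symmetric.
        key = frozenset((src_zone, dst_zone))
        state = self._windows.get(key)
        if state is None or now - state[0] >= self.window:
            state = [now, 0]
            self._windows[key] = state
        state[1] += 1
        return 0.0 if state[1] <= self.allowance else self.penalty
```

Legitimate failover traffic within the allowance passes untouched, while drift that pushes cross-zone calls above it shows up as an obvious slowdown rather than as a surprise on the monthly bill.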

Risks of Application Ops in the Cloud

Mike: Okay. So that’s the key. What are the risks there? There have to be risks because now you could bring down something you don’t want to bring down or slow down something you don’t want to slow down. How do you mitigate that?

Tobias: Obviously, there is a risk that people could be doing something wrong. Ultimately, though, that is a permission issue: Do you want to allow your operators to do this or that? This is not really a technical issue. You need to decide how much you trust your operators. I would say: “Trust and help them with everything you have!”

Mike: Tobias, I want to back up a second because we’ve been talking about these “monsters.” However, in reality, the whole cloud infrastructure has given us so much. It allows us to scale. It allows us to roll things out so much quicker. These are all positive things. How big are the monsters in the big picture here?

Tobias: It depends on the complexity and velocity of your cloud architecture. These monsters grow with your landscape—linearly at best, exponentially at worst. There are three characteristics you need to look out for that drive this: complexity, utilization and rate of change. Crucially, those are all things you ultimately want—that are signs of success—because if you can handle complexity and velocity at a high utilization rate, you can move fast.

Mike: Those are the three primary constructs here. Okay. Continue on.

Tobias: So, complexity means you’re running a lot of services. Very likely with independent life cycles. That’s simply a reflection of what you do as a business. More services are a reflection of your business doing more. Yes, some of these get refactored over time, and you will turn some of these services off, and so forth, but that also increases your rate of change.

Also, if these systems are really active and highly utilized—that means they’re doing something, which is also something you want. You don’t want systems idling and maybe just waiting on the next quarterly report. And the more active they are, the more load they put on the rest of the architecture.

So, that’s really it. The number and the scale of these “monsters” depend on the complexity of your landscape—how large and connected it is—the amount of utilization, and how rapidly it evolves over time.

Mike: And it sounds like the monsters will be bigger or more prevalent in the larger infrastructures. So, if you have a smaller infrastructure that you’re using for some cloud-native work and you have a handful of developers, that won’t be leading to the big issues.

What Types of Organizations Need to Pay Attention to Cloud Monsters?

Mike: What type of infrastructures are you seeing, where you need to exert backpressure, where you would benefit from bulkheads? Where are you seeing that really take place?

Tobias: In an abstract way, anything that has, say, 50 or more logical services that are fairly highly connected, and you have at least about five teams deploying into that landscape independent of each other. Now add scaling to that, multi-region setups, cross-region hybrid setups, etc. and the unpredictability of your landscape will increase dramatically.

Unpredictability also depends on utilization, as I mentioned. If you are only looking at 5% utilization in your landscape, nothing is ever going to go wrong; you can basically do anything. But if you are close to maximally utilized, it becomes exponentially more difficult to get operations right. So, utilization is a very important aspect. Of course, utilization is also directly linked to the cost aspect of cloud. Low utilization means high cost, so the pressure from the business is to get high utilization.
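One textbook way to see why high utilization leaves so little margin is the single-server M/M/1 queue, where mean response time is the service time divided by (1 - utilization). This is a simplifying assumption (random arrivals, one server, no other interactions), not a model of any particular cloud landscape, and the function name is hypothetical.

```python
def mean_response_time(service_time, utilization):
    """Mean time in system for an M/M/1 queue: service_time / (1 - utilization)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # Print how a 10 ms service time stretches as utilization climbs.
    for rho in (0.05, 0.50, 0.90, 0.95, 0.99):
        print(f"{rho:.0%} utilized: {mean_response_time(10.0, rho):.1f} ms")
```

Under these assumptions, response time barely moves at 5% utilization, doubles at 50%, and grows without bound as utilization approaches 100%, so small disturbances have far larger effects in a busy system.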

Mike: What companies would most likely fit these scenarios? Give me some types of companies.

Tobias: Our bread and butter are service providers: managed service providers, cloud service providers, hybrid service providers—essentially anyone who runs a lot of other people’s workloads.

Mike: Highly complex, very large…

Tobias: Highly complex, very large and unpredictable. Our customers’ customer workloads do all kinds of weird things, often without their customers even knowing. So how do you control that? How do you run this safely in a multi-tenant environment? Or, even if the environment is dedicated, there will be shared infrastructure underneath in some shape or form, which creates additional complexity and dependencies.

Mike: What about government entities like the Defense Information Systems Agency? DISA runs some huge environments. Are those areas that should be of concern?

Tobias: Yes. You have all these governmental infrastructures that are multi-region because they built their own clouds. They tie data centers together and make them look like one, so you get unpredictability right off the bat because now I don’t know where my pieces run, but they behave very differently physically. So yes, absolutely. The additional issue with governments is that you are often required to have tenant or client separation between entities and classification levels. So it gets complex very quickly.

Mike: Just to recap here, it sounds like we need to be most cautious of having these types of “monsters” when we go to scale. And that could be any large company, any telco, any governmental entity—it really doesn’t matter. But it comes down to scale and complexity.

Hey Developers: In the Cloud, You Can’t Fix Monsters With Code

Mike: So, Tobias, we’ve talked about how this works for the applications. How does it really change things for developers?

Tobias: Developers are an interesting breed because they’re insulated from what’s going on in Day-2. By and large, they’re working in Day-0: they’re building things. They’re sometimes taking part in the release cycle, which is “Day-1,” but whatever happens in Day-2, they only hear from the monitoring people, right? “Oh, something’s gone wrong here. Can you diagnose it?”

Mike: They’ve got their heads buried in another area. So how do you bring that together?

Tobias: Yes, what they feel though, very intimately, and what very directly affects them, is that, when they release some piece of code that works, has been unit tested, and is deployed correctly—one out of whatever, two, three or four times, something else breaks. So, someone else gets mad because your code broke something, for whatever reason. You stepped on something, and it’s not in your IDE, it’s not in the staging environment. It’s in production, as it happened today.

Mike: But it worked fine in testing!

Tobias: Exactly, it works until it breaks. And that is an experience that every developer is deeply familiar with. Now the key question here is really: do you think you can fix this in code? And the very clear answer to that is: no, you can’t. And if you did fix it in code, you would cause another problem tomorrow, because fixing code means making your code more specific, which means it’s less resilient, and now it’s causing another outage or another degradation tomorrow.

So that is the paradigm shift where we need to realize that, in the old world, we really had to deal only with two things. If we had correct code and enough resources, our production would be fine. But now, in the new world, there’s this whole new component: my dependencies. Services I depend on and other services that depend on me. The universally high degree of connectedness, the entanglement. And almost all of these dependencies I don’t own as a developer.

Mike: Tobias, this is a vital point that could easily be missed. In cloud environments, programmers no longer control everything with code. Years ago, when I was in charge of a healthcare master provider system, I was personally responsible for every change to every doctor, pharmacy, or hospital across the system. And just like you’re saying, I did it all in code because it was a closed environment. However, what you’re saying is that today’s cloud environments intermingle cloud resources and services that all work together, and there could be thousands of these. So just as you stated, individual developers no longer own the entanglement. In fact, they can’t. That’s why operational control has to be performed outside of the developers, at a level above the codebase. As a result, technologies such as yours perform interaction control across the systems. They’re looking for “monsters,” and when they find them, they apply backpressure or other control primitives to keep things flowing.

I think this is great information today for our listeners, and if they want to learn more about this topic, where should they look?

Tobias: Our website is glasnostic.com—as in “Glasnost and Perestroika” for the older folks. You can also reach me at tobias@glasnostic.com, and I’m @tkunze on Twitter. My direct messages are open, too.

Mike: Great. I’ll list information on your blogs; you have some cool topics there. I’ll just throw out a couple. There’s one about Kindergarten Ops: How to Control Application Behaviors. I also love Mission Control for Microservices. And of course, if people have questions about what we’re talking about today, the Gigamon Community has lots of different options for you to ask questions and get answers.

So, with that, I want to thank you very much today, Tobias. It’s been a lot of fun, and I think we’ve got some cool things going on here. Thanks for everything you’ve done.

Tobias: Thanks for having me!