Cloud native applications are complex. This is all well and good until something goes awry.
Then, it is easy to get lost in details and miss the forest for the trees. There is so much data: metrics, logs, traces, and events vie for attention and lure engineers into often time-consuming and frustrating troubleshooting endeavors. After all, it takes a lot of diagnostics and sleuthing to understand what is going on and get a sense of what needs to be done.
Amidst the scramble, it can be difficult to prioritize with a cool head. As engineers, we want to know what’s going on and what we can do. But should we do it? Now?
This is where Service-Level Objectives (SLOs) come in. SLOs help SREs triage incidents by setting a clear standard of operational quality. As long as the standard is met, issues can be disregarded. If the standard is not met, issues must be addressed. The power of this approach lies in its simplicity.
Service-Level Agreements (SLAs) are also helpful because they set a clear—and typically laxer—standard of operational quality that a customer can expect. SLAs are important because they let teams communicate with customers more effectively during times of distress. Because SLAs typically follow customer priorities, they can also help guide the formulation of internal SLOs.
In short, SLOs define the operational standard SREs manage towards internally, whereas SLAs encode the operational standard that the business and a customer agree on. At all times, meeting your SLOs should imply that your SLAs are met as well.
That’s the high level. Below, I want to dive a bit deeper into what makes good SLOs.
A Service-Level Objective is a specific, measurable technical goal that a service ops team aims for (and commits to) in running the service. It can be used as a “technical contract” between teams but is not intended as an SLA. SLOs are typically defined technically in terms of Service Level Indicators (SLIs)—headline metrics that help capture the spirit of the SLO. In a sense, SLOs translate an SLA’s high-level, customer-facing provisions into specific, measurable benchmarks to meet.
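To make this concrete, here is a minimal sketch of what "defining an SLO in terms of an SLI" looks like in practice. The request counts and the 99.9% target are hypothetical; in a real system, these numbers would come from a metrics store rather than hard-coded values.

```python
# Minimal sketch: an availability SLI measured against an SLO target.
# The request counts below are illustrative; a production setup would
# query them from a metrics backend.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic observed, so no observed failures
    return good_requests / total_requests

SLO_TARGET = 0.999  # 99.9% availability

sli = availability_sli(good_requests=999_342, total_requests=1_000_000)
print(f"SLI: {sli:.4%}, SLO met: {sli >= SLO_TARGET}")
```

The SLI is the raw measurement; the SLO is the judgment call layered on top of it ("this SLI must stay at or above 99.9% over the measurement window").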
Needless to say, the goal of any SRE is to meet or exceed a service’s SLO. As long as that is the case, errors do not matter—by definition.
However, when the service threatens to fail to meet the SLO—in other words, if the “error budget” is in danger of becoming exhausted—SREs are expected to switch to a low-risk mode of operations to ensure the SLO is met again.
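The error budget mentioned above is simple arithmetic: it is the slack between perfection and the SLO target. A rough sketch, using an assumed 99.9% availability SLO over a 30-day window and illustrative downtime numbers:

```python
# Minimal error-budget accounting sketch (illustrative numbers only).
# A 99.9% availability SLO over 30 days leaves 0.1% of the window as
# the budget for downtime or failed requests.

SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Total error budget: {budget_minutes:.1f} minutes of downtime")

# Suppose 30 minutes of downtime have already been consumed this window:
consumed_minutes = 30
remaining = budget_minutes - consumed_minutes
print(f"Remaining budget: {remaining:.1f} minutes "
      f"({remaining / budget_minutes:.0%} left)")
```

When the remaining budget approaches zero, that is the signal to switch into the low-risk mode of operations described above: freeze risky deploys and prioritize reliability work until the budget recovers.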
In this sense, SLOs are the lodestars of SRE.
It is impossible to achieve a perfect 100% score when operating a service, and teams should spend their time on higher-ROI work rather than chasing every single issue. An SLO that codifies a level of service quality that is realistically achievable, while still meeting or exceeding the consumer's expectations, lets SREs treat a small amount of degradation or errors as acceptable and focus on that higher-ROI work instead.
The ability of SLOs to help SREs prioritize is perhaps the most critical capability in cloud native operations. With SLOs, SREs avoid getting caught up in small details because an SLO allows them to answer questions like: “What’s important to my customer?” or “Should I be worried and spend precious time remediating this problem? Or should I just not worry about the minutiae of this specific issue?”
As mentioned above, an SLA is a legal agreement between the service provider and a customer. It may reference specific SLIs, such as response time, but this is not required. An SLA can take any form.
According to Google, “An SLA normally involves a promise to a service user that the service availability SLO should meet a certain level over a certain period. Failing to do so then results in some kind of penalty.”
While SLAs help define business-level expectations and customer priorities, they often lack technical detail. This makes them hard to interpret, act on, or translate into concrete remediation steps.
As helpful as they are, SLOs can only tell you when you should think about doing something. They don’t actually tell you what to do.
That’s when most teams default to diving deep into diagnostics, abandoning the all-important big picture. The search for a root cause, for anything that can be tweaked to restore the SLO, becomes a wild goose chase. It feels a lot like missing the forest for the trees, and as a result, mean time to recovery (MTTR) increases unnecessarily.
Instead of “chasing every tree,” Glasnostic helps SREs guide diagnostics by providing big-picture visibility. It also lets you mitigate the incident while diagnostics continue in the background.
Glasnostic provides SREs with runtime control of how their services behave. In that sense, it acts as “air traffic control” for your cloud native production environment. With Glasnostic, you can maximize application reliability by auto-tuning how services interact.
Rather than requiring a lot of guesswork when an SLO is at risk, Glasnostic helps SREs meet their SLOs by letting them take action as soon as an incident is discovered.
SLOs are important because they help SREs focus on what really matters to service consumers and when remediation should take place. SLAs define service goals at the higher business level and are useful as well because they help SREs define effective SLOs.
However, SLOs still leave behind a gap: they don’t tell you what to do in order to actually remediate an issue as quickly as possible.
Glasnostic lets you take action quickly and effectively by regulating the flow between services. This helps optimize the reliability, performance and cost of your production environment.
Want to see whether Glasnostic is for you? Get started today for free!