Site reliability engineers might just be seen as modern-day wizards by the rest of their organization. They’re expected to work with modern cloud native applications, which are more complex than ever, and utilize both DevOps and Site Reliability Engineering skills. In addition, these engineers must play several roles when working with these complex apps, using powerful tooling, automating repeatable processes, and reducing toil (aka manual steps) wherever they can. All in all, site reliability engineering can be a tall order and requires the right kind of SRE tools.
The definition of site reliability engineering varies depending on who you ask. Google succinctly describes it as “what you get when you treat operations as a software problem.” Treating operations as a “software problem” has become necessary as cloud applications moved ever higher “up the stack.” Unlike traditional applications, modern cloud native applications depend on a complex layer of underlying infrastructure, containers, runtimes, cloud services, orchestration and software delivery lifecycle automation. This “cloud stack” has become too complex to be managed by hand and must be automated. Hence the need to treat much of operations as a software problem.
But often, the complexity of cloud native applications means that SREs can’t follow a single “textbook definition” of what their jobs should look like. Instead, they fulfill a series of activities that vary depending on how their organization is set up.
According to Catchpoint’s “2021 SRE Report”, the majority of SREs spend a significant amount of their time:
Responding to incidents or outages
Executing post-mortem analyses and/or write-ups
Participating in on-call rotation
Developing applications or capabilities
Experimenting or receiving training to expand knowledge or skills
Authoring business processes, rules, or best practices
Performing audits of usage/cost allocation
Spinning up new hosts/instances.
To execute all of these tasks, especially while focusing on increasing automation and avoiding toil, SREs need good tooling. Below are the four critically important tools any SRE should have.
Infrastructure-as-code (IaC) tools let SREs provision infrastructure programmatically. Using IaC tools, SREs define how compute, network and storage services should be provisioned, sized and configured. For most organizations, this capability is a foundational prerequisite for automating the software delivery lifecycle, which in turn is a prerequisite for being able to deploy continuously.
Examples: Terraform, AWS CloudFormation, Ansible, Puppet.
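To make the declarative idea concrete, here is a minimal Python sketch of what such tools do conceptually: compare a desired state with what is actually running and derive the actions needed to reconcile the two. The `Resource` type, the `plan` function and the sample resources are all hypothetical names invented for illustration; this is not how Terraform or any of the tools above is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical description of one piece of infrastructure."""
    kind: str  # e.g. "compute", "network", "storage"
    name: str
    size: str


def plan(desired, actual):
    """Return the (action, resource) pairs needed to turn `actual` into `desired`."""
    desired_by_key = {(r.kind, r.name): r for r in desired}
    actual_by_key = {(r.kind, r.name): r for r in actual}
    actions = []
    for key in sorted(desired_by_key):
        if key not in actual_by_key:
            actions.append(("create", desired_by_key[key]))
        elif actual_by_key[key] != desired_by_key[key]:
            actions.append(("update", desired_by_key[key]))
    for key in sorted(actual_by_key):
        if key not in desired_by_key:
            actions.append(("delete", actual_by_key[key]))
    return actions


# Desired state as it might be declared in version control (made-up values).
desired = [
    Resource("compute", "web", "medium"),
    Resource("storage", "logs", "100GB"),
]
# State currently found in the environment (made-up values).
actual = [
    Resource("compute", "web", "small"),
    Resource("network", "legacy-vpc", "default"),
]

for action, res in plan(desired, actual):
    print(action, res.kind, res.name, res.size)
```

Because the plan is computed from a declared state rather than from a sequence of manual steps, running it again once the environment matches the declaration yields no actions, which is what makes the approach repeatable.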
Next, SREs need observability tooling to diagnose issues quickly and conclusively, without having to set up specific monitors up-front. Unlike monitoring, which is designed to answer the particular questions for which you configured it, observability captures enough data in enough detail so you can answer questions you didn’t think of beforehand as well.
Observability tools provide insight into the application’s production performance and show how well it runs from a user’s perspective. These tools also collect strategic analytics that help align the app’s ongoing performance with SLAs and other operational guidelines.
The pillars of observability are:
Large amounts of rich data points about software behavior, where fields can take many distinct values (“high-cardinality”) and each data point carries many fields (“high-dimensionality”)
Log messages emitted from software components as they execute
Request traces that trace the execution of transactions across sets of services.
Examples: Dynatrace, AppDynamics, DataDog, New Relic, Honeycomb.
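The difference between monitoring and observability described above can be illustrated with a small Python sketch. It assumes events are stored as wide records with many fields, so that a question nobody planned for can be answered after the fact by slicing on any field. The event fields, the sample values and the `slice_by` helper are hypothetical and do not reflect the data model of any product listed above.

```python
from collections import defaultdict

# Hypothetical wide events: each record carries many fields, some of them
# high-cardinality (user_id can take a very large number of distinct values).
events = [
    {"service": "checkout", "user_id": "u-1001", "region": "eu", "duration_ms": 120, "status": 200},
    {"service": "checkout", "user_id": "u-2002", "region": "us", "duration_ms": 900, "status": 200},
    {"service": "search", "user_id": "u-2002", "region": "us", "duration_ms": 750, "status": 500},
    {"service": "search", "user_id": "u-3003", "region": "eu", "duration_ms": 80, "status": 200},
]


def slice_by(events, field, predicate):
    """Count the events matching `predicate`, grouped by an arbitrary field."""
    counts = defaultdict(int)
    for event in events:
        if predicate(event):
            counts[event.get(field)] += 1
    return dict(counts)


# Questions that were not configured up front can still be asked later.
slow_by_user = slice_by(events, "user_id", lambda e: e["duration_ms"] > 500)
errors_by_service = slice_by(events, "service", lambda e: e["status"] >= 500)

print(slow_by_user)
print(errors_by_service)
```

A pre-configured monitor would only answer the one question it was built for, such as "is checkout latency above a threshold?". Keeping the raw, wide events is what allows grouping by user, region or any other field once an unexpected problem shows up.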
Cloud native applications grow ever more complex and can evolve rapidly over time. This makes it challenging to ensure that code and configuration are always correct and that there are always sufficient resources available. In addition, cloud native applications can have a large number of direct and indirect dependencies, further complicating the operational equation. As a result, incidents are unavoidable.
But incidents are also high-stress events, and it is vital that the incident response is carried out as collaboratively and frictionlessly as possible. Incident management tools provide SREs with the workflow automation they need to focus on the incident, triage alerts, appoint an incident commander, and run the quickest process to resolution without missing steps, duplicating efforts, or running in circles.
Examples: PagerDuty, Squadcast, Splunk On-Call.
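As a rough illustration of the triage step, the following Python sketch collapses duplicate alerts for the same service and check into one entry, then orders the result by severity so responders see the most urgent item first. The alert fields, the severity levels and the `triage` function are assumptions made for this example and are not the API of any tool named above.

```python
# Hypothetical severity ranking: lower number means more urgent.
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


def triage(alerts):
    """Collapse duplicate alerts and order what remains by severity."""
    grouped = {}
    for alert in alerts:
        key = (alert["service"], alert["check"])
        entry = grouped.setdefault(key, {**alert, "count": 0})
        entry["count"] += 1
        # Keep the most severe level seen for this service/check pair.
        if SEVERITY_ORDER[alert["severity"]] < SEVERITY_ORDER[entry["severity"]]:
            entry["severity"] = alert["severity"]
    return sorted(
        grouped.values(),
        key=lambda a: (SEVERITY_ORDER[a["severity"]], a["service"]),
    )


# Made-up alert stream containing repeats of the same underlying problem.
alerts = [
    {"service": "api", "check": "latency", "severity": "warning"},
    {"service": "api", "check": "latency", "severity": "critical"},
    {"service": "db", "check": "disk", "severity": "warning"},
    {"service": "api", "check": "latency", "severity": "warning"},
]

for item in triage(alerts):
    print(item["severity"], item["service"], item["check"], "x%d" % item["count"])
```

Grouping before paging is one simple way to avoid duplicated effort: three alerts about the same latency problem become a single critical item instead of three separate notifications.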
Finally, SREs also need a way to quickly see what is going on and respond to issues as they arise. This means they need to be able to:
See “the forest, not the trees,” so they can quickly identify where runtime pressures and undesirable behaviors originate and
Exert control over production behaviors at runtime without having to tweak configurations or rely on development to work on a patch.
Without a way to do these two things, engineers are unable to contextualize runtime behaviors and can only apply crude, imprecise measures, like rolling back deployments or rebooting instances.
As the industry’s first runtime control solution, Glasnostic fills this gap in today’s SRE tooling.
While eminently valuable in themselves, Infrastructure-as-code, observability and incident management tools force SREs to spend hours in diagnostics and then design and release a patch.
With runtime control, engineers are able to see the forest, not the trees, identify immediately where runtime pressures originate and know what needs to be done, whether proactively, ahead of time, or reactively, when issues arise, and whether manually or automatically.
As a result, reliability is maximized, security is enforced, performance and costs are optimized, and the customer experience is ensured.