Screenshot of Mario Kart ‘blue shell’ powerup
SREs can’t win without great tools.

Four SRE Tools Everybody Needs

Site reliability engineers might just be seen as modern-day wizards by the rest of their organization. They’re expected to work with modern cloud native applications, which are more complex than ever, and utilize both DevOps and Site Reliability Engineering skills. In addition, these engineers must play several roles when working with these complex apps, using powerful tooling, automating repeatable processes, and reducing toil (aka manual steps) wherever they can. All in all, site reliability engineering can be a tall order and requires the right kind of SRE tools.

What is the role of a site reliability engineer, really?

The answer to this question can vary, depending on who you ask. Google succinctly describes it as “what you get when you treat operations as a software problem.” Treating operations as a “software problem” has become necessary as cloud applications have moved ever higher “up the stack.” Unlike traditional applications, modern cloud native applications depend on a complex layer of underlying infrastructure, containers, runtimes, cloud services, orchestration and software delivery lifecycle automations. This “cloud stack” has become too complex to be managed by hand and must be automated. Hence the need to treat much of operations as a software problem.

But often, the complexity of cloud native applications means that SREs can’t follow a single “textbook definition” of what their jobs should look like. Instead, they fulfill a series of activities that vary depending on how their organization is set up.

According to Catchpoint’s “2021 SRE Report”, the majority of SREs spend a significant amount of their time

  • Responding to incidents or outages

  • Executing post-mortem analyses and/or write-ups

  • Participating in on-call rotation

  • Developing applications or capabilities

  • Experimenting or receiving training to expand knowledge or skills

  • Authoring business processes, rules, or best practices

  • Performing audits of usage/cost allocation

  • Spinning up new hosts/instances.

Graph of how SRE activities break down according to Catchpoint’s ‘2021 SRE Report’
What SREs did in 2021.

Types of SRE Tools

To execute all of these tasks, especially while focusing on increasing automation and avoiding toil, SREs need good tooling. Below are the four critically important tools any SRE should have.

Infrastructure-as-Code Tools

Infrastructure-as-code (IaC) tools let SREs provision infrastructure programmatically. Using IaC tools, SREs define how compute, network and storage services should be provisioned, sized and configured. For most organizations, this capability is a foundational prerequisite for automating the software delivery lifecycle, and automating the software delivery lifecycle is, in turn, a prerequisite for being able to deploy continuously.

Examples: Terraform, AWS CloudFormation, Ansible, Puppet.
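To make the idea of provisioning infrastructure programmatically more concrete, here is a minimal sketch of the declarative pattern such tools tend to share: describe the desired state, compare it with what exists, and derive the changes. It is a toy model and not the API of any tool listed above; the `Instance` type, the `plan` function and the resource names are assumptions made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical resource description: a named VM with a size and a disk."""
    name: str
    size: str
    disk_gb: int


def plan(desired: list[Instance], actual: list[Instance]) -> dict[str, list[str]]:
    """Compare desired state with actual state and list the changes needed."""
    want = {i.name: i for i in desired}
    have = {i.name: i for i in actual}
    return {
        "create": sorted(want.keys() - have.keys()),
        "update": sorted(n for n in want.keys() & have.keys() if want[n] != have[n]),
        "delete": sorted(have.keys() - want.keys()),
    }


# Desired state lives in version control; actual state would come from the cloud provider.
desired = [Instance("web-1", "medium", 50), Instance("web-2", "medium", 50)]
actual = [Instance("web-1", "small", 50), Instance("db-old", "large", 200)]

changes = plan(desired, actual)
print(changes)
# {'create': ['web-2'], 'update': ['web-1'], 'delete': ['db-old']}
```

Because the desired state is plain data, it can be reviewed, versioned and applied repeatedly, which is what makes it a building block for delivery automation.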

Observability Tools

Next, SREs need observability tooling to diagnose issues quickly and conclusively, without having to set up specific monitors up-front. Unlike monitoring, which is designed to answer the particular questions for which you configured it, observability captures enough data in enough detail so you can answer questions you didn’t think of beforehand as well.

Observability tools provide insight into the application’s production performance and show how well it runs from a user’s perspective. These tools also collect strategic analytics that align the app’s continuous performance with SLAs and other operational guidelines.
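As a rough illustration of what checking performance against an SLA can look like, the sketch below computes availability from request counts and the share of allowed failures still unused. The 99.9% target and the request counts are assumed values, and the function names are hypothetical rather than taken from any of the tools discussed here.

```python
def availability(successful: int, total: int) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    return successful / total if total else 1.0


def budget_remaining(successful: int, total: int, target: float) -> float:
    """Share of the allowed failures not yet used; negative once the target is missed."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be between 0 and 1, exclusive")
    if total == 0:
        return 1.0  # no traffic, so nothing has been consumed
    allowed_failures = (1.0 - target) * total
    actual_failures = total - successful
    return 1.0 - actual_failures / allowed_failures


# Assumed numbers: one million requests in the period, 600 of them failed.
TARGET = 0.999  # assumed availability target of 99.9%
print(f"availability: {availability(999_400, 1_000_000):.4%}")
print(f"budget left:  {budget_remaining(999_400, 1_000_000, TARGET):.1%}")
```

With these numbers, 1,000 failures would be allowed and 600 have occurred, so roughly 40% of the allowance is left.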

The pillars of observability are

  1. Data points about software behavior that are both “high-cardinality” (fields with many distinct values, such as user or request IDs) and “high-dimensionality” (many fields per data point)

  2. Log messages emitted from software components as they execute

  3. Request traces that trace the execution of transactions across sets of services.

Examples: Dynatrace, AppDynamics, Datadog, New Relic, Honeycomb.
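The difference between monitoring and observability described above can be sketched with a few structured events. The example assumes that each request emits one event with several fields; the field names and values are invented for illustration. Because the raw events are kept, a question nobody configured in advance, such as "is one customer slow everywhere?", can still be answered afterwards.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical structured events, one per request (all field names are assumptions).
events = [
    {"service": "checkout", "region": "eu-west", "customer": "c-17", "status": 200, "ms": 42},
    {"service": "checkout", "region": "eu-west", "customer": "c-99", "status": 500, "ms": 910},
    {"service": "checkout", "region": "us-east", "customer": "c-17", "status": 200, "ms": 38},
    {"service": "search", "region": "eu-west", "customer": "c-99", "status": 200, "ms": 870},
    {"service": "search", "region": "us-east", "customer": "c-42", "status": 200, "ms": 55},
]


def mean_latency_by(events, field):
    """Group events by any field and return the mean latency (ms) per group."""
    groups = defaultdict(list)
    for event in events:
        groups[event[field]].append(event["ms"])
    return {key: mean(values) for key, values in groups.items()}


# A pre-configured monitor might only ever group by service ...
print(mean_latency_by(events, "service"))
# ... whereas the raw events also answer a question that was not planned for.
print(mean_latency_by(events, "customer"))
```

Grouping by `customer` shows that the slow requests belong to a single customer across both services, which the per-service view alone would not reveal.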

Incident Management Tools

Cloud native applications grow ever more complex and can evolve rapidly over time. This makes it challenging to ensure that code and configuration are always correct and that there are always sufficient resources available. In addition, cloud native applications can have a large number of direct and indirect dependencies, further complicating the operational equation. As a result, incidents are unavoidable.

But incidents are also high-stress events, and it is vital that the incident response is carried out as collaboratively and with as little friction as possible. Incident management tools provide SREs with the workflow automation they need to focus on the incident, triage alerts, appoint an incident commander, and run the quickest process to resolution without missing steps, duplicating efforts, or running in circles.

Examples: PagerDuty, Squadcast, Splunk On-Call.
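Two of the workflow steps mentioned above, triaging alerts and appointing an incident commander, can be sketched in a few lines. This is a simplified illustration and not how any of the listed products work internally; the `Alert` fields, the severity levels and the engineer names are assumptions.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}  # assumed severity levels


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert: which check fired on which service, and how badly."""
    service: str
    check: str
    severity: str


def triage(alerts):
    """Drop duplicate alerts, then order the rest from most to least severe."""
    unique = list(dict.fromkeys(alerts))  # keeps the first occurrence, preserves order
    return sorted(unique, key=lambda a: SEVERITY_ORDER[a.severity])


def pick_commander(on_call, unavailable=frozenset()):
    """Return the first available engineer in the on-call rotation, or None."""
    return next((name for name in on_call if name not in unavailable), None)


incoming = [
    Alert("db", "disk", "warning"),
    Alert("api", "5xx", "critical"),
    Alert("db", "disk", "warning"),  # duplicate of the first alert
    Alert("api", "latency", "info"),
]

for alert in triage(incoming):
    print(alert.severity, alert.service, alert.check)
print("commander:", pick_commander(["ana", "ben"], unavailable={"ana"}))
```

Deduplicating before sorting keeps responders from chasing the same signal twice, and a deterministic commander rule removes one decision from a stressful moment.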

Runtime Control

Finally, SREs also need a way to quickly see what is going on and respond to issues as they arise. This means they need to be able to

  1. See “the forest, not the trees,” so they can quickly identify where runtime pressures and undesirable behaviors originate and

  2. Exert control over production behaviors at runtime without having to tweak configurations or rely on development to work on a patch.

Without a way to do these two things, engineers are unable to contextualize runtime behaviors and can only apply crude, imprecise measures, like rolling back deployments or rebooting instances.
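One way to picture control over production behavior at runtime is a rate limit that an operator can change while the system keeps running. The sketch below is a generic token bucket and makes no claim about how any particular product implements runtime control; the class name and parameters are assumptions, and time is passed in explicitly to keep the example deterministic (a real system would more likely read a monotonic clock).

```python
class RateLimit:
    """Token bucket whose rate can be changed while the system is running."""

    def __init__(self, per_second: float, burst: int, now: float = 0.0):
        self.per_second = per_second
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.updated = now

    def _refill(self, now: float) -> None:
        """Add the tokens earned since the last update, capped at the burst size."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.per_second)
        self.updated = now

    def allow(self, now: float) -> bool:
        """Admit one request if a token is available at time `now` (in seconds)."""
        self._refill(now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def set_rate(self, per_second: float, now: float) -> None:
        """Operator action: change the rate at runtime, without a redeploy."""
        self._refill(now)  # settle tokens earned at the old rate first
        self.per_second = per_second


limit = RateLimit(per_second=10, burst=2)
print([limit.allow(0.0) for _ in range(3)])  # the burst of 2 is used up by the third call
limit.set_rate(1, now=0.0)                   # throttle a noisy client during an incident
print(limit.allow(0.5), limit.allow(1.0))
```

Compared with rolling back a deployment or rebooting instances, adjusting a limit like this is a narrower intervention: it targets one interaction and can be reversed immediately.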

Summary

As the industry’s first runtime control solution, Glasnostic fills this gap in today’s SRE tooling.

While eminently valuable in themselves, infrastructure-as-code, observability and incident management tools force SREs to spend hours on diagnostics and then design and release a patch.

With runtime control, engineers are able to see the forest, not the trees, identify immediately where runtime pressures originate, and know what needs to be done, whether proactively, ahead of time, or reactively, when issues arise, and whether manually or automatically.

As a result, reliability is maximized, security enforced, performance and costs are optimized, and the customer experience is ensured.

Try it for yourself!