On the Glasnostic blog, we often talk about how the complex cloud-native environments we run today are prone to disruptive interaction patterns that can bring the business down. While this is certainly important to talk about, mere words do not do justice to the depth of the operational crisis we’re facing. Examples are much better suited to reminding us that failures can occur at any company, at any time. And as companies adopt more and more microservices, the extent, severity and rate of disruption will only increase.
With or without a “black swan” like COVID-19, infrastructure and operational issues with microservice architectures occurred frequently throughout 2020. While some might be chalked up, at least in part, to the chaos caused by the pandemic, most are simply a stark reminder of the complexity and interconnectedness of modern service environments.
Below, I am looking at some of the notable outages that occurred this year and what we can learn from them. This is not to point fingers or call anyone out. On the contrary, it is meant to show that these issues can happen to anyone and that there are specific techniques and capabilities we should invest in if we want to reduce downtime in the future.
A big shout-out to the intrepid operators who published these post-mortems for others to learn from!
Back in February, GitHub experienced a series of four outages related to their use of ProxySQL. First, a hefty query was accidentally run against the master of a MySQL cluster, exhausting its capacity. This led to query delays that clogged the ProxySQL connection pool and ultimately brought the cluster down. The next day, a maintenance task caused another load spike that again squeezed ProxySQL and caused the same outage. A few days later, active database connections exceeded a critical file descriptor limit, this time disrupting database writes. Finally, two days later, changed query patterns introduced by an application update again caused load levels that affected the availability of all dependent services.
Of course, the hefty query, maintenance task and application update were merely triggers. The outages were caused by system limits that were—“unexpectedly”—exceeded: the size of ProxySQL’s connection pool and the number of file descriptors. Also, it seems fair to say that better visibility into interaction behaviors—load distributions and, essentially, “who is talking to who, when and how much”—would have been handy during the diagnostic and remediation phases, along with an ability to exert classic backpressure to reduce load levels.
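To make the idea concrete: classic backpressure at a connection pool can be as simple as refusing to block indefinitely once the pool is exhausted, so that excess load is pushed back to callers instead of piling up. This is a minimal illustrative sketch, not GitHub's or ProxySQL's actual implementation; the pool size and timeout are invented.

```python
import threading

class BoundedPool:
    """Reject work instead of queueing it unboundedly when the pool is full."""
    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self, timeout=0.1):
        # Returns False quickly instead of blocking the caller forever;
        # the caller can then signal "busy" upstream (backpressure).
        return self._slots.acquire(timeout=timeout)

    def release(self):
        self._slots.release()

pool = BoundedPool(max_connections=2)
granted = [pool.try_acquire() for _ in range(3)]
# the first two acquisitions succeed, the third is pushed back
```

The crucial design choice is the bounded wait: a pool that blocks forever converts overload into cascading latency, while a pool that fails fast lets upstream services throttle themselves.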
When rallying markets caused a flood of requests at the beginning of March, Robinhood went down for the day, causing massive losses for their customers. According to the founders, the flood of requests led to a thundering herd effect that overwhelmed their DNS.
Obviously, had they had the ability to detect this behavior and, again, exert classic backpressure, they could have simply pushed back against the thundering herd and continued to operate. At a minimum, they could have limped along while increasing their DNS capacity. Alas, the full day of outage was followed by additional outages over the next two days.
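A standard client-side defense against thundering herds, hypothetical here rather than anything Robinhood actually runs, is full-jitter exponential backoff: each client waits a random amount within an exponentially growing window, so retries spread out instead of arriving in lockstep.

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0, seed=None):
    """Full-jitter exponential backoff: each retry waits a random amount
    within an exponentially growing window, so a fleet of clients spreads
    its retries out instead of stampeding the service all at once."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        window = min(cap, base * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
        delays.append(rng.uniform(0, window))
    return delays
```

Because every client draws independently from the window, the retry load arrives as a smear rather than a spike, which is exactly what an overwhelmed DNS tier needs.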
Also in March, a malformed service name led to crashing service discovery modules in Discord’s frontend workers, which created an outage. Remediation consisted of throttling incoming traffic as well as stopping database maintenance processes. Unfortunately, the frontend crash frequency also caused their supervisor process to fully restart about 50% of the nodes. In addition, etcd ran out of available watcher processes.
While this outage didn’t start with an “unexpected” limit, it eventually hit a restart rate limit and ran out of etcd watchers, exacerbating things considerably. Thankfully, the team was able to exert backpressure against (throttle) incoming traffic and to shed load from the database tier.
In March, a bulk update of data that expanded into a massive backlog of individual updates caused Google’s IAM service to exhaust memory and go down. This was exacerbated by cache interference and several other rollouts aimed at mitigating the outage. The incident required engineers to bulkhead live queries and serve stale data while teams increased memory and throttled backlog processing. Once order was restored, IAM was gradually ramped up to serve live data again, region by region.
The cause for this outage was a bulk update that snowballed into enough individual updates to—“unexpectedly”—hit the maximum available memory. Luckily, Google’s engineers were able to apply a bulkhead as well as backpressure against update processing, thus allowing the system to recover before gradually lifting the bulkhead.
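A concurrency bulkhead of the kind described can be sketched with a semaphore that caps how much of a backlog is in flight at once. This is an illustrative toy, not Google’s implementation; the limit and workload are invented.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Bulkhead:
    """Cap how many updates may be in flight at once so a huge backlog
    drains at a controlled pace instead of exhausting memory."""
    def __init__(self, max_in_flight):
        self._sem = threading.Semaphore(max_in_flight)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed, for the demo below

    def run(self, fn):
        with self._sem:  # blocks (backpressure) while the bulkhead is full
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return fn()
            finally:
                with self._lock:
                    self._active -= 1

bulkhead = Bulkhead(max_in_flight=4)
with ThreadPoolExecutor(max_workers=16) as executor:
    futures = [executor.submit(bulkhead.run, lambda: None) for _ in range(100)]
for f in futures:
    f.result()
# bulkhead.peak never exceeds 4, even with 16 worker threads
```

Gradually lifting the bulkhead, as Google’s engineers did, corresponds to raising `max_in_flight` step by step once the system shows it can absorb the load.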
In May, a performance bug triggered by a configuration change caused database load to increase at Slack, which in turn caused frontend web workers to scale up. This led to more connections, longer latencies and higher utilization. The record number of web workers also exceeded the number of available worker slots in the web proxy tier, which caused the registration of new web workers to fail. As a result, and as existing workers were deprovisioned, ever fewer workers remained registered, even though worker slots were full and a record number of workers was available. Once diagnosed, remediation involved rolling restarts of web proxies and removing the last active workers.
Again, while there were contributing factors such as dysfunctional alerting, the main culprit here was a hardcoded limit of web worker slots in the proxy tier that was hit—“unexpectedly.” And the ability to exert backpressure against incoming requests would have been useful throughout the outage, particularly if coupled with visibility at the service interaction level.
Also in May, Quay.io went down repeatedly due to a storm of connection requests that exceeded the database’s processing capacity. Since the database became unresponsive, it could not be diagnosed directly, but examining query logs for patterns to explain the storm did not yield results either. In addition, a coincidental OpenShift update proved unrelated. Service functionality was cut back as much as possible to reduce loads, and new infrastructure finally allowed the service to resume until a second outage hit days later. Frontend code was changed to enforce connection limits, which helped avoid a recurrence of the issue. Finally, the SRE team identified a changed request in OpenShift 4, together with a wasteful implementation of that request, as the cause. This quickly became a problem as the new version was adopted at scale.
As before, these outages were defined by a hard limit that was unexpectedly hit: the processing capacity of the database. At the same time, they were exacerbated by the lack of an ability to control quality of service between clients and a lack of visibility at the service interaction level. One should also note that the ability to apply bulkheads or exert backpressure would have remediated both outages right away, without a need to disable service functionality or make everything read-only.
In June, the price of Bitcoin rose to a level that caused high volumes of trading on Coinbase, which was unable to cope with the 5-fold increase within just 4 minutes. The spike caused latencies between services to increase, which in turn caused the frontends to be flooded by requests. Requests were first queued, then timed out and finally failed altogether. Crucially, auto-scaling was too slow to deal with the sudden surge. Of course, health checks competed with regular requests, failed accordingly, and the outage cascaded from there.
Again, the limit that was hit here rather “unexpectedly” was the speed with which the system could scale out. Backpressure could have turned this outage into a temporary degradation but, unfortunately, was not available. It would have also helped if health checks could have been granted a different quality of service.
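Turning a surge into a degradation rather than an outage usually means bounding the request queue and failing fast once it is full, instead of letting requests queue, time out and fail downstream. A toy sketch, with invented depth and surge numbers:

```python
from collections import deque

class SheddingQueue:
    """Bound the request queue; when it is full, shed new arrivals
    immediately (e.g. answer HTTP 503) instead of letting them queue
    up and time out somewhere downstream."""
    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.queue = deque()
        self.shed = 0

    def offer(self, request):
        if len(self.queue) >= self.max_depth:
            self.shed += 1  # fail fast: the client can back off and retry
            return False
        self.queue.append(request)
        return True

q = SheddingQueue(max_depth=100)
results = [q.offer(i) for i in range(500)]  # a 5x surge arrives at once
```

Shed requests return instantly, which keeps latency bounded for the requests that are admitted and gives auto-scaling time to catch up.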
This was not their only outage this year. In November, Coinbase accidentally included an external load balancer in a TLS certificate rollout, leading to near-total connectivity loss. A swift rollback caused a thundering herd problem, though, as all services tried to reconnect at the same time. This forced the team to kill connectivity so they could redeploy and scale out before turning traffic back on.
This incident stands out somewhat in that it was initially caused not by an accidental limit but by an honest configuration mistake. It is also notable because the initial remediation led to a follow-on outage. The ability to limit the blast radius of the rollback would have been helpful, as would have the ability to exert backpressure instead of simply turning traffic off.
AWS Kinesis employs a sophisticated mechanism to route processing requests to actual stream processors that involves synchronized frontend servers dispatching requests to sharded backend servers. This mechanism broke down in November. Frontends not only take care of authentication, throttling and routing—they also cache the backend shard map and run a separate thread for communication with every other frontend server. Also, frontend processing of, e.g., shard maps competes with incoming requests, making a frontend restart a dangerous and expensive operation. On that day, a small amount of new capacity was added to the frontend fleet. This was followed by an outage. The initial hypothesis was that the outage was due to memory pressure. However, it was finally discovered that the addition of capacity caused all frontend servers to run out of operating system threads. Remediation involved a careful restart of fleets, “one by one.”
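The thread-per-peer design explains why adding capacity can exhaust operating system threads: each frontend keeps one thread per other frontend, so the per-server thread count grows linearly with fleet size, and the fleet-wide total quadratically. The fleet sizes below are hypothetical, not Kinesis’s actual numbers.

```python
def peer_threads(fleet_size):
    """One communication thread per *other* frontend server."""
    per_server = fleet_size - 1
    fleet_total = fleet_size * per_server  # grows quadratically with the fleet
    return per_server, fleet_total

before = peer_threads(100)     # (99, 9900)
after = peer_threads(110)      # a ~10% capacity addition...
growth = after[1] - before[1]  # ...adds 2090 threads fleet-wide
```

The asymmetry is the trap: the operator sees a small, safe-looking capacity bump, while every existing server silently pays for it in threads.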
This is another example of an unknown limit being “unexpectedly” hit and causing an outage: the number of operating system threads. During remediation, backpressure was essential as it allowed the team to free up resources for internal processing during restart.
The one learning that stands out is that—except for the initial Discord outage and the thundering herd at Coinbase, which both began as functional incidents—every outage was caused by a limit that was reached or exceeded—“unexpectedly,” as we keep hearing.
All systems are finite, in particular, if we pretend they are not.
Of course, as these incidents show, this is all but unexpected! The story arc of an outage is always the same: “normal event meets conspiracy of factors to unexpectedly exceed a hitherto unknown limit, chaos ensues.” Yes, the specific details are always outlandish freak accidents that no one could have possibly predicted, but the fact that these unpredictable events unfold is entirely predictable.
As engineers, we need to stop pretending that software systems “work.” They work until they don’t. And it’s always something—something!—we didn’t expect and couldn’t possibly have foreseen. A black swan. It is time to recognize that there is an infinite number of “black swans.” They are called “crows.”
This leads to the second learning.
If we want to deal with unpredictability, we need high-level observability and real-time control. The trend to build new functionality by simply stitching more services together is inescapable and having real-time observability and control is the only way for businesses to assure the stability of their complex environments. We need to detect disruptive events at the service interaction level, rapidly, and we need to be able to respond to them, in real-time.
The need for real-time control becomes strikingly apparent in light of the incidents above: every single outage was either remediated with the help of classic backpressure or could have been mitigated by it into a mere degradation. Had they had the ability to exert backpressure, Coinbase could have prevented their frontends from flooding until their auto-scaling caught up, and Robinhood could have prevented their DNS services from going down. The ability to exert backpressure is perhaps the single most critical capability during remediation.
Other operational capabilities that were or would have been helpful during remediation of the above incidents include:
The ability to control quality of service for individual connections: Quay were unable to debug their database because it was clogged by other requests, Kinesis was unable to reboot frontends rapidly because incoming requests competed directly with required internal processing tasks, and Coinbase learned that running health checks against a clogged frontend fleet was not helpful.
The ability to set up bulkheads between endpoints: Quay could have accelerated their recovery by inserting a concurrency-based bulkhead in front of their database, and Google processed their update backlog in a controlled manner.
The ability to shed load: Discord was able to drop database maintenance tasks.
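Quality of service for individual connections can be approximated with a priority lane that always serves diagnostic and health-check traffic before bulk requests, so a clogged system can still be inspected. A minimal sketch, with invented lane names:

```python
import heapq

class PriorityScheduler:
    """Two-lane quality of service: diagnostic and health-check traffic
    (priority 0) is always dequeued before bulk traffic (priority 1)."""
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserves FIFO order within a lane

    def submit(self, request, priority):
        heapq.heappush(self._heap, (priority, self._seq, request))
        self._seq += 1

    def next(self):
        return heapq.heappop(self._heap)[2]

sched = PriorityScheduler()
for i in range(3):
    sched.submit(f"bulk-{i}", priority=1)
sched.submit("health-check", priority=0)
order = [sched.next() for _ in range(4)]
# the health check jumps the bulk queue
```

Had Quay’s diagnostic queries or Coinbase’s health checks traveled in a lane like this, they would not have competed head-on with the very traffic that caused the overload.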
However, nothing matches the magic of backpressure.
This leads to the third learning.
We need to stop obsessing about root causes. Like the true source of the Nile, there is no such thing. A “root cause” is merely something we feel comfortable enough about pointing a finger at. And it suggests that, somehow, going “deep” is valuable, when in fact it is almost always a rabbit hole and colossal waste of time.
Engineering is not about going “deep”—that’s science. It’s about making things work.
Attention to detail is a pernicious and counterproductive habit. We learned to analyze deeply in school because it makes tests easy to grade, and it’s become a habit we continue in our jobs. But, most of the time, in-depth analysis correlates negatively with great engineering. Making things work, thinking outside of the box to find solutions and understanding the big picture is what great engineering is all about.
It is high time we take a hard look at what our primitives are and what we need to observe. With today’s complex and dynamic environments, it’s not file descriptors and sockets or even error codes. It’s about how systems interact.
It is important to emphasize that the majority of companies listed here are not startups with limited resources. They are successful companies with exceptional engineering talent and sophisticated infrastructures. Yet, despite all of their technical prowess, they still experienced outages. Moreover, their outages are actually quite similar.
The reason for this is that systems today are highly connected, and applications are no longer islands. If your application is connected to other applications, you depend on those applications in some way, shape or form. That means your fate depends on how these systems behave. That is why failure in complex and dynamic service landscapes no longer occurs primarily due to defects in code. Failure occurs predominantly due to unpredictable interaction behaviors between systems. In other words, it is the environmental factors—how all components behave together, in aggregate—that matter, not an individual thread of execution.
Make 2021 the year to invest in detecting and controlling these behaviors.