Stampede in the Serengeti
Be prepared. (Source: andbeyond.com)

“Thundering Herd” or Sound of Customers?

This Monday, markets rallied by 4.7% and Robinhood was down the entire day, leaving its users out in the cold, unable to capitalize on the rally. The outage sparked a social media outcry that saw a new “Robinhood Class Action” Twitter account amass over 6,500 followers.

What happened? What could possibly cause a day-long outage, followed by several more hours of downtime the next day?

Robinhood’s founders explained on Tuesday that “unprecedented load” due to “highly volatile and historic market conditions, record volume, and record account sign-ups,” among others, led to a “thundering herd” effect that eventually brought down Robinhood’s DNS system.

Classic “thundering herd” effects tend to cause only temporary resource exhaustion. Clearly, this was something more severe, and we have to assume that the founders were referring to a sustained onslaught of traffic somewhere in their service landscape. In other words, an apparently unexpected, massive surge of demand caused a time-tested, famously resilient and critical service such as DNS to fail. Of course, once name resolution is down, everything disintegrates.

We can only speculate as to what exactly happened, but this much is clear: Robinhood did not go down due to a “thundering herd” effect. That “thundering herd” is merely the symptom of a series of knock-on effects that were allowed to take place undetected and ended up spiraling out of control.

Service landscapes always operate in a state of degradation. Robinhood consists of a complex landscape with thousands of services, from pricing to security management, account infrastructure, risk monitoring, fraud detection, transaction monitoring, options trading, and so on. The chain of events that caused the “thundering herd” may have started with an elevated number of pricing queries, which, on a day with higher-than-usual trading volumes, may have hit a caching layer that was already struggling with longer-than-usual latencies from external sources, and so forth. The exact circumstances rarely matter.

Complex emergent behaviors are always large-scale, non-linear and disruptive.

What matters is that such behaviors are always

  • Complex: they involve many chains of events across many systems,
  • Large-scale: they affect vast areas of the service landscape,
  • Unpredictable: they are emergent (and thus inescapable) and, above all,
  • Non-linear: i.e., a latency issue can quickly morph into a CPU issue, then turn into a network issue.

This is precisely why “thundering herd” is not a useful concept for analyzing the issue. It is essential to look beyond the symptom and get to the large-scale, non-linear, complex sets of events that hide behind it.

In real time.

In times of crisis, operators can’t afford to sift through heaps of data or look for “root causes.” They need to quickly detect, and respond to, complex interactions between systems. They need effective visibility and effective control: universal, “golden” wire signals that capture the essence of what is going on, and best-practice operational patterns that lead to predictable remediation. First responders look at vital stats and administer urgent care. Operators need to look at golden signals and apply operational patterns. At a minimum, every operator must be able to apply backpressure, circuit breaking and bulkheads anywhere, between arbitrary sets of services, and in real time.
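To make one of these operational patterns concrete, here is a minimal sketch of circuit breaking in Python. It is an illustration only, not how Robinhood or any particular product implements it. The names `CircuitBreaker`, `CircuitOpenError`, `max_failures` and `reset_after` are hypothetical, and the thresholds are arbitrary assumptions. The idea is that after a run of consecutive failures, calls to a struggling service are rejected immediately for a cooling-off period, instead of adding to the load.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the sketch is testable
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast instead of piling more load onto a struggling service.
            raise CircuitOpenError("downstream call rejected")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold; a failed trial call in the
            # half-open state re-opens the breaker immediately.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

An operator-facing system would wrap calls between two services, for example `breaker.call(lambda: fetch_price("AAPL"))`, where `fetch_price` stands in for any downstream request. A production breaker would also need thread safety and a limit on concurrent trial calls, which this sketch leaves out.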

This is the essence of Mission Control.

We don’t know what the true nature of Robinhood’s outage was, but it is almost certain that the outage could have been avoided if they had been able to see the forest for the trees, spot the complex emergent behaviors leading up to it, and then do something about it.

A crisis is not the time to debug threads of execution or examine reams of forensic data. Distributed systems engineering may be the high art of software engineering, but it is also the slowest kind. In times of crisis, we need effective visibility and control. Screen and stabilize. Detect and remediate.

The bottom line is: as stressful as a “thundering herd” may be for your service landscape, it is ultimately a sign of a good thing: the sound of many customers.

Be ready. Control the herd.
