Presentation: What Breaks Our Systems: A Taxonomy of Black Swans

Track: Architecting for Success when Failure is Guaranteed

Location: Broadway Ballroom North, 6th fl.

Duration: 4:10pm - 5:00pm

What You’ll Learn

  1. Find out about some of the incidents that can happen in production and take systems down.
  2. Hear about some of the strategies that can be employed to discover such incidents during canarying, and how to address them.

Abstract

Black swan events: unforeseen, unanticipated, and catastrophic issues. These are the incidents that take our systems down, hard, and keep them down for a long time.

By definition, you cannot predict true black swans. But black swans often fall into certain categories that we've seen before. This talk examines those categories and how we can harden our systems against these categories of events, which include unforeseen hard capacity limits, cascading failures, hidden system dependencies, and more.

Question: 

Tell me a bit about yourself.

Answer: 

I'm a Senior Staff Production Engineer at Slack; I've just started here, so I don't yet know what my work will involve. This is my second week. At Google, I used to run large production services. I cared primarily about reliability and availability: is my service up, is my service performing well, are my users happy?

Question: 

Tell me why you wrote this talk.

Answer: 

One of the things that I think we are not great at as an industry is thinking about outlier events, particularly with the way that SRE has gained popularity in the last three or four years. People started talking a lot about SLOs, error budgets, and maintaining those on a week-to-week or month-to-month basis. I think this is great, but one of the things that gets lost in that view of the world is thinking about systemic risks: what are the things that can go really, really wrong with my systems? What are the things that could take me down for hours, days, maybe even longer, potentially even destroy the business? I think that worrying about those kinds of scenarios should be a core part of the job of site reliability engineers, production engineers, and senior engineers. That's what this talk is about.
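As a rough illustration of the week-to-week bookkeeping mentioned above, the sketch below computes how much of a monthly error budget has been spent. The 99.9% target, 30-day window, and function name are assumptions for illustration, not details from the talk; the point is that this kind of slow-burn tracking says nothing about rare catastrophic events.

```python
# Minimal sketch of error-budget arithmetic, assuming a 99.9% availability
# SLO measured over a 30-day window. Numbers and names are illustrative.

WINDOW_MINUTES = 30 * 24 * 60      # 30-day measurement window
SLO_TARGET = 0.999                 # 99.9% availability target

def error_budget_remaining(bad_minutes: float) -> float:
    """Return the fraction of the window's error budget still unspent."""
    budget_minutes = WINDOW_MINUTES * (1 - SLO_TARGET)   # ~43.2 minutes
    return 1.0 - (bad_minutes / budget_minutes)

if __name__ == "__main__":
    # e.g. 10 minutes of downtime so far this window
    print(f"Budget remaining: {error_budget_remaining(10):.1%}")
```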

Question: 

How do you get your mind around predicting an unknown-unknown?

Answer: 

We can't predict an unknown-unknown. The crux of this talk is that my unknown-unknown is maybe an incident that you had six months ago, or it might be something that follows a pattern. An example that I use pretty early on in the talk is canarying. Over the last few years the practice of canary testing has become extremely common: we take our code or config or any kind of change and deploy it on a subset of our machines early on. The point of doing this is so that we can tell if it's going to fail catastrophically, or, even if there are smaller regressions, we can look at them and understand the impact before we deploy to our full system. What we're trying to do with canarying is not to find any specific bug, but to create a generic defense against regressions or breakages caused by change. It's not perfect but it's pretty effective. What this talk tries to get at is: what are the other emergent best practices that we're going to see in the next few years? What are the things that we can do that can defeat entire categories of black swans that might otherwise take our systems down?
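To make the canarying idea concrete, here is a minimal sketch of the kind of check a rollout pipeline might run: deploy to a small cohort, compare an error-rate signal against the baseline fleet, and stop if the canary looks worse. The metric, threshold, and function names are illustrative assumptions rather than anything specific described in the talk.

```python
# Sketch of a canary health check: compare a canary cohort's error rate to
# the baseline fleet and block the rollout if the regression is too large.
# Thresholds and names are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class Cohort:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

def canary_is_healthy(canary: Cohort, baseline: Cohort,
                      max_relative_regression: float = 0.5) -> bool:
    """Allow the rollout only if the canary's error rate is no more than
    50% worse (relatively) than the baseline's."""
    allowed = baseline.error_rate * (1 + max_relative_regression)
    return canary.error_rate <= allowed

if __name__ == "__main__":
    baseline = Cohort(requests=1_000_000, errors=200)   # 0.02% errors
    canary = Cohort(requests=10_000, errors=5)          # 0.05% errors
    if canary_is_healthy(canary, baseline):
        print("Proceed with rollout")
    else:
        print("Roll back the canary")
```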

Question: 

When you talk about the canary, do you talk about the idea of a canary, or do you go in and actually talk about how to effectively run the canary? How deep do you go?

Answer: 

This is not a canary talk, so I'm not going to talk in any great depth about canarying. What I will spend more time on is the actual production incidents. I think I've got something like 15 incidents in this talk, divided across six different categories, and I'll spend most of the time on those. For example, the first category that I'm going to look at is hitting limits that you didn't know were there. I've got five or six different incidents to discuss under that category, everything from connection limits on Amazon Web Services to Postgres transaction IDs. There are some strategies to try and find these problems ahead of time. I've got more time in this slot to go into some more specifics, both on the incidents that I'm going to discuss, and more particularly on the defenses.
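As one hedged illustration of the "unknown limits" category (not material from the talk itself), a periodic check like the sketch below can surface Postgres transaction-ID wraparound before it turns into an outage. The connection string, alert threshold, and function name are assumptions.

```python
# Rough sketch of a "find the limit before it finds you" check for Postgres
# transaction-ID wraparound: alert when any database's transaction-ID age
# approaches the ~2 billion hard limit. DSN and threshold are illustrative.

import psycopg2  # assumes the psycopg2 driver is installed

WRAPAROUND_LIMIT = 2**31           # hard limit on transaction-ID age
ALERT_FRACTION = 0.5               # warn once half of the headroom is gone

def check_xid_headroom(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database;")
        for datname, xid_age in cur.fetchall():
            used = xid_age / WRAPAROUND_LIMIT
            if used >= ALERT_FRACTION:
                print(f"WARNING: {datname} has used {used:.0%} of its "
                      f"transaction-ID headroom")

if __name__ == "__main__":
    check_xid_headroom("dbname=mydb user=monitor")  # hypothetical DSN
```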

Question: 

Who is the core persona you're talking to?

Answer: 

I think that this talk really works for everybody; it's something everybody should be learning early on in their careers. The truth is that we all need to keep up with how the industry is moving, and that's what I've tried to do with this talk: distill some of the knowledge about catastrophic failure that the industry as a whole has gained in the last five years. If you are okay with your system going away for a week, if that wouldn't damage your business, then this talk is not for you. If that would upset you, come to this talk.

Question: 

What would you want someone to walk away from this talk with?

Answer: 

Walk away with an awareness of some of the risks that our systems face that we don't think about day to day and that don't show up when we think about SLOs and monthly error budgets.

Speaker: Laura Nolan

Site Reliability Engineer @Slack, Contributor to Seeking SRE, & SRECon Steering Committee

Laura Nolan's background is in Site Reliability Engineering, software engineering, distributed systems, and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly 'Site Reliability Engineering' book and contributed to the more recent 'Seeking SRE'. She is a member of the USENIX SREcon steering committee.
