Presentation: Heretical Resilience: To Repair is Human

Track: Chaos, Complexity, and Resilience

Location: Soho Complex, 7th fl.

Duration: 11:50am - 12:40pm

Day of week:

Slides: Download Slides

Level: Intermediate

Persona: Architect, Developer, DevOps Engineer

This presentation is now available to view on InfoQ.com


What You’ll Learn

  1. Hear about the Apache SNAFU: how it started, how it unfolded, and the lessons learned along the way.

  2. Find out how to design tools, processes and systems to increase resilience.

  3. Learn how to help teams respond better to outage events.

Abstract

Resilient architecture is often thought of solely in terms of its technical aspects - with the right distributed system, automated failover, or fancy new orchestration software, we want to believe we can avoid the inevitability of failure. While it is certainly true that we can design our systems to be more robust, true resilience comes from humans. The humans in complex systems, and especially the human-computer interactions and interfaces, are what can really make or break the resiliency of these systems. This human-centric approach requires a different mindset than a solely infrastructure-focused one, but it is no less rigorous and encourages changes that are equally important, if not more so.

In this talk, I will describe the “Apache SNAFU” described in the SNAFU Catchers’ Stella Report, sharing my experiences as the instigator of that snafu and walking through the lessons that can be learned from such an event. Takeaways will include ideas for how to design tools, processes, and systems in ways that maximize the resilience and responsiveness of humans throughout engineering organizations.

Question: 

Tell me a bit about the work that you do today.

Answer: 

I'm currently working at Travis CI where I'm the lead of the build environment team. This team is working on the environment that allows our customers to run their builds - making sure that we can create, test, and update the environments where customer builds get run in a reliable manner, so that customers can continue to test as new software releases are coming out. I've been at Travis for just under a year now, and before that I spent three years doing web operations at Etsy, where I focused on provisioning, monitoring, and configuration management.

Question: 

What's the goal for the talk?

Answer: 

The goal is to get people to think about the human aspects of their systems. Many environments these days involve fairly complex systems with a lot of moving parts, and a lot of the talks that I've heard about resilience focus solely on the technical aspects of that. I feel like a lot of those discussions tend to neglect the fact that ultimately there are human operators who are developing these systems, maintaining these systems, and responding to pagers at 3AM when something goes wrong. What I want to do is focus on the human-system interactions, and how we can change or adapt the way that we think about building and maintaining systems to make them more reliable for the people using them and the people maintaining them. I think that we can make our technology more reliable, but I think true resiliency, that ability to repair systems in a sustainable manner, comes from human learning and interaction.

Question: 

Can you give me an example? Are we going to try and make the 3AM pager call less painful?

Answer: 

One thing that I've thought about a lot in terms of monitoring and on-call specifically is the idea of alert design. What information are we putting into the alerts? Particularly, are we putting some context in there? Let's look at a typical example of a disk space alert. If you get a disk space alert that says the disk is 90% full and it's 3am, how urgent is that alert actually? If it's been creeping up slowly to 90%, it can likely wait until morning. Or did it just spike up, so that if you don't do something about it now, something is going to go terribly wrong? Adding context means something like being able to put a graph into that alert, because just a number going over a threshold doesn't necessarily tell you what you need to know.
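
To make that concrete, here is a minimal sketch (not from the talk) of what trend-aware alert handling might look like; the sample format, the 90% threshold, and the four-hour horizon are all hypothetical assumptions:

```python
from datetime import datetime, timedelta

def classify_disk_alert(samples, threshold=90.0, horizon=timedelta(hours=4)):
    """Decide how urgent a disk-space alert is, rather than paging on the
    raw threshold alone. `samples` is a list of (timestamp, percent_used)
    tuples, oldest first."""
    first_ts, first_pct = samples[0]
    latest_ts, latest_pct = samples[-1]

    if latest_pct < threshold:
        return "ok"

    # Estimate the rate of growth over the window of samples we have.
    elapsed = (latest_ts - first_ts).total_seconds()
    rate_per_sec = (latest_pct - first_pct) / elapsed if elapsed else 0.0

    # Project forward: if the disk looks likely to fill within the horizon,
    # it's worth waking someone up; otherwise it can wait until morning.
    projected = latest_pct + rate_per_sec * horizon.total_seconds()
    return "page" if projected >= 100.0 else "notify"


now = datetime.now()

# 91% full but creeping up about 1% per day -> "notify", not a 3am page.
slow_creep = [(now - timedelta(hours=24), 90.0), (now, 91.0)]
print(classify_disk_alert(slow_creep))   # notify

# Jumped from 50% to 92% in ten minutes -> "page".
sudden_spike = [(now - timedelta(minutes=10), 50.0), (now, 92.0)]
print(classify_disk_alert(sudden_spike))  # page
```

In practice the same effect often comes from the monitoring system itself, for example by alerting on a projection such as Prometheus's `predict_linear()` and attaching a graph or dashboard link to the alert, so the responder sees the trend at a glance rather than a bare number.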

Question: 

Who's the main persona that you're talking to?

Answer: 

Mostly SREs and architects, could be developers as well. Anyone who is going to be responding to an incident in a system, whether they’re a developer or in operations, whether it's directly customer facing or not, could hopefully get value from this.

Question: 

What do you want someone who is responding to an incident to take away from your talk?

Answer: 

A lot of what I'm talking about is how we approach automation and orchestration. The Apache SNAFU is a story about when automation - in this case, configuration management - went terribly wrong, and how people were responding to that. When you automate away all the tedious bits or the boring bits of something, the parts that are left over tend to be more complex. In addition to that, over time things may have been automated for so long that people forget what is actually going on under the hood. When that goes wrong, when these complex tools that we've created to manage these complex systems go wrong, as every computer thing inevitably does at some point or another, how do we respond to that? Do we know how to repair our automation as well as the systems being automated? In addition, I want people to think about how we can make sure that we are also learning from these incidents. I've worked at places over the years where everyone was always so busy firefighting that nobody ever had the time to do anything more than that, more than just putting a quick bandaid fix on something. When you're in the midst of firefighting, sometimes that's all you have time to do in the moment. But that's not good in the long run - it's not how you develop organizational learning, it's not how you develop resilience. Instead it usually ends up making things more and more fragile.

Question: 

How do you build it in, making sure that you're learning not firefighting all the time?

Answer: 

It's something that has to get buy-in throughout the organization - on teams and throughout management - because you have to be able to build that into your schedule. One thing that I've done is build some of that into time estimates and project planning. At some point we're going to have an incident, we're going to have to respond to it in the moment, and we're going to have remediation items to follow up on. Having the organizational flexibility to do that is necessary to developing resilience. That means thinking about which teams have the most incidents and giving those teams more flexibility, and allowing people to communicate with each other more directly when they need help. One of the interesting things about the Apache SNAFU was how quickly other people from a wide variety of teams were able to jump in and help. That worked because it was a culture where they had enough slack in their schedules to say, “hey, there was an incident. I know I said I was going to do this thing today but I ended up jumping in and helping out with this instead.” Making that sort of flexibility ok from a cultural perspective, and making it ok for people to make mistakes and learn from them, is key to building a resilient organization.

Speaker: Ryn Daniels

Staff Infrastructure Engineer @travisci

Ryn Daniels is a staff infrastructure operations engineer at Travis CI who got their start in programming with TI-80 calculators back when GeoCities was still cool. These days, they have opinions on things like monitoring, on-call usability, and Effective DevOps.  Before escaping to the world of operations, they spent a few years doing R&D and systems engineering in the corporate world. Ryn lives in Brooklyn with a perfectly reasonable number of cats and in their spare time can often be found powerlifting, playing cello, or handcrafting knitted server koozies for the data center.
