QCon New York, June 15-19, 2020 | How Did Things Go Right? Learning More From Incidents

This presentation is now available to view on InfoQ.com

What You’ll Learn

How to change your thinking from "Why did things go wrong?" to "How did things go right?"
See ways to increase our learning from incidents
Learn to ask better questions and facilitate effective conversation

Abstract

Learning solely from failure isn't a fundamental; it's a limitation.

A look into the New View of Safety, Human & Organizational Performance, and Resilience Engineering shows us that safety, great performance, and sources of resilience do not come from the absence of failure, but rather the presence of adaptive capacity.

Navigating a perfect storm in a world where availability is made up and the 9's don't matter requires expertise. This talk will describe more rewarding ways to approach incident investigation without overly focusing on failure prevention.

What's going on when it seems like nothing is happening?
When failure does occur, what's going to keep it from being worse?
How do teams adapt successfully when preventative techniques fail?
How should we prioritize the effort to develop systems that help us safely manage the consequences of failure?

These questions cannot be answered by trying to explain causes of failure and fixing remediation items.

We will move the needle forward and increase our opportunity for learning from success with some fundamental and practical ways that get us from "Why did things go wrong?" to "How did things go right?"

Question:

What is the focus of your work today?

Answer:

I am part of a team at Netflix that we call the Core Team. At the moment, we're about ten people who are tasked with ensuring the availability of Netflix. One of the ways we do that is by helping the organization learn through the incident lifecycle. Our team has a variety of backgrounds that we are seeking to expand even further, but largely consists of people who have worked as software engineers and SREs in the reliability space on highly customer-focused products.

Question:

What’s the motivation for this talk?

Answer:

Learning from incidents in software is something we just don't get exposed to nearly as much as we should early on in our careers. Once we do encounter an incident, we tend to focus on why things went wrong. Highlighting and learning from our failures is important, but it's not enough. We need to respond to incidents with an eagerness to learn through holistic approaches rather than oversimplifications. I want to help move our industry past that limiting, gut reaction of, "How do we stop this from ever happening again?"

Question:

How would you describe the persona and level of the target audience?

Answer:

Anyone who has ever been involved in an incident.
People who have struggled with diagnosing a bug only to end up asking, ‘How did this even work in the first place?’
Leadership who experience incidents with high amounts of uncertainty (which is basically all of them).
Anyone who thinks that the point of an incident investigation is to find out what caused it.

Question:

What do you want this persona to walk away from your talk with?

Answer:

I want people to come away with a new way to think about incidents and some topics to begin questioning at their organizations. They will:

Know how to get rid of the templates and encourage investigations that people actually care to read.
Know how to ask better questions rather than stopping at a 'root cause'.
Know what conversations to start in their organizations to learn how work actually gets done.
Be able to find ways to ensure that the pressure to learn outweighs the pressure to fix an incident.

Question:

What do you feel is the most important trend in software right now?

Answer:

Coping with complexity, particularly in people wanting to 'automate everything'. This is a popular sentiment, and it needs to change. There is a strong desire to get to a world where we don't have to think about anything except the logic of our applications. Forget ‘serverless’; people want it to be 'thoughtless' and 'careless' too, and there’s this belief that adding more technology to our automation is the best way to do that.

We know we can't just tick a bunch of checkboxes to create and maintain a new feature. Why do we think we can do this with every bit of infrastructure and platform tech? We should be thinking about automation as a team player versus automation as a replacement for humans.

Everyone out there building a platform is struggling with how much expertise users of the platform really need in the underlying technologies so they can step in when the automation has problems. Instead, we should be approaching it as a collaborative endeavour. We need to design systems that enhance how people and software interact during joint activity. This is referred to as a 'Joint Cognitive Systems' view and is part of the field of Resilience Engineering.

Speaker: Ryan Kitchens

Site Reliability Engineering @Netflix

Ryan Kitchens is a Site Reliability Engineer on the Core Team at Netflix, where he works on building capacity across the organization to ensure its availability and reliability. Before that, Ryan was a founding member of the SRE team at Blizzard Entertainment.