Presentation: How Did Things Go Right? Learning More From Incidents
This presentation is now available to view on InfoQ.com
What You’ll Learn
- How to change your thinking from "Why did things go wrong?" to "How did things go right?"
- See ways to increase our learning from incidents
- Learn to ask better questions and facilitate effective conversation
Abstract
Solely learning from failure isn't a fundamental; it's a limitation.
A look into the New View of Safety, Human & Organizational Performance, and Resilience Engineering shows us that safety, great performance, and sources of resilience do not come from the absence of failure, but rather the presence of adaptive capacity.
Navigating a perfect storm in a world where availability is made up and the 9's don't matter requires expertise. This talk will describe more rewarding ways to approach incident investigation without overly focusing on failure prevention.
- What's going on when it seems like nothing is happening?
- When failure does occur, what's going to keep it from being worse?
- How do teams adapt successfully when preventative techniques fail?
- How should we prioritize the effort to develop systems that help us safely manage the consequences of failure?
These questions cannot be answered by trying to explain causes of failure and fixing remediation items.
We will move the needle forward and increase our opportunity for learning from success with some fundamental and practical ways that get us from "Why did things go wrong?" to "How did things go right?"
What is the focus of your work today?
I am part of a team at Netflix that we call the Core Team. At the moment, we're about ten people who are tasked with ensuring the availability of Netflix. One of the ways we do that is by helping the organization learn through the incident lifecycle. Our team has a variety of backgrounds that we are seeking to expand even further, but largely consists of people who have worked as software engineers and SREs in the reliability space on highly customer-focused products.
What’s the motivation for this talk?
Learning from incidents in software is something we just don't get exposed to nearly as much as we should early on in our careers. Once we do encounter an incident, we tend to focus on why things went wrong. Highlighting and learning from our failures is important, but it's not enough. We need to respond to incidents with an eagerness to learn through holistic approaches rather than oversimplifications. I want to help move our industry past that limiting, gut reaction of, "How do we stop this from ever happening again?"
How would you describe the persona and level of the target audience?
- Anyone who has ever been involved in an incident
- People who have struggled with diagnosing a bug only to end up asking, ‘how did this even work in the first place?’
- Leadership who experience incidents with high amounts of uncertainty (which is basically all of them).
- Anyone who thinks that the point of an incident investigation is to find out what caused it
What do you want this persona to walk away from your talk with?
I want people to come away with a new way to think about incidents and some topics to begin questioning at their organizations. They will:
- Know how to get rid of the templates and encourage investigations that people actually care to read.
- Know how to ask better questions rather than stopping at a 'root cause'.
- Know what conversations to start in their organizations to learn how work actually gets done.
- Be able to find ways to ensure that the pressure to learn outweighs the pressure to fix an incident.
What do you feel is the most important trend in software right now?
Coping with complexity, particularly as people want to 'automate everything'. This is a popular sentiment, and it needs to change. There is a strong desire to get to a world where we don't have to think about anything except the logic of our applications. Forget ‘serverless’, people want it to be 'thoughtless' and 'careless' too, and there’s this belief that adding more technology to our automation is the best way to do that.
We know we can't just tick a bunch of checkboxes to create and maintain a new feature. Why do we think we can do this with every bit of infrastructure and platform tech? We should be thinking about automation as a team player versus automation as a replacement for humans.
Everyone out there building a platform is struggling with how much expertise users of the platform really need in the underlying technologies so that they can step in when the automation has problems. Instead, we should be approaching it as a collaborative endeavour. We need to design systems that enhance how people and software interact during joint activity. This is referred to as a 'Joint Cognitive Systems' view and is part of the field of Resilience Engineering.