Presentation: State of Chaos Engineering
What You’ll Learn
- Learn how Chaos Engineering can make your systems resilient.
- Understand how to roll out Chaos Engineering in your organisation.
- Discover the benefits of resilient system design.
Abstract
“I don’t always test my resilience, but when I do, it’s at 3 a.m.”
“I don’t always test my resilience, but when I do, it’s in Prod.”
“I don’t always test my resilience, but when I do, it’s an outage!”
These were the days… the days before Chaos Engineering. More and more practitioners are on their way to discovering the benefits of Chaos Engineering. What started as an odd, bold, and even scary practice has been embraced by many in the pursuit of more nines. This talk examines the current state of Chaos Engineering, emerging patterns of success, and the future opportunity at hand.
QCon: In December 2012 you were on call at Netflix during an AWS outage, which means you've been working in this space for at least five years. It sounds like you've been doing Chaos Engineering for a while?
Bruce: Yeah, I was doing Chaos Engineering before the term was first used. I was hired by the guy who wrote Chaos Monkey, and was introduced to it in my first week.
QCon: What's your role at Twilio?
Bruce: I lead a team called Insight Engineering. We're doing a combination of telemetry solutions paired with Chaos Engineering, and we have just started formalising the rollout of Chaos Engineering. Different teams have approached me to talk about how to get started: how do we see and observe failure in higher resolution and faster, using distributed tracing, time series metrics, and so on.
QCon: What's the goal for your talk?
Bruce: The goal for my talk is to make Chaos Engineering less scary and more accessible for everyone. Having launched Chaos Engineering at two different organizations has taught me that what worked at one doesn't necessarily work at the other. I've been thinking about the commonalities between them, what really resonates, and the different approaches to rolling this out. The key takeaway is that it's not just for Netflix: you can roll it out anywhere, and it's not as scary as you think it is.
QCon: What do you want attendees to leave your talk with?
Bruce: My hope is that a developer attending my talk leaves with a sense of hope that their system can be resilient and that they can stop being woken at 3:00 in the morning. I also hope they're inspired to help and share. There's a lot of room for this community and effort across the industry to grow.