Presentation: State of Chaos Engineering

Track: Chaos & Resilience

Location: Majestic Complex, 6th fl.

Day of week:

Slides: Download Slides

Level: Intermediate

Persona: Developer, DevOps Engineer

What You’ll Learn

  • Learn how Chaos Engineering can make your systems resilient.
  • Understand how to roll out Chaos Engineering in your organisation.
  • Discover the benefits of resilient system design.

Abstract

“I don’t always test my resilience, but when I do, it’s at 3 a.m.”

“I don’t always test my resilience, but when I do, it’s in Prod.”

“I don’t always test my resilience, but when I do, it’s an outage!”

These were the days… the days before Chaos Engineering. More and more practitioners are on their way to discovering the benefits of Chaos Engineering. What started as an odd, bold, and even scary practice has been embraced by many in the pursuit of more nines. This talk examines the current state of Chaos Engineering, emerging patterns of success, and the future opportunity at hand.

Question: 

QCon: In December 2012 you were on call during an AWS outage at Netflix, which means you've been working in this space for at least five years. It sounds like you've been doing Chaos Engineering for a while?

Answer: 

Bruce: Yeah, I was doing Chaos Engineering before the term was first used. I was hired by the guy who wrote Chaos Monkey, and was introduced to it in my first week.

Question: 

QCon: What's your role at Twilio?

Answer: 

Bruce: I lead a team called Insight Engineering. We're doing a combination of telemetry solutions paired with Chaos Engineering, and we have just started formalising the rollout of Chaos Engineering. Different teams have approached me to talk about how to get started: how do we see and observe failure at higher resolution and in faster systems, using distributed tracing, time series metrics, and so on.

Question: 

QCon: What's the goal for your talk?

Answer: 

Bruce: The goal for my talk is to make Chaos Engineering less scary and more accessible for everyone. Having launched Chaos Engineering at two different organizations has taught me that what worked at one doesn't work at the other. I've been thinking about what the commonalities between them are, what really resonates, and the different approaches to rolling this out. The key takeaway is that it's not just for Netflix: you can roll it out anywhere, and it's not as scary as you think it is.

Question: 

QCon: What do you want attendees to leave your talk with?

Answer: 

Bruce: My hope is that a developer attending my talk leaves with a sense of hope that their system can be resilient and that they can stop being woken at 3:00 in the morning, and is also inspired to help and share. There's a lot of room for this community and effort to grow across the industry.

Speaker: Bruce Wong

R&D Leadership at @Twilio

Bruce Wong is Senior Engineering Manager leading Insight Engineering at Twilio. He formerly resided at Netflix, where he founded the Chaos Engineering team to stress and proactively introduce failure into critical production systems to validate resilience. He is passionate about tackling challenging problems, scaling engineering teams, and building compelling products. In his spare time he can be found applying engineering principles to iterate on BBQ and chocolate chip cookies.
