Presentation: UNBREAKABLE: Learning to Bend but Not Break at Netflix
What You’ll Learn
- Hear how chaos experiments complement other types of testing.
- Listen to some of the lessons Netflix learned through running chaos experiments.
- Find out how the data collected in preparation for chaos experiments can uncover code defects even before any experiments are run.
Abstract
How do you gain confidence that a system is behaving as designed and identify vulnerabilities before they become outages? You may have thought about using chaos engineering for this purpose, but it’s not always clear what that means or if it’s a good fit for your system and team.
My experience at Netflix has led me to embrace chaos engineering to build more resilient distributed systems. I will share examples of chaos experiments which identified problems and built confidence in our resilience mechanisms, as well as several challenges, lessons, and benefits we have encountered while scaling chaos engineering across Netflix.
Key Takeaways
- How chaos experiments complement other types of testing.
- How your perspective of chaos engineering changes as your role and services evolve.
- How chaos engineering can be used to gain confidence in platforms, configurations, and resiliency mechanisms.
- Lessons learned automating chaos experiments safely and effectively.
Tell me about your talk.
I’m going to share my personal journey at Netflix learning to build and operate distributed systems -- both as a service owner and as a chaos engineer. As a service owner, I’ll provide examples of how I used chaos engineering to build better systems, even for non-critical services. As a chaos engineer, I’ll cover some of the lessons I’ve learned while building better tooling for safe experimentation.
Can you give me an example of one of the lessons?
When running chaos experiments, we leverage a canary strategy. We have a control and an experiment cluster, and we monitor KPI data during the experiment so we can shut it off quickly if things go awry. We've been adding more KPIs so that we can watch different dimensions, and one of the challenges we’ve encountered is how to monitor low-volume metrics to get a reliable signal for shutoff. False positives create a lot of noise. We don’t want our users to get alert fatigue from unreliable results, so we have to find the right balance between erring on the side of caution and minimizing noise.
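To make that shutoff logic concrete, here is a minimal sketch of a canary-style kill switch. The metric names, thresholds, data model, and the volume check are all hypothetical illustrations of the idea described above, not Netflix’s actual tooling.

```python
"""Sketch of a canary-style kill switch for a chaos experiment.

Hypothetical illustration only: the KPI names, thresholds, and data model
are invented for this example.
"""
from dataclasses import dataclass


@dataclass
class KpiSample:
    name: str
    control: float       # KPI value from the control cluster
    experiment: float    # KPI value from the experiment (failure-injected) cluster
    request_count: int   # traffic volume behind the experiment measurement


def should_abort(samples: list[KpiSample],
                 max_relative_drop: float = 0.05,
                 min_requests: int = 500) -> bool:
    """Abort the experiment if any sufficiently high-volume KPI degrades.

    Low-volume metrics are skipped rather than compared directly, since a
    handful of requests can swing the ratio and trigger false positives.
    """
    for s in samples:
        if s.request_count < min_requests:
            # Not enough signal to trust; skipping reduces alert noise.
            continue
        if s.control <= 0:
            continue
        relative_drop = (s.control - s.experiment) / s.control
        if relative_drop > max_relative_drop:
            return True  # experiment cluster is measurably worse; shut it off
    return False


if __name__ == "__main__":
    samples = [
        KpiSample("playback_starts_per_sec", control=100.0, experiment=93.0, request_count=12_000),
        KpiSample("rare_endpoint_success_rate", control=0.99, experiment=0.80, request_count=40),
    ]
    print("abort experiment:", should_abort(samples))  # True: the first KPI dropped 7%
```

One design choice worth noting: the low-volume metric is ignored here rather than compared, which trades sensitivity for fewer false alarms; how to get a trustworthy signal from those metrics is exactly the open challenge mentioned above.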
Martin Fowler said once that there has to be a certain amount of organizational maturity to adopt microservices. Is it the same with chaos?
To do chaos experimentation correctly in production, I think that's probably true, because you don't want to put your customers too much in harm’s way. However, there's still value in running experiments in a staging or test environment without all of the bells and whistles. There is just a class of problems you're not going to find that way. I'm going to talk about that a little bit -- even at Netflix, I had a discussion with someone from the Spinnaker team who felt bummed because his team didn't run experiments in their production system. He said they weren’t really doing chaos. I said, if you're finding problems every time, why would you stop doing that? There's a lot of value in finding those types of problems in a staging environment if you can, and you really shouldn’t move to production as long as you’re still finding issues there. I think that's a common point of confusion for people.
You say that chaos complements other types of testing. Please, explain.
Service owners may ask, what is the value in running experiments if we're not in the critical code path? I would say that it provides a way to ensure you're not critical. We had a service that was not in the critical path, and we had tons of integration tests and unit tests around it -- yet we still had a case where a failure caused a customer outage. Running chaos experiments can uncover problems that tests miss, because production data and customer behavior differ from what the tests exercise. In that way, chaos complements traditional testing methods.
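A hedged sketch of what a “prove we’re not critical” check might look like as an automated experiment follows; the fault-injection helper, KPI reader, service names, and tolerance are all assumptions standing in for whatever tooling and metrics you actually have.

```python
"""Sketch of a 'prove we are non-critical' chaos test.

Everything here is hypothetical: `inject_failure`, `measure_kpi`, and the
service/KPI names are placeholders, not a real API.
"""
import contextlib


@contextlib.contextmanager
def inject_failure(dependency: str):
    """Pretend to make calls to `dependency` fail for the duration of the block."""
    print(f"injecting failure into {dependency}")
    try:
        yield
    finally:
        print(f"restoring {dependency}")


def measure_kpi(name: str) -> float:
    """Placeholder for reading a customer-facing KPI (e.g. stream-start success rate)."""
    return 0.998


def test_recommendations_service_is_not_critical():
    baseline = measure_kpi("stream_starts_success_rate")
    with inject_failure("recommendations-service"):
        during = measure_kpi("stream_starts_success_rate")
    # If the dependency truly sits off the critical path, the customer KPI
    # should stay within a small tolerance of the baseline while it is down.
    assert during >= baseline - 0.005, "supposedly non-critical dependency hurt customers"


if __name__ == "__main__":
    test_recommendations_service_is_not_critical()
```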
You have lessons on automating chaos experiments safely and effectively. You mentioned canaries -- what else?
When we started, one of the first things we needed was detailed insight into the services we wanted to experiment on. This included a list of dependencies for the service, how they were configured in terms of retries and timeouts, their latency characteristics, and so on. We had to decide what's safe to run experiments on versus what's not safe, and how we should design the experiments. Two lessons came out of that process. One -- it was relatively difficult for us to gather all that information in one place, so there is a lesson for platform owners to expose that data in a more consumable way. Two -- once we had the data, we realized the data itself would be valuable to service owners. Even before running any experiments, we exposed it in a UI and were able to flag problems to users so they could see -- hey, that looks wrong and we should fix it. We've had wins just out of providing that visibility. Running the automated experiments has uncovered vulnerabilities as well, but it was a pleasant surprise to find that value even outside of the experimentation.
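As an illustration of how that dependency data can flag problems before any experiment runs, here is a small, hypothetical audit over retry and timeout configuration. The data model and the specific rules are assumptions made for this sketch, not a description of Netflix’s internal systems.

```python
"""Sketch of a pre-experiment configuration audit.

The idea: dependency metadata gathered to plan chaos experiments can expose
likely defects on its own. The checks below are illustrative assumptions.
"""
from dataclasses import dataclass


@dataclass
class DependencyConfig:
    name: str
    timeout_ms: int
    retries: int
    observed_p99_ms: int   # latency actually seen in production


def audit(caller_deadline_ms: int, deps: list[DependencyConfig]) -> list[str]:
    findings = []
    for d in deps:
        worst_case = d.timeout_ms * (1 + d.retries)
        if worst_case > caller_deadline_ms:
            findings.append(
                f"{d.name}: timeout x retries ({worst_case}ms) exceeds the caller's "
                f"deadline ({caller_deadline_ms}ms); retries may never complete"
            )
        if d.timeout_ms < d.observed_p99_ms:
            findings.append(
                f"{d.name}: timeout ({d.timeout_ms}ms) is below observed p99 latency "
                f"({d.observed_p99_ms}ms); healthy calls will be cut off"
            )
    return findings


if __name__ == "__main__":
    deps = [
        DependencyConfig("ratings-service", timeout_ms=800, retries=2, observed_p99_ms=950),
        DependencyConfig("profile-service", timeout_ms=200, retries=1, observed_p99_ms=120),
    ]
    for finding in audit(caller_deadline_ms=1000, deps=deps):
        print(finding)
```

Surfacing findings like these in a UI is the kind of “win before the first experiment” described above: the visibility alone prompts owners to fix misconfigurations.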