Presentation: UNBREAKABLE: Learning to Bend but Not Break at Netflix

Track: Chaos, Complexity, and Resilience

Location: Broadway Ballroom South, 6th fl.

Time: 2:55pm - 3:45pm

Level: Intermediate

Persona: Architect, Developer

What You’ll Learn

  1. Hear how chaos experiments complement other types of testing.

  2. Listen to some of the lessons Netflix learned through running chaos experiments.

  3. Find out how the data gathered in preparation for chaos experiments can uncover defects even before any experiments are run.

Abstract

How do you gain confidence that a system is behaving as designed and identify vulnerabilities before they become outages? You may have thought about using chaos engineering for this purpose, but it’s not always clear what that means or if it’s a good fit for your system and team.

My experience at Netflix has led me to embrace chaos engineering to build more resilient distributed systems. I will share examples of chaos experiments which identified problems and built confidence in our resilience mechanisms, as well as several challenges, lessons, and benefits we have encountered while scaling chaos engineering across Netflix.

Key Takeaways

  1. How chaos experiments complement other types of testing.
  2. How your perspective on chaos engineering changes as your role and services evolve.
  3. How chaos engineering can be used to gain confidence in platforms, configurations, and resiliency mechanisms.
  4. Lessons learned from automating chaos experiments safely and effectively.

Question:

Tell me about your talk.

Answer: 

I’m going to share my personal journey at Netflix learning to build and operate distributed systems -- both as a service owner and as a chaos engineer. As a service owner, I’ll provide examples of how I used chaos engineering to build better systems, even for non-critical services. As a chaos engineer, I’ll cover some of the lessons I’ve learned while building better tooling for safe experimentation.

Question: 

Can you give me an example of one of the lessons?

Answer: 

When running chaos experiments, we leverage a canary strategy: we have a control cluster and an experiment cluster, and we monitor KPI data during the experiment so we can shut it off quickly if things go awry. We've been adding more KPIs so that we can watch different dimensions, and one of the challenges we’ve encountered is how to monitor low-volume metrics and still get a reliable shut-off signal. False positives create a lot of noise, and we don’t want our users to get alert fatigue from unreliable results, so we have to find the right balance between erring on the side of caution and minimizing noise.
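
Here is a minimal sketch, in Python, of what such a KPI-driven kill switch might look like. The hooks (`fetch_kpi`, `abort_experiment`), the KPI names, and the thresholds are all hypothetical illustrations rather than Netflix's actual tooling; the sample-volume floor reflects the low-volume-metric problem described above.

```python
import time

# Illustrative KPIs and thresholds -- assumptions, not Netflix's real values.
KPIS = ["playback_starts_per_sec", "error_rate"]
MAX_RELATIVE_DEVIATION = 0.05   # abort if experiment deviates >5% from control
MIN_SAMPLE_VOLUME = 100         # below this, a metric is too noisy to act on

def kpi_deviates(control: float, experiment: float) -> bool:
    """True if the experiment cluster deviates too far from the control cluster."""
    if control == 0:
        return experiment > 0
    return abs(experiment - control) / control > MAX_RELATIVE_DEVIATION

def monitor(fetch_kpi, abort_experiment, poll_seconds: int = 30) -> None:
    """Poll control/experiment KPIs and shut the experiment off on deviation.

    fetch_kpi(name) -> (control_value, experiment_value, sample_volume)
    abort_experiment(reason=...) stops the experiment.
    Both are hypothetical hooks into an experimentation platform.
    """
    while True:
        for kpi in KPIS:
            control, experiment, volume = fetch_kpi(kpi)
            # Skip shut-off decisions on low-volume metrics: with too few
            # samples, apparent deviation is mostly noise (false positives).
            if volume < MIN_SAMPLE_VOLUME:
                continue
            if kpi_deviates(control, experiment):
                abort_experiment(reason=f"{kpi} deviated beyond threshold")
                return
        time.sleep(poll_seconds)
```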

Question: 

Martin Fowler once said that there has to be a certain amount of organizational maturity to adopt microservices. Is it the same with chaos?

Answer: 

To do chaos experimentation correctly in production, I think that's probably true, because you don't want to put your customers too much in harm’s way. However, there's still value in running experiments in a staging or test environment without all of the bells and whistles; there is just a class of problems you're not going to find that way. I'm going to talk about that a little bit. Even at Netflix, I had a discussion with someone from the Spinnaker team who felt bummed because he didn't run experiments in their production system -- he said they weren’t really doing chaos. I said, if you're finding problems every time, why would you stop doing that? There's a lot of value in finding those types of problems in a staging environment if you can, and you really shouldn’t move to production as long as you’re still finding issues there. I think that's a common point of confusion for people.

Question: 

You say that chaos complements other types of testing. Please explain.

Answer: 

Service owners may ask: what is the value in running experiments if we're not in the critical code path? I would say that it provides a way to ensure you're not critical. We had a service that was not in the critical path, and we had tons of integration tests and unit tests around it -- yet we still had a case where a failure caused a customer outage. Running chaos experiments can uncover problems that tests miss because of differences in production data and customer behavior. In that way, chaos complements traditional testing methods.
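
As an illustration of that point, here is a hedged Python sketch of the kind of check a chaos experiment performs that traditional tests often miss: inject a dependency failure and assert the customer-facing request still succeeds. `render_homepage`, the client, and the fallback content are hypothetical stand-ins, not an actual Netflix service.

```python
# Verifying a service stays out of the critical path by injecting a
# dependency failure and asserting the customer request still succeeds.

def render_homepage(recommendations_client):
    """Degrade gracefully: fall back to a default row if recommendations fail."""
    try:
        rows = recommendations_client.fetch()
    except Exception:
        rows = ["popular-titles"]  # static fallback keeps the page usable
    return {"status": 200, "rows": rows}

class FailingClient:
    """Chaos stand-in that simulates the dependency being down."""
    def fetch(self):
        raise TimeoutError("injected failure")

def test_homepage_survives_recommendations_outage():
    response = render_homepage(FailingClient())
    assert response["status"] == 200   # customer request still succeeds
    assert response["rows"]            # fallback content is served
```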

Question: 

You have lessons on automating chaos experiments safely and effectively. You mentioned canaries -- what else?

Answer: 

When we started, one of the first things we needed was detailed insight into the services we wanted to experiment on. This included a list of each service's dependencies, how they were configured in terms of retries and timeouts, latency characteristics, and so on. We had to decide what was safe to run experiments on versus what was not, and how we should design the experiments. Two lessons came out of that process. One -- it was relatively difficult for us to gather all that information in one place, so there is a lesson for platform owners to expose that data in a more consumable way. Two -- once we had the data, we realized the data itself would be valuable to service owners. Even before running any experiments, we exposed it in a UI and were able to flag problems to users so they could see -- hey, that looks wrong and we should fix it. We've had wins just out of providing that visibility. Running the automated experiments has uncovered vulnerabilities as well, but it was a pleasant surprise to find that value even outside of the experimentation.
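
As a sketch of the kind of pre-experiment check that visibility enables, the Python below flags a dependency whose worst-case retry-and-timeout budget exceeds its caller's own timeout -- a classic misconfiguration. The data model and numbers are illustrative assumptions, not Netflix's actual insight tooling.

```python
from dataclasses import dataclass

@dataclass
class DependencyConfig:
    name: str
    timeout_ms: int   # per-attempt timeout on calls to this dependency
    retries: int      # additional attempts after the first

def audit(service_timeout_ms: int, deps: list[DependencyConfig]) -> list[str]:
    """Flag dependencies whose worst-case latency exceeds the caller's budget."""
    findings = []
    for dep in deps:
        worst_case_ms = dep.timeout_ms * (1 + dep.retries)
        if worst_case_ms > service_timeout_ms:
            findings.append(
                f"{dep.name}: worst case {worst_case_ms}ms exceeds the "
                f"service's {service_timeout_ms}ms budget; callers may "
                f"time out before retries complete"
            )
    return findings

# Example: a 2s service budget with a dependency retried 3x at 800ms each
print(audit(2000, [DependencyConfig("ratings", timeout_ms=800, retries=3)]))
```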

Speaker: Haley Tucker

Senior Software Engineer, Chaos Engineering @Netflix

Haley Tucker is a member of the Chaos Engineering team at Netflix, where she is responsible for verifying the resiliency of Netflix services to ensure that customers can always enjoy their favorite shows. Prior to that, she worked on the Playback Features team, where she was responsible for ensuring customers receive the best possible viewing experience every time they click play. Her services played a key role in enabling Netflix to stream amazing content to more than 118M members on thousands of devices worldwide. Before Netflix, Haley spent a few years building near-real-time command and control systems at Raytheon. She then moved into a consulting role, where she built custom billing and payment solutions for cloud and telephony service providers by integrating Java applications with Oracle systems. Haley enjoys applying new technologies to develop robust and maintainable systems, and the scale at Netflix has been a unique and exciting challenge. Haley received a BS in Computer Science from Texas A&M University.
