Presentation: Choose Your Own Adventure: Chaos Engineering
What You’ll Learn
- Learn what Chaos Engineering is and how Netflix uses it.
- Discover how to introduce Chaos Engineering to your organization.
- Discuss how to present Chaos Engineering convincingly to upper management.
Abstract
Chaos Engineering is described as "the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production". This is immensely beneficial when executed properly; however, all too often the road to cultural acceptance does not match our expectations as SREs, Chaos Engineers, and Productivity Engineers.
Choose Your Own Adventure is a series of children's gamebooks where each story is written from a second-person point of view, with the reader assuming the role of the protagonist and making choices that determine the main character's actions and the plot's outcome.
This presentation will play on the book series and go over different experiences on "Chaos Adventures", including both successes and failures in introducing Chaos to an organization. Chaos Engineering can lead to better development processes and procedures and better preparedness for outages. These benefits are available to any company willing to invest in more resilient and antifragile systems.
Chaos tools can positively influence the development process, and audience members will leave this talk with a game plan for bringing Chaos practices to their organization. The "Chaos Adventure" will look a little different for everyone, depending on the type of organization, the size of the organization, and inter-team communication.
QCon: What's the work you're focused on at Netflix today?
Nora: I'm on the Chaos team at Netflix. Our mission is to make sure that the system can withstand the turbulent conditions that happen in production on a regular basis, and that it's resilient enough to do that. We use Chaos Engineering, which involves purposefully injecting failure into the system, running experiments at the different injection points you can create between your services, and working with different microservice teams to do that. Our goal is to reveal failures before they become large-scale failures.
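To make the idea of an injection point between services concrete, here is a minimal sketch in Python. It is not Netflix's tooling; every name in it (`with_failure_injection`, `fetch_recommendations`, `homepage`, `run_experiment`) is hypothetical and chosen only for illustration. It wraps a downstream call so that a configurable fraction of invocations fail on purpose, then counts how often the caller falls back gracefully.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real downstream error during an experiment."""


def with_failure_injection(call, failure_rate, rng):
    """Wrap a service call so that a fraction of invocations fail on purpose."""
    def wrapped(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so a rate of 0.0 never fails
        # and a rate of 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFailure("chaos experiment: simulated downstream failure")
        return call(*args, **kwargs)
    return wrapped


def fetch_recommendations(user_id):
    """Hypothetical downstream service call (assumption, for illustration)."""
    return [f"title-{user_id}-{i}" for i in range(3)]


FALLBACK = ["popular-1", "popular-2"]


def homepage(user_id, fetch):
    """Caller under test: serves a static fallback when the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        return FALLBACK


def run_experiment(trials=1000, failure_rate=0.2, seed=7):
    """Run the caller against the flaky dependency; return the fallback count."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = with_failure_injection(fetch_recommendations, failure_rate, rng)
    fallbacks = 0
    for user_id in range(trials):
        if homepage(user_id, flaky_fetch) == FALLBACK:
            fallbacks += 1
    return fallbacks
```

In this sketch the experiment passes if every injected failure ends in a fallback rather than an unhandled exception. A real experiment would also watch user-facing metrics while the failure is being injected.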
QCon: Your talk is titled "Choose Your Own Adventure: Chaos Engineering." What does that mean?
Nora: From my experience so far, I have found that there is no one solution for chaos. There is no precise process that you can follow step by step, and there are many factors to weigh when designing chaos solutions for your teams. Culture is a big factor. Getting social and cultural acceptance around chaos engineering is so important. And it's different with every organization, whether it's a startup like Jet.com or a massive organization like Netflix. Based on the kinds of issues that occur in each of those organizations, there are different routes of chaos that I recommend choosing. I'll go through a "choose your own adventure" story with the audience, where different scenarios will come up and we'll have to pick a path to go down and see what happens based on that.
QCon: Can you give me an example of one of these paths?
Nora: Sure. Say, for example, that you were having a lot of issues with Kafka. Your organization relies on Kafka on a pretty regular basis. All of a sudden, topics were getting overloaded: there were too many services writing to the same topic at the same time, or reading from the same topic at the same time. How do you handle that? How do you control the chaos in that? One way to do that would be to arbitrarily increase reads or writes on topics on a semi-regular, semi-random basis, and see if your system can handle that. Many times, microservice architectures get so big that you don't even realize you have a ton of different services listening to the same topic. That could be one kind of chaos introduced with Kafka. There are a few other ways that you can handle that too. How you decide to handle it could reveal the actual problem, or it could reveal different problems in the system as well.
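The semi-random load increase described here can be sketched without a real broker. The Python below is only a simulation: `SimulatedTopic` is an in-memory stand-in, not a Kafka client, and every name and parameter (`capacity_per_tick`, `surge_probability`, `surge_multiplier`) is a hypothetical chosen for illustration. On each tick the experiment sends a baseline number of writes, occasionally multiplies that load, and records how many writes the topic could not absorb.

```python
import random


class SimulatedTopic:
    """In-memory stand-in for a message topic with a fixed per-tick capacity."""

    def __init__(self, capacity_per_tick):
        self.capacity_per_tick = capacity_per_tick
        self.accepted = 0
        self.dropped = 0

    def tick(self, writes):
        """Accept writes up to capacity for this tick; count the rest as dropped."""
        accepted = min(writes, self.capacity_per_tick)
        self.accepted += accepted
        self.dropped += writes - accepted


def run_load_experiment(baseline_writes, capacity_per_tick, ticks=100,
                        surge_probability=0.1, surge_multiplier=5, seed=42):
    """Drive a topic with baseline load plus semi-random surges."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    topic = SimulatedTopic(capacity_per_tick)
    surges = 0
    for _ in range(ticks):
        writes = baseline_writes
        # Semi-random surge: on some ticks, multiply the write load.
        if rng.random() < surge_probability:
            writes *= surge_multiplier
            surges += 1
        topic.tick(writes)
    return {"surges": surges, "accepted": topic.accepted, "dropped": topic.dropped}
```

Against a real system, the dropped count would be replaced by whatever signal shows the system struggling, such as consumer lag or error rates. That choice depends on the organization and is left open here.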
QCon: Who's the main audience persona you're addressing?
Nora: I would say that engineers, managers, and PMs can all take something away from this conversation. I have found in my experience that getting managers and PMs to understand what chaos engineering is, and what its goal is, is so important for the engineer who's actually doing it. I try to tailor it to both audiences, so it's a mixture of both the business side and the technical side.
QCon: Will your talk give engineers the information they need to convince their managers on this?
Nora: Yes. And I'll speak from firsthand experience.
QCon: What do you want someone who comes to your talk to walk away with?
Nora: I would like people to come away with a cultural and technical plan to introduce chaos to their organization, an introduction to a language for building a failure injection library, and some real-life examples of actually bringing chaos to an organization and testing several different parts of a distributed system, from queues to databases to regional failures and beyond.