Failure isn’t a question of if, but when. Embracing a habit of introducing chaos on a regular basis strengthens systems. In this track we’ll hear from experts who have designed systems that became increasingly resilient and reliable over time. Attendees will learn which architectural patterns and approaches did and didn’t work, with takeaways they can apply to their own systems. They will also hear how chaos engineering, disaster recovery testing, and other tools are being used to create highly resilient systems.
Track: Chaos & Resilience
Location: Majestic Complex, 6th fl.
Day of week:
Track Host: Tammy Butow
Tammy Butow is a Principal SRE at Gremlin where she works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using their control plane and API. Tammy previously led SRE teams at Dropbox responsible for Databases and Storage systems used by over 500 million customers. Prior to this Tammy worked at DigitalOcean and one of Australia's largest banks in Security Engineering, Product Engineering and Infrastructure Engineering.
Track Host Interview
- QCon: What is the Chaos & Resilience track about?
Tammy: The main goal of the Chaos & Resilience track is to share with everyone who's coming along the idea that it's not really a question of if failure will happen but when, and to present practices we can embrace to strengthen our systems so that when they do fail, it doesn't impact customers or all of the other services we have at our companies. The goal is to help everybody build more resilient systems by using different techniques, and we're going to share many of them throughout the day from a number of companies, including Netflix, Dropbox, Betterment, and Comcast.
Choose Your Own Adventure: Chaos Engineering
Chaos Engineering is described as "the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production". This is immensely beneficial when executed properly; however, all too often the road to cultural acceptance does not match our expectations as SREs, Chaos Engineers, and Productivity Engineers.
Choose Your Own Adventure is a series of children's gamebooks where each story is written from a second-person point of view, with the reader assuming the role of the protagonist and making choices that determine the main character's actions and the plot's outcome.
This presentation will play on the book series and go over different experiences on "Chaos Adventures", including both successes and failures in introducing Chaos to an organization. Chaos Engineering can lead to better development processes and procedures and better preparedness for outages. These benefits are available to any company willing to invest in more resilient and antifragile systems.
Chaos tools can positively influence the development process, and audience members will leave this talk with a game plan for bringing Chaos practices to their organization. The "Chaos Adventure" will look a little different for everyone, depending on the type of organization, its size, and how its teams communicate.
It Will Break
In the modern world, tech companies build their products on extremely reliable servers that never break. They’re stacked in racks with highly reliable switches whose firmware is rock solid and guaranteed to have no bugs. These switches talk to each other over super-low-latency networks that have close to zero packet loss. And the whole thing is located in a building with an infinite and redundant power supply. Just kidding, it’ll all break.
Companies can buy the most expensive, top-notch hardware and platinum support, and pick the best vendors in the industry, but sooner or later everyone realizes that everything fails. We’re going to talk about the inevitability of failure and how engineers can design their systems to tolerate it.
State of Chaos Engineering
“I don’t always test my resilience, but when I do, it’s at 3 a.m.”
“I don’t always test my resilience, but when I do, it’s in Prod.”
“I don’t always test my resilience, but when I do, it’s an outage!”
These were the days… the days before Chaos Engineering. More and more practitioners are on their way to discovering the benefits of Chaos Engineering. What started as an odd, bold, and even scary practice has been embraced by many in the pursuit of more nines. This talk examines the current state of Chaos Engineering, emerging patterns of success, and the future opportunity at hand.
Nonconformist Resilience: DB-Backed Job Queues
Resilience in the face of chaos is a tall order. As a vertically integrated financial institution where rapidly delivered features with complete data consistency and scrupulous correctness are all non-negotiable, Betterment had its work cut out for it. So we moved the goalposts - inward. By eliminating complexity that many teams consider table stakes, we’ve built a distributed software ecosystem that empowers engineers to do their best work with a minimum of high-wire distributed systems thinking.
One of the complexity-obliterating weapons in our arsenal is our approach to background work. I’ll present how we use, deploy, and even love Delayed::Job (yes, a database-backed job queue) at Betterment for its transactional enqueue semantics, safe retry with exponential backoff, and its storage model, which lends itself to simple but powerful SLA-based monitoring and alerting. DJ enables engineers to pour their creativity into their features and get resilience by default.
Drinking from the Elixir Fountain of Resilience
When talking about resiliency and Elixir, the Open Telecom Platform (OTP) is usually the main topic discussed. In this talk we will discuss other factors that make Elixir a strong match for fault tolerance and resiliency. Topics include ease of deployment, operations and monitoring, typespecs, and the BEAM's forgiving nature.