Presentation: Reasoning About Complex Distributed Systems
What You’ll Learn
- Gain tools and techniques that help you reason about distributed systems.
- Learn how to investigate root causes of malfunctions in distributed systems.
- Discuss ideas for teaching others on your team how to reason about distributed systems.
Abstract
One of the biggest challenges of working with distributed systems (even small ones with only 10 services) is maintaining them once they're live: triaging major issues and returning the system to health as quickly as possible. This creates a key need for a good developer experience with complex systems: minimizing the amount of time spent awake at 2am in order to achieve Return To Service. A good developer experience is founded on how the distributed system is built and on developing specific problem-solving strategies, for example, using technical tools (such as distributed tracing) strategically to understand how a system is currently behaving and to quickly identify what is misbehaving. This talk will cover the technical tools you need to gain information on a complex system and practical approaches to convert that information into an actual understanding of the system.
QCon: Tell us a bit about the talk that you're giving at QCon.
Erich: In this talk I'm going to go through tools and techniques I've come across that help me work with complex distributed systems: how to understand their behavior and how to reason through issues that are happening in a distributed system. If something's not functioning correctly, if you're not getting the expected behavior, how do you quickly determine the root cause and fix it by applying an understanding of how the system should behave? Using that understanding, you look at performance metrics, tracing data and logs, and run simple experiments that test the system. The way those tests behave tells us what is broken and what is not broken. This is meant to quickly narrow down which subset of the system is causing issues, and then from there determine what the root cause is and bring the system back to service.
The main goal here is to help developers understand the tools I've been using and develop their own tools that let them reason about complex systems, to do better architecture and design, and better triage and support.
QCon: Can you give me an example of an experiment that you'll be talking about?
Erich: At one of our companies, we had a system that contained 16 to 20 services interacting together to form the business logic for our customers, and there was a public-facing API that had about 10 functions on it. By calling those 10 functions and looking at how each one behaved, it was possible to diagnose exactly what was going wrong with the system, because each function hit a different set of services in a different way. If we were experiencing an outage or an error, or things were not behaving correctly, we could run through those 10 functions. By looking at each one of their results we could see the ones that were not behaving correctly, and use those to triangulate exactly which parts of the system were and weren't working. Within five minutes we could narrow it down to one or two services that could be misbehaving, or a specific piece of infrastructure, like a database, that must be misbehaving. That saved us a tremendous amount of time in terms of triage: we didn't have to look at logs, we didn't have to check the dashboards. We could just run those APIs in a matter of minutes. Then we would know where the error was happening and fix the actual broken service.
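As a rough sketch of the kind of probe Erich describes (the endpoint names, URL, and functions below are hypothetical stand-ins, not the actual system's API), a short script can exercise each public function and report which ones fail:

```python
import requests  # assumes the public API is plain HTTP; swap in whatever client you use

# Hypothetical public-facing API functions; real names and paths will differ.
BASE_URL = "https://api.example.internal"
API_FUNCTIONS = {
    "list_accounts": "/v1/accounts",
    "get_balance":   "/v1/accounts/123/balance",
    "list_orders":   "/v1/orders",
    "create_quote":  "/v1/quotes/preview",
    "get_catalog":   "/v1/catalog",
}

def probe(name: str, path: str) -> bool:
    """Call one public API function and report whether it responds as expected."""
    try:
        resp = requests.get(BASE_URL + path, timeout=5)
        ok = resp.status_code == 200
    except requests.RequestException:
        ok = False
    print(f"{name:15s} {'OK' if ok else 'FAIL'}")
    return ok

if __name__ == "__main__":
    results = {name: probe(name, path) for name, path in API_FUNCTIONS.items()}
    failing = [name for name, ok in results.items() if not ok]
    print("Failing functions:", failing or "none")
```

Running this during an incident gives a quick pass/fail picture of the public surface, which is the input for narrowing down which services are misbehaving.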
QCon: Distributed systems vary a lot. How do you help people reason about their system to diagnose their problems?
Erich: The first is talking to the other teams or the other developers. Usually there's a small team for each subsystem in the distributed system, and those developers know what their system's role is and how it's expected to behave. The second is setting up infrastructure, such as log aggregators, to collect data that traces requests as they move across the different subsystems from team to team. The third tool is a social one: knowing who to talk to that knows a specific subsystem really well, so you understand how your system works, how it interacts with other systems and what to expect from it. That will allow you to set up experiments on your system, thinking, "if I make this call against my system, it will hit these specific services and this database, and it is going to make this call to the others." That will let you know which services or databases are broken.
And you can say your system is one part of the set of systems that must be having issues. If you have enough calls that overlap, you can see which calls are working. The calls that work tell you which services must be working correctly, and the calls that aren't working tell you which ones may be broken. If you overlap them, you can start cutting out the ones you've already verified through some other path and reduce it down to just a small set of things that may be broken. If they're in your system you can look at them directly; if not, you can reach out to the person that knows that subsystem really well and get them involved. They can tell you what that behavior might mean on their side, and then you can get to the root cause.
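One way to sketch that elimination step in code (the service names and the call-to-dependency mapping are hypothetical; in practice they come from tracing data or from the teams that own each subsystem): take the union of everything the failing calls depend on and subtract everything a passing call has just proven healthy.

```python
# Hypothetical map from each public API function to the services/infrastructure it touches.
DEPENDENCIES = {
    "list_accounts": {"gateway", "account-svc", "accounts-db"},
    "get_balance":   {"gateway", "account-svc", "ledger-svc", "ledger-db"},
    "list_orders":   {"gateway", "order-svc", "orders-db"},
    "create_quote":  {"gateway", "order-svc", "pricing-svc"},
    "get_catalog":   {"gateway", "catalog-svc", "catalog-db"},
}

def suspects(results: dict[str, bool]) -> set[str]:
    """Union of dependencies behind failing calls, minus anything a passing call exercised."""
    failing = set().union(*(DEPENDENCIES[f] for f, ok in results.items() if not ok))
    passing = set().union(*(DEPENDENCIES[f] for f, ok in results.items() if ok))
    return failing - passing

# Example: get_balance and create_quote fail while everything else works,
# which narrows the suspects to ledger-svc, ledger-db and pricing-svc.
print(suspects({
    "list_accounts": True, "get_balance": False, "list_orders": True,
    "create_quote": False, "get_catalog": True,
}))
```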
QCon: Do you discuss metrics and how to interpret them?
Erich: Yes. A lot of this comes down to contracts between systems: if I send you this data, I expect this response; if I make this request, I expect this response. SLAs come into play here. Also, metrics that tell you what the performance is, and then logs and error messages.
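A minimal sketch of what one of those contract checks might look like (the field names and the 200 ms latency budget are invented placeholders for whatever your SLA and API contract actually specify):

```python
import time
import requests  # assuming an HTTP API; use whatever client your systems expose

# Hypothetical contract: respond within the latency budget and return these typed fields.
LATENCY_BUDGET_S = 0.200
EXPECTED_FIELDS = {"order_id": str, "status": str, "total_cents": int}

def check_contract(url: str) -> list[str]:
    """Return a list of contract violations for one request/response pair."""
    violations = []
    start = time.monotonic()
    resp = requests.get(url, timeout=5)
    elapsed = time.monotonic() - start

    if elapsed > LATENCY_BUDGET_S:
        violations.append(f"latency {elapsed:.3f}s exceeds {LATENCY_BUDGET_S}s budget")
    if resp.status_code != 200:
        violations.append(f"expected HTTP 200, got {resp.status_code}")
        return violations

    body = resp.json()
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in body:
            violations.append(f"missing field {field!r}")
        elif not isinstance(body[field], expected_type):
            violations.append(
                f"field {field!r} is {type(body[field]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return violations
```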
QCon: Who are you talking to?
Erich: Technical leads and architects: people who would use this as knowledge to teach junior engineers how to think about this stuff, and people who help design and improve the architecture of their systems, providing the tools and metrics that let their teams collect data and make it easier to reason about what the system's current behavior is.
QCon: What do you want a tech leader who comes to your talk to leave with?
Erich: I want them to leave with a set of tools that help them work with and understand their complex systems, and ideas to teach the people on their team so they can become better engineers at working with distributed systems.
Similar Talks
Scaling DB Access for Billions of Queries Per Day @PayPal
Software Engineer @PayPal
Petrica Voicu
Psychologically Safe Process Evolution in a Flat Structure
Director of Software Development @Hunter_Ind
Christopher Lucian
PID Loops and the Art of Keeping Systems Stable
Senior Principal Engineer @awscloud
Colm MacCárthaigh
Are We Really Cloud-Native?
Director of Technology @Luminis_eu
Bert Ertman
The Trouble With Learning in Complex Systems
Senior Cloud Advocate @Microsoft
Jason Hand
How Did Things Go Right? Learning More From Incidents
Site Reliability Engineering @Netflix
Ryan Kitchens
What Breaks Our Systems: A Taxonomy of Black Swans
Site Reliability Engineer @Slack, Contributor to Seeking SRE, & SRECon Steering Committee
Laura Nolan
Cultivating High-Performing Teams in Hypergrowth
Chief Scientist @n26
Patrick Kua
Inside Job: How to Build Great Teams Within a Legacy Organization?
Engineering Director @Meetup