Presentation: PID Loops and the Art of Keeping Systems Stable
This presentation is now available to view on InfoQ.com
What You’ll Learn
- Find out why and how AWS is using PID Loops.
- Learn how to verify and enforce system stability with PID Loops.
Abstract
Building ultra-reliable large-scale services is an incredible challenge. Systems often exhibit emergent properties and network effects that can be beyond the practical limits of testing. How do we keep things stable even when the unpredictable happens? Control theory, a branch of engineering that has existed for over a hundred years, has a lot to offer us. Systems of all sizes can be analyzed and stabilized with PID control loops: often simple algorithms that contain Proportional, Integral, and Derivative components. But how? This session will show what PID loops look like in the context of modern systems, and how exponential backoff, flow control, and other techniques can be wielded to build self-healing systems.
Tell us a bit about some of the stuff that you've worked on.
I've been working at Amazon Web Services for eleven years and I've got to work on a lot of activities. Right now I work on EC2, but also I've got to work on platforms and Route 53, S3, ELB, and a few more in between.
What can a software engineer learn about PID Loops?
I think probably the biggest thing to learn about PID Loops is the loop part: that we can build stable systems by measuring those systems, seeing what state they're in, and then driving them to the state we want them to be in. Taking that approach, measuring things first and then applying any corrections we need, turns out to be an incisive, deeply powerful way to build systems, and it's not intuitive.
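To make the measure, compare, correct cycle concrete, here is a minimal sketch of a discrete PID controller in Python. It is not taken from the talk: the class name `PID`, the gains, and the toy system in `simulate` (a level that simply moves by the correction applied to it) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    """Textbook discrete PID controller (illustrative, not from the talk)."""
    kp: float  # proportional gain: reacts to the current error
    ki: float  # integral gain: reacts to accumulated past error
    kd: float  # derivative gain: reacts to how fast the error is changing
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        # Measure: how far is the system from where we want it to be?
        error = setpoint - measured
        self.integral += error * dt
        # No derivative term on the very first sample (no previous error yet).
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        # Correct: combine the three components into one control output.
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 500, dt: float = 0.1, setpoint: float = 100.0) -> float:
    """Drive a hypothetical 'level' toward the setpoint and return where it ends up."""
    pid = PID(kp=0.8, ki=0.4, kd=0.05)  # assumed gains, picked by hand
    level = 0.0
    for _ in range(steps):
        correction = pid.update(setpoint, level, dt)
        level += correction * dt  # toy system: the level moves by the correction
    return level


if __name__ == "__main__":
    print(f"final level: {simulate():.3f}")
```

The point of the sketch is the shape of the loop rather than the particular gains: every iteration starts from a fresh measurement, so an unexpected disturbance shows up as error on the next pass and gets corrected.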
Do you have to be massive scale to be able to use a PID loop effectively?
It works even for very simple systems: a system of one or two boxes, where you're just trying to get some very simple configuration data onto that box, user settings or something like that. Nine times out of ten, people have solved that problem by just sending the settings to that box, and it'll work most of the time, but occasionally the settings won't get there because maybe there's a network problem or a system crash or something. Even in a very simple case like that, a controller with a loop will fix it. It will detect that the box is not the way it should be and repair it.
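The settings example can be sketched as a small reconciliation loop. Everything here is hypothetical: `push_settings` stands in for an unreliable send (it silently drops the update at an assumed failure rate), and the box is modeled as a plain dictionary. The sketch only contrasts "send once and hope" with "measure, then repair until the box matches".

```python
import random


def push_settings(box: dict, desired: dict, rng: random.Random,
                  failure_rate: float = 0.5) -> None:
    """Hypothetical unreliable push: sometimes the update never arrives."""
    if rng.random() < failure_rate:
        return  # simulated network problem or crash; the box is left unchanged
    box.clear()
    box.update(desired)


def reconcile(box: dict, desired: dict, rng: random.Random,
              max_rounds: int = 50) -> int:
    """Loop until the box's observed state matches the desired state.

    Returns the number of pushes that were attempted.
    """
    pushes = 0
    for _ in range(max_rounds):
        # Measure first: only act if the observed state is wrong.
        if box == desired:
            return pushes
        push_settings(box, desired, rng)
        pushes += 1
    if box == desired:
        return pushes
    raise RuntimeError("box did not converge within max_rounds")


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so the run is repeatable
    box = {"timeout": 5}
    desired = {"timeout": 30, "region": "eu"}
    attempts = reconcile(box, desired, rng)
    print(f"converged after {attempts} push(es): {box}")
```

A one-shot sender has no way to notice a dropped update. The loop notices because it checks the box's actual state on every pass, and it does nothing at all once the box is already correct.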
What do you want an individual contributor or architect to leave your talk with?
To be able to walk away and look at control systems that distribute settings or configuration, and just be able to tell whether they're stable or likely to be stable.