Presentation: PID Loops and the Art of Keeping Systems Stable

Track: Modern CS in the Real World

Location: Broadway Ballroom South, 6th fl.

Duration: 1:40pm - 2:30pm




What You’ll Learn

  1. Find out why and how AWS is using PID Loops.
  2. Learn how to verify and enforce system stability with PID Loops.


Building ultra-reliable large-scale services is an incredible challenge. Systems often exhibit emergent properties and network effects that can be beyond the practical limits of testing. How do we keep things stable even when the unpredictable happens? Control theory, a branch of engineering that has existed for over a hundred years, has a lot to offer us. Systems of all sizes can be analyzed and stabilized with PID control loops, often simple algorithms that contain Proportional, Integral, and Derivative components. But how? This session will show what PID loops look like in the context of modern systems, and how exponential backoff, flow control, and other techniques can be wielded to build self-healing systems.
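
To make the three components concrete, here is a minimal sketch of a textbook discrete PID controller in Python. It is not taken from the talk: the class name, the gains, and the toy system being controlled (a value that simply moves by whatever the controller outputs) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDController:
    """Textbook discrete PID controller (illustrative, not from the talk)."""
    kp: float  # proportional gain: reacts to the current error
    ki: float  # integral gain: reacts to accumulated past error
    kd: float  # derivative gain: reacts to how fast the error is changing
    integral: float = 0.0
    previous_error: Optional[float] = None

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        """Return the correction to apply for one measurement."""
        error = setpoint - measured
        self.integral += error * dt
        if self.previous_error is None:
            derivative = 0.0  # no history yet on the first sample
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(target: float = 10.0, steps: int = 600, dt: float = 0.1) -> float:
    """Drive a toy system toward `target`; gains are assumed, not tuned."""
    controller = PIDController(kp=2.0, ki=0.5, kd=0.1)
    value = 0.0
    for _ in range(steps):
        # Measure, compute a correction, apply it, repeat.
        correction = controller.update(target, value, dt)
        value += correction * dt
    return value


if __name__ == "__main__":
    print(f"final value: {simulate():.3f}")
```

With these assumed gains the toy system overshoots slightly and then settles on the target. Real systems need gains tuned to their own dynamics, and poorly chosen gains can make a loop oscillate or diverge.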


Tell us a bit about some of the stuff that you've worked on.


I've been working at Amazon Web Services for eleven years and I've got to work on a lot of activities. Right now I work on EC2, but also I've got to work on platforms and Route 53, S3, ELB, and a few more in between.


What can a software engineer learn about PID Loops?


I think probably the biggest thing to learn about PID Loops is the loop part: that we can build stable systems by measuring those systems, seeing what state they're in, and then driving them to the state we want them to be in. Taking that approach, measuring things first and then applying any corrections we need, turns out to be an incisive, deeply powerful way to build systems, and one that's not intuitive.


Do you have to be massive scale to be able to use a PID loop effectively?


It works even for very simple systems: a system of one or two boxes, where you're just trying to get some very simple configuration data to that box, user settings or something like that. Nine times out of ten, people have solved that problem by just sending the settings to that box, and it'll work most of the time, but occasionally the settings won't get there because maybe there's a network problem or a system crash or something. Even in a very simple case like that, a controller with a loop will fix it. It will detect that the box is not the way it should be and repair it.
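
The settings example can be sketched as a small reconciliation loop. This is a minimal illustration under stated assumptions, not the speaker's implementation: `Box`, `reconcile_once`, and `reconcile` are hypothetical names, and the "network problem" is simulated by a box that silently drops its first few writes.

```python
class Box:
    """Hypothetical remote host that drops the first `failures` writes."""

    def __init__(self, failures: int = 0):
        self.settings = {}
        self.failures = failures

    def read(self) -> dict:
        return dict(self.settings)

    def write(self, key, value) -> bool:
        if self.failures > 0:
            self.failures -= 1  # simulate a lost message or a crash
            return False
        self.settings[key] = value
        return True


def reconcile_once(desired: dict, box: Box) -> int:
    """Measure the box, push any settings that differ, return the drift count."""
    actual = box.read()
    drift = {k: v for k, v in desired.items() if actual.get(k) != v}
    for key, value in drift.items():
        box.write(key, value)  # may fail; the next pass will notice
    return len(drift)


def reconcile(desired: dict, box: Box, max_rounds: int = 10) -> int:
    """Loop until a pass finds no drift; return how many passes it took."""
    for round_number in range(1, max_rounds + 1):
        if reconcile_once(desired, box) == 0:
            return round_number
    raise RuntimeError("box did not converge to the desired settings")


if __name__ == "__main__":
    desired = {"theme": "dark", "language": "en"}
    flaky_box = Box(failures=1)
    rounds = reconcile(desired, flaky_box)
    print(rounds, flaky_box.read() == desired)
```

A single fire-and-forget push would have left the flaky box missing one setting. The loop repairs it because every pass re-measures the actual state instead of assuming the previous write succeeded.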


What do you want an individual contributor or architect to leave your talk with?


To be able to walk away and look at control systems that distribute settings or configuration, and just be able to tell whether they're stable or likely to be stable.  

Speaker: Colm MacCárthaigh

Senior Principal Engineer @awscloud

Colm is an engineer at Amazon Web Services. For just over ten years Colm has been building some of the largest services at AWS, including Amazon EC2, S3, ELB, CloudFront, and Route 53. Colm is also an active Open Source contributor and is the main author of Amazon s2n, AWS's Open Source implementation of TLS/SSL, as well as a member of the Apache Software Foundation and a core contributor to Apache httpd and apr. On evenings and weekends, Colm is an Irish folk musician and singer who regularly tours, produces and records albums, and enjoys teaching workshops.


Similar Talks

Are We Really Cloud-Native?


Director of Technology @Luminis_eu

Bert Ertman

The Trouble With Learning in Complex Systems


Senior Cloud Advocate @Microsoft

Jason Hand

What Breaks Our Systems: A Taxonomy of Black Swans


Site Reliability Engineer @Slack, Contributor to Seeking SRE, & SRECon Steering Committee

Laura Nolan

Scaling Infrastructure Engineering at Slack


Senior Director of Infrastructure Engineering @Slack

Julia Grace