Presentation: Presidential Campaigns & Immutable Infrastructure
What You’ll Learn
- Understand how to build systems that are designed to fail in graceful ways.
- Hear stories of rapid growth in very short periods of time.
- Learn techniques and practices to improve Site Reliability Engineering.
Abstract
Hillary for America was arguably one of 2016’s largest startups. It was in the news every day, raised over a billion dollars, and grew at an incredibly fast rate. There was even a very splashy exit. But what isn’t often talked about is the technical infrastructure behind it. Over the course of 18 months, HFA tech’s SRE team built and ran an immutable infrastructure, supporting a tech org that started with one developer and grew to 80, letting people deploy hundreds of times a day with little to no downtime. In this talk Michael will explore how the campaign systematically approached every design decision to stay true to immutable principles, leveraging AWS infrastructure along with open source technology like Packer, Ansible, Consul, and a healthy dose of Varnish.
QCon: Aside from supporting a website, people might ask why would a Presidential campaign need immutable infrastructure? What are some use cases that the team had to handle and how large was the team?
Michael: I joined Hillary for America at the beginning of the campaign in June of 2015. At that point, we were doing just a few things: collecting money online, getting people to sign up for emails, and keeping people engaged with the website.
From there, we built out some of the initial products the campaign had before moving on to creating a site reliability engineering team which handled build and deploy tooling. But, most importantly, we architected the more than 70 microservices that ran throughout the campaign.
All 70 of these microservices were built on immutable infrastructure and did lots of interesting work: taking money, signing people up for events, powering call tools, sync tools, voter protection tools; the list goes on and on.
By the end of the campaign, we were an SRE team of four and a tech team of 80 (including more than 50 software engineers pushing code every single day).
QCon: What are you going to discuss in your talk?
Michael: Basically, my motivation is to answer the question: with 50 engineers pushing code as fast as they can and an SRE team of four, how do you do that?
How do you balance the needs of your developers against reliability? That's where immutable infrastructure came in. I want to talk about how it became the handshake agreement between developers who are moving insanely fast and the reliability concerns of the SRE team.
I want to talk about what our stack looked like, and how we did it. I will also talk about some of the tools we used like Consul and Varnish.
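For context on what "immutable" means in practice here: instead of mutating running servers, each deploy bakes a fresh machine image and rolls it out. A minimal Packer template in that spirit, using the Ansible provisioner the abstract mentions (the service name, AMI ID, and playbook path are hypothetical, not the campaign's actual configuration):

```json
{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "us-east-1",
      "source_ami": "ami-0123456789abcdef0",
      "instance_type": "t2.micro",
      "ssh_username": "ubuntu",
      "ami_name": "donate-service-{{timestamp}}"
    }
  ],
  "provisioners": [
    {
      "type": "ansible",
      "playbook_file": "playbooks/donate-service.yml"
    }
  ]
}
```

Because every build produces a new, timestamped AMI, rolling back is just pointing the autoscaling group at the previous image, and no server ever drifts from what was tested.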
QCon: Who is the primary audience you're talking to in your talk?
Michael: I'm talking to somebody who is an architect, a reliability engineer, or a person who is in the position of making decisions. They're not just implementing; they are making calls on how to prioritize what they're working on and what the tradeoffs are.
What I want them to come away with is an understanding that even with the best immutable infrastructure plan, you will fail. So the question is really about how you build an immutable infrastructure system that will scale but, more importantly, allows you to fail gracefully and in a way that your users don't notice.
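One concrete way the "users don't notice" idea plays out with the Varnish layer mentioned above: Varnish can keep serving stale cached pages while a backend is down ("grace mode"). A minimal VCL sketch of that behavior (the backend address and timings are illustrative, not the campaign's actual configuration):

```vcl
vcl 4.0;

backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_backend_response {
    # Serve objects fresh for one minute, but keep them around
    # for hours past their TTL so Varnish can fall back to
    # stale content if the backend stops responding.
    set beresp.ttl = 60s;
    set beresp.grace = 6h;
}
```

With a configuration like this, a backend outage degrades to slightly stale pages rather than errors, which is exactly the kind of graceful failure the talk describes.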
Similar Talks
Scaling DB Access for Billions of Queries Per Day @PayPal
Software Engineer @PayPal
Petrica Voicu
Psychologically Safe Process Evolution in a Flat Structure
Director of Software Development @Hunter_Ind
Christopher Lucian
PID Loops and the Art of Keeping Systems Stable
Senior Principal Engineer @awscloud
Colm MacCárthaigh
Are We Really Cloud-Native?
Director of Technology @Luminis_eu
Bert Ertman
The Trouble With Learning in Complex Systems
Senior Cloud Advocate @Microsoft
Jason Hand
How Did Things Go Right? Learning More From Incidents
Site Reliability Engineering @Netflix
Ryan Kitchens
Making a Lion Bulletproof: SRE in Banking
IT Chapter Lead Site Reliability Engineering @ingnl
Janna Brummel
What Breaks Our Systems: A Taxonomy of Black Swans
Site Reliability Engineer @Slack, Contributor to Seeking SRE, & SRECon Steering Committee
Laura Nolan
Cultivating High-Performing Teams in Hypergrowth
Chief Scientist @n26