Presentation: Managing Millions of Data Services @Heroku
What You’ll Learn
- Learn about the evolution of Heroku servers and services.
- Hear approaches to reducing late-night calls and pager churn.
- Understand new ways of thinking about fleet orchestration, immutable infrastructure, and managing cloud resources.
Abstract
Over the years, Heroku Data's offerings have continued to grow and meet ever-higher demands with Postgres, Kafka, and Redis. Performing repairs and maintenance, applying patches, and auditing a fleet of millions creates serious time constraints. We'll walk through the evolution of fleet orchestration, immutable infrastructure, security auditing, and more to see how the data services for many Salesforce customers, start-ups, and hobby developers alike are managed with as little human interaction as possible.
QCon: What is the focus of your work today?
Gabriel: My main focus is running our fleet efficiently, securely, and performantly. I want to make sure our services are highly available and provide the most bang for the buck compared with what companies have had to build in-house, removing the kludge so other engineering organizations can get back to analyzing and solving problems.
QCon: What’s the motivation for your talk?
Gabriel: I’ve done cloud computing and DevOps for the last four years, and honestly, I hear the same complaints all the time about how ragged engineers are run by on-call and rolling out code. There’s so much that can be improved in running databases, app servers, and monitoring. This talk is about empathizing with my fellow on-call engineers and hopefully providing a new idea or way of thinking to address the problems of managing large fleets.
QCon: How would you rate the level of this talk?
Gabriel: I’d say it’s a medium-level talk. We’ll get into recent, real scenarios that all web-based companies using cloud technologies face, and ways to keep services alive during seriously impactful events. I’ll have examples of code and architecture, with a bit of theory sprinkled in as well.
QCon: Can you give me an example of some of the things you'll discuss?
Gabriel: One example I'm going to address is the Amazon Web Services S3 incident that happened in February, because it practically brought down a third of the Internet. Frankly, we weren't unaffected; we were affected as much as everyone else, I think, but what made it different for us is that we had enough stability in place to keep things up and running while the S3 incident was being worked on.
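The talk goes into the specifics, but as a rough illustration of the kind of stability measure that helps during a regional S3 outage, here is a minimal sketch of cross-region read fallback. This is not Heroku's actual code; the bucket names, regions, and the fetch_artifact helper are hypothetical.

```python
# Hypothetical sketch: keep reads available during a regional S3 outage
# by falling back to a replica bucket in another region.
import boto3
import botocore.exceptions

PRIMARY = {"region": "us-east-1", "bucket": "fleet-artifacts-us-east-1"}
FALLBACK = {"region": "us-west-2", "bucket": "fleet-artifacts-us-west-2"}


def fetch_artifact(key: str) -> bytes:
    """Read an object from the primary bucket, falling back to a
    cross-region replica if the primary region is unavailable."""
    for target in (PRIMARY, FALLBACK):
        client = boto3.client("s3", region_name=target["region"])
        try:
            response = client.get_object(Bucket=target["bucket"], Key=key)
            return response["Body"].read()
        except (botocore.exceptions.ClientError,
                botocore.exceptions.EndpointConnectionError):
            continue  # primary region failed; try the next one
    raise RuntimeError(f"artifact {key!r} unavailable in all regions")
```

The same pattern applies to any dependency with a regional blast radius: replicate ahead of time, and make the fallback path part of normal code rather than an emergency runbook.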