Presentation: Managing Millions of Data Services @Heroku
What You’ll Learn
- Learn about the evolution of Heroku servers and services.
- Hear approaches to reducing late-night calls and pager churn.
- Understand new ways of thinking about fleet orchestration, immutable infrastructure, and managing cloud resources.
Abstract
Over the years, Heroku Data's offerings have continued to grow and meet ever-higher demands with Postgres, Kafka, and Redis. Performing repairs and maintenance, applying patches, and auditing a fleet of millions creates serious time constraints. We'll walk through the evolution of fleet orchestration, immutable infrastructure, security auditing, and more to see how the data services for many Salesforce customers, start-ups, and hobby developers alike are managed with as little human interaction as possible.
QCon: What is the focus of your work today?
Gabriel: My main focus is running our fleet efficiently, securely, and performantly. I want to make sure our services are highly available and provide the most bang for the buck compared with what companies have had to build in-house, removing the kludge so other engineering organizations can get back to analyzing and solving problems.
QCon: What’s the motivation for your talk?
Gabriel: I’ve done cloud computing and DevOps for the last four years, and honestly, I hear the same complaints all the time about how ragged engineers are run by on-call and rolling out code. There’s so much that can be improved in running databases, app servers, and monitoring. This talk is about empathizing with my fellow on-call engineers and hopefully providing a new idea or way of thinking to address the problems of managing large fleets.
QCon: How would you rate the level of this talk?
Gabriel: I’d say it’s a medium-level talk. We’ll get into recent, real scenarios that all web-based companies using cloud technologies face, and ways to keep services alive during seriously impactful events. I’ll have examples of code and architecture, with a bit of theory sprinkled in as well.
QCon: Can you give me an example of some of the things you'll discuss?
Gabriel: One example I'm going to address is the Amazon Web Services S3 incident that happened in February, because it practically brought down a third of the Internet. Frankly, we weren't unaffected; we were affected as much as everyone else, I think, but what made it different for us is that we had enough stability in place to keep things up and running while the S3 incident was being worked on.
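The talk goes into the specifics, but as a rough illustration of the kind of stability measure that helps during a regional S3 outage, here is a minimal sketch of cross-region read fallback. This is not Heroku's actual code; the bucket names, regions, and the fetch_artifact helper are hypothetical.

```python
# Hypothetical sketch: keep reads available during a regional S3 outage
# by falling back to a replica bucket in another region.
import boto3
import botocore.exceptions

PRIMARY = {"region": "us-east-1", "bucket": "fleet-artifacts-us-east-1"}
FALLBACK = {"region": "us-west-2", "bucket": "fleet-artifacts-us-west-2"}


def fetch_artifact(key: str) -> bytes:
    """Read an object from the primary bucket, falling back to a
    cross-region replica if the primary region is unavailable."""
    for target in (PRIMARY, FALLBACK):
        client = boto3.client("s3", region_name=target["region"])
        try:
            response = client.get_object(Bucket=target["bucket"], Key=key)
            return response["Body"].read()
        except (botocore.exceptions.ClientError,
                botocore.exceptions.EndpointConnectionError):
            continue  # primary region failed; try the next one
    raise RuntimeError(f"artifact {key!r} unavailable in all regions")
```

The same pattern applies to any dependency with a regional blast radius: replicate ahead of time, and make the fallback path part of normal code rather than an emergency runbook.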