Presentation: Managing Millions of Data Services @Heroku

Track: Immutable Infrastructures: Orchestration, Serverless, and More

Location: Broadway Ballroom South, 6th fl.

Level: Intermediate - Advanced

Persona: DevOps Engineer

What You’ll Learn

  • Learn about the evolution of Heroku servers and services.
  • Hear approaches to reducing late-night calls and pager churn.
  • Understand new ways of thinking about fleet orchestration, immutable infrastructure, and managing cloud resources.

Abstract

Over the years, Heroku Data's offerings have continued to grow and meet ever-higher demand across Postgres, Kafka, and Redis. Performing repairs and maintenance, applying patches, and auditing a fleet of millions of services creates serious time constraints. We'll walk through the evolution of fleet orchestration, immutable infrastructure, security auditing, and more to show how Heroku manages data services for Salesforce customers, start-ups, and hobby developers alike with as little human interaction as possible.

Question: 

QCon: What is the focus of your work today?

Answer: 

Gabriel: My main focus is running our fleet efficiently, securely, and performantly. I want to make sure our services are highly available and deliver more bang for the buck than what companies have had to build in-house, removing the kludge so that other engineering organizations can get back to analyzing and solving problems.

Question: 

QCon: What’s the motivation for your talk?

Answer: 

Gabriel: I’ve done cloud computing and DevOps for the last four years, and honestly, I hear the same complaints all the time about how engineers are run ragged by on-call duty and rolling out code. There’s so much to improve in running databases, app servers, and monitoring. This talk is about empathizing with my fellow on-call engineers and, hopefully, providing a new idea or way of thinking to address the problems of managing large fleets.

Question: 

QCon: How would you rate the level of this talk?

Answer: 

Gabriel: I’d say it’s a medium-level talk. We’ll get into recent, real scenarios that every web-based company using cloud technologies faces, and ways to keep services alive during seriously impactful events. I’ll have examples of code, architecture, and a bit of theory sprinkled in as well.

Question: 

QCon: Can you give me an example of some of the things you'll discuss?

Answer: 

Gabriel: One example I'm going to address is the Amazon Web Services S3 incident that happened in February, because it practically brought down a third of the Internet. Frankly, we weren't unaffected; I think we were hit as hard as everyone else. What made it different for us, I think, is that we had enough stability in place to keep things up and running while the S3 incident was being worked on.

Speaker: Gabriel Enslein

Senior Infrastructure Engineer @Heroku

Gabe joined the Heroku Data team late last year and works on fleet orchestration and infrastructure optimization for Postgres. Before that, he spent two years at CareerBuilder as a full-stack engineer on the API Routing & Authorization team, focusing on DevOps and infrastructure design.

