Presentation: Scaling Infrastructure Engineering at Slack
This presentation is now available to view on InfoQ.com
What You’ll Learn
- Hear about Slack’s journey in changing their infrastructure from one that served a game to an online collaboration tool for companies and enterprises.
- Find out some of their war stories and lessons learned along the way.
- Hear about their distributed infrastructure that needs to scale, and the hiring process they used along the way.
Abstract
In August of 2016, I was asked to build Slack’s first Infrastructure engineering organization. The company was a little over two years old, and we were approaching the scalability limits of the original infrastructure written by the founders several years prior. Things were starting to break in strange and unpredictable ways.
Organizations much larger than we had initially envisioned were using Slack. Thousands of developers were building on our external APIs and stressing the system in new and unusual ways. Slack was taking tens of seconds, sometimes minutes, to load for very large teams, and we wanted to continue growing as fast as we could.
I’ll discuss the architectural and organizational challenges, mistakes and war stories of 2.5 years that followed, including how we:
- Overcame the initial scalability challenges by building out our caching tier, transitioning many of our internal APIs from broadcast to publish/subscribe, and rewriting many parts of our asynchronous job queueing system.
- Continued to operate our PHP/Hack monolith, but introduced more services and formalized how we build, deploy, and monitor those services.
- Grew the infrastructure engineering team to a global function with teams around the world.
- Defined and cultivated an engineering-led culture in a product-led company.
- Introduced product management, and how the PM role evolved within the infrastructure team.
- Identified key transition points when it was time to hire infrastructure specialists versus generalists.
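The broadcast-to-publish/subscribe transition mentioned above can be illustrated with a minimal in-memory sketch (the names and structure here are hypothetical, not Slack’s actual implementation): rather than pushing every update to every connected client, clients subscribe only to the topics they care about, and publishes fan out to just those subscribers.

```python
from collections import defaultdict


class Broker:
    """Minimal in-memory pub/sub broker (illustrative only)."""

    def __init__(self):
        # topic -> list of subscriber callbacks
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Only subscribers of this topic receive the message,
        # unlike a broadcast where every client sees every update.
        for callback in self.subscribers[topic]:
            callback(message)


received = []
broker = Broker()
broker.subscribe("channel:general", received.append)
broker.publish("channel:general", "hello")   # delivered to one subscriber
broker.publish("channel:random", "ignored")  # no subscribers listening
print(received)  # ['hello']
```

The win at scale is the same in either direction: publishers don’t need to know who is listening, and the delivery cost of an update is proportional to its actual audience rather than to the total number of connected clients.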
How long have you been with Slack and what work are you doing there?
I've been there almost four years. When I first started at Slack I ran engineering for our developer platform team -- we were responsible for the APIs that external developers use to build on top of Slack. I did that for a year, then the CTO, Cal Henderson, asked me to build an infrastructure team. The original infrastructure was largely written by the founders when they were building the game Glitch, which later became Slack. In many ways it's incredible and impressive that it got us so far, but it was never designed to be operated at our scale and growth trajectory. So I spent the next two and a half years building our infrastructure team from 10 engineers mostly in San Francisco to 100+ engineers in San Francisco, Vancouver and New York.
So this talk is war stories from that journey of building up an infrastructure team to where you are today?
Exactly. The journey, the war stories, the lessons learned over that two and a half years.
Can you give me an example of a lesson learned or something that maybe telegraphs a little bit about the type of things you'll be talking about?
Yes. The first thing that comes to mind is that infrastructure means different things to different people at different companies. At Slack, infrastructure is the organization that builds the distributed systems we run on AWS to ensure that Slack has high availability and is incredibly performant for any size team anywhere in the world. In 2016, performance varied dramatically if you used Slack in the US compared to Asia-Pacific, and I'll talk a little bit about that journey. Infrastructure also meant front-end infrastructure, a really interesting group of developers who studied distributed systems and have done a lot of back-end programming, but who build the back end of the front-end stack -- handling how we establish the WebSocket connection, parse and cache information coming across it, etc. We have a sister team that includes all our SREs and DREs and handles our interactions with AWS - they are not within Infrastructure.
This talk is all about how you build organizations in a cloud-native environment when you've got a rapidly growing user base, a company that's growing at incredible rates, and an application that has a very different footprint and usage pattern compared to consumer internet companies. For example, you might open Facebook or Twitter, interact with the site, then close it. With Slack we open a WebSocket connection and the majority of users leave it open for 8-10 hours a day. There are really interesting performance ramifications, and I'll talk about how that decision to have a WebSocket was very powerful in the beginning, but had significant ramifications on how we scaled from both a technology and organizational standpoint.
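To see why long-lived WebSocket connections change the scaling picture, a rough back-of-the-envelope calculation helps (the numbers below are purely illustrative, not Slack's): with request/response traffic, server load is driven by requests per second, but with sockets held open for hours, steady-state concurrent connections become the dominant resource.

```python
def concurrent_websocket_connections(daily_active_users, hours_connected_per_day):
    """Average concurrent connections if each active user holds a
    socket open for a fixed number of hours per day (illustrative)."""
    return daily_active_users * hours_connected_per_day / 24


# With request/response, a server holds state only for the duration of
# each request; with long-lived sockets, per-user state persists for hours.
conns = concurrent_websocket_connections(1_000_000, 9)
print(int(conns))  # 375000
```

Every one of those concurrent connections needs memory, file descriptors, and a slot on some server, which is why a connection model that was simple at launch has outsized consequences at scale.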
Is it more general or specific?
I’ll give a few examples of technology challenges that we ran into, to illustrate and lay the foundation of why infrastructure was important and why it was needed, and also to demonstrate what I like to call the “mantra of infrastructure”: when it works nobody notices, but when it’s broken everyone does. I’ll tell a few stories, and then talk about the organization of the team: how did we hire, what was our process for hiring engineers, how did we structure the team from an organizational standpoint, how did they work on different things, and some lessons there. A topic that I've discussed in the past that often really resonates with folks is when and how do you incorporate product management in an infrastructure organization. So I'll talk about our journey with product management, not having product managers, and then adding product managers later on and why we did that. This talk isn’t only about management -- when you become a very senior IC (individual contributor) that involves significant leadership, including how you think about collaborating cross-functionally with other teams. It's not a talk about how to be an engineering manager, it's a talk about the stories of infrastructure from a leadership and technology perspective.