Presentation: Scaling Infrastructure Engineering at Slack
This presentation is now available to view on InfoQ.com
What You’ll Learn
- Hear about Slack’s journey in changing their infrastructure from one that served a game to an online collaboration tool for companies and enterprises.
- Find out some of their war stories and lessons learned along the way.
- Hear about their distributed infrastructure that needs to scale, and the hiring process they used along the way.
Abstract
In August of 2016, I was asked to build Slack’s first Infrastructure engineering organization. The company was a little over two years old, and we were approaching the scalability limits of the original infrastructure written by the founders several years prior. Things were starting to break in strange and unpredictable ways.
Organizations much larger than we had initially envisioned were using Slack. Thousands of developers were building on our external APIs and stressing the system in new and unusual ways. Slack was taking tens of seconds, sometimes minutes, to load for very large teams, and we wanted to continue growing as fast as we could.
I’ll discuss the architectural and organizational challenges, mistakes and war stories of 2.5 years that followed, including how we:
- Overcame the initial scalability challenges by building out our caching tier, transitioning many of our internal APIs from broadcast to publish/subscribe, and rewriting many parts of our asynchronous job queueing system.
- Continued to operate our PHP/Hack monolith, but introduced more services and formalized how we build, deploy, and monitor those services.
- Grew the infrastructure engineering team to a global function with teams around the world.
- Defined and cultivated an engineering-led culture in a product-led company.
- Introduced product management, and how the PM role evolved within the infrastructure team.
- Identified key transition points when it was time to hire infrastructure specialists versus generalists.
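The broadcast-to-publish/subscribe transition mentioned above can be illustrated with a minimal in-memory sketch (the names and structure here are hypothetical, not Slack’s actual implementation): rather than pushing every update to every connected client, clients subscribe only to the topics they care about, and publishes fan out to just those subscribers.

```python
from collections import defaultdict


class Broker:
    """Minimal in-memory pub/sub broker (illustrative only)."""

    def __init__(self):
        # topic -> list of subscriber callbacks
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Only subscribers of this topic receive the message,
        # unlike a broadcast where every client sees every update.
        for callback in self.subscribers[topic]:
            callback(message)


received = []
broker = Broker()
broker.subscribe("channel:general", received.append)
broker.publish("channel:general", "hello")   # delivered to one subscriber
broker.publish("channel:random", "ignored")  # no subscribers listening
print(received)  # ['hello']
```

The win at scale is the same in either direction: publishers don’t need to know who is listening, and the delivery cost of an update is proportional to its actual audience rather than to the total number of connected clients.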
How long have you been with Slack and what work are you doing there?
I've been there almost four years. When I first started at Slack I ran engineering for our developer platform team -- we were responsible for the APIs that external developers use to build on top of Slack. I did that for a year, then the CTO, Cal Henderson, asked me to build an infrastructure team. The original infrastructure was largely written by the founders when they were building the game Glitch, which later became Slack. In many ways it's incredible and impressive that it got us so far, but it was never designed to be operated at our scale and growth trajectory. So I spent the next two and a half years building our infrastructure team from 10 engineers mostly in San Francisco to 100+ engineers in San Francisco, Vancouver and New York.
So this talk is war stories from that journey of building up an infrastructure team to where you are today?
Exactly. The journey, the war stories, the lessons learned over that two and a half years.
Can you give me an example of a lesson learned or something that maybe telegraphs a little bit about the type of things you'll be talking about?
Yes. The first thing that comes to mind is that infrastructure means different things to different people at different companies. At Slack, infrastructure is the organization that builds the distributed systems we run on AWS to ensure that Slack has high availability and is incredibly performant for any size team anywhere in the world. In 2016, performance varied dramatically if you used Slack in the US compared to Asia-Pacific, and I'll talk a little bit about that journey. Infrastructure also meant front-end infrastructure, a really interesting group of developers who studied distributed systems and have done a lot of back-end programming, but who build the back end of the front-end stack -- handling how we establish the WebSocket connection, parse and cache information coming across it, etc. We have a sister team that includes all our SREs and DREs and handles our interactions with AWS - they are not within Infrastructure.
This talk is all about how you build organizations in a cloud-native environment when you've got a rapidly growing user base, a company that's growing at incredible rates, and an application that has a very different footprint and usage pattern compared to consumer internet companies. For example, you might open Facebook or Twitter, interact with the site, then close it. With Slack we open a WebSocket connection and the majority of users leave it open for 8-10 hours a day. There are really interesting performance ramifications, and I'll talk about how that decision to have a WebSocket was very powerful in the beginning, but had significant ramifications on how we scaled from both a technology and organizational standpoint.
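To see why long-lived WebSocket connections change the scaling picture, a rough back-of-the-envelope calculation helps (the numbers below are purely illustrative, not Slack's): with request/response traffic, server load is driven by requests per second, but with sockets held open for hours, steady-state concurrent connections become the dominant resource.

```python
def concurrent_websocket_connections(daily_active_users, hours_connected_per_day):
    """Average concurrent connections if each active user holds a
    socket open for a fixed number of hours per day (illustrative)."""
    return daily_active_users * hours_connected_per_day / 24


# With request/response, a server holds state only for the duration of
# each request; with long-lived sockets, per-user state persists for hours.
conns = concurrent_websocket_connections(1_000_000, 9)
print(int(conns))  # 375000
```

Every one of those concurrent connections needs memory, file descriptors, and a slot on some server, which is why a connection model that was simple at launch has outsized consequences at scale.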
Is it more general or specific?
I’ll give a few examples of technology challenges that we ran into, to illustrate and lay the foundation of why infrastructure was important and why it was needed, and also to demonstrate what I like to call the “mantra of infrastructure”: when it works nobody notices, but when it’s broken everyone does. I’ll tell a few stories, and then talk about the organization of the team: how did we hire, what was our process for hiring engineers, how did we structure the team from an organizational standpoint, how did they work on different things, and some lessons there. A topic that I've discussed in the past that often really resonates with folks is when and how do you incorporate product management in an infrastructure organization. So I'll talk about our journey with product management, not having product managers, and then adding product managers later on and why we did that. This talk isn’t only about management -- when you become a very senior IC (individual contributor) that involves significant leadership, including how you think about collaborating cross-functionally with other teams. It's not a talk about how to be an engineering manager, it's a talk about the stories of infrastructure from a leadership and technology perspective.