Presentation: Spotify Lessons: Learning to Let Go of Machines
What You’ll Learn
- Understand what preconditions an organization needs in place before migrating to ephemeral/immutable infrastructure.
- Learn how to identify the parts of your infrastructure that cause issues for developers and how to reason about those parts.
- Develop ideas for managing and, through tooling, iteratively changing the parts of your system that cause the most issues.
Abstract
Spotify is currently one of the most popular music streaming services in the world, with over 100 million monthly active users. At Spotify, a team of six engineers maintains machine provisioning and capacity for the fleet that all 150+ Spotify teams run on. This talk tells the story of how Spotify's infrastructure evolved from teams owning and doting on groups of long-running servers to a distinct separation of business code and value from the underlying ephemeral machines all of Spotify's services actually run on. We'll examine how this evolution changed the way Spotify developers write code and vastly increased iteration and shipping speed. The talk will also cover a potential end state of improving the provisioning and capacity experience: a world where service developers don't need to handle or concern themselves with any of the infrastructure at all. We'll discuss why Spotify wants to move toward this state and how we're getting there.
QCon: What is the focus of your work today?
James: My focus today is Phoenix, a service my team is building to orchestrate and carry out a full rolling reset of all machines in the GCP portion of our fleet. Many of our machines require restarts to pick up the latest security updates, so Phoenix ensures that every GCP machine in the fleet is running the latest secure packages. Phoenix also enforces the concept of ephemeral infrastructure: it pushes developers to make their services stateless and robust if they aren't already. Temporarily removing an instance of any service (via the rolling resets) should not affect the overall service's performance, and Phoenix helps enforce that behavior.
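To make the rolling-reset idea concrete, here is a minimal sketch of the loop a Phoenix-style orchestrator might run. This is illustrative only, not Spotify's implementation: the fleet is an in-memory fake, and the function names are hypothetical stand-ins for real cloud calls (on GCP, roughly the managed instance group recreate and health-check APIs).

```python
import time

# Illustrative in-memory "fleet"; a real orchestrator would query the
# cloud provider (e.g. a GCP managed instance group) instead.
FLEET = {"shard-a": "stale", "shard-b": "stale", "shard-c": "stale"}

def recreate_instance(name: str) -> None:
    # Replace the machine with a fresh one built from current images,
    # which is how it picks up the latest security updates.
    FLEET[name] = "fresh"

def is_healthy(name: str) -> bool:
    return FLEET[name] == "fresh"

def rolling_reset(timeout_s: float = 600.0, poll_s: float = 0.1) -> None:
    """Recreate every instance, strictly one at a time.

    Capping unavailability at one instance is what lets a stateless,
    robust service ride out the reset with no visible performance hit;
    it is also what exposes services that are not yet stateless.
    """
    for name in list(FLEET):
        recreate_instance(name)
        deadline = time.monotonic() + timeout_s
        while not is_healthy(name):
            if time.monotonic() > deadline:
                raise TimeoutError(f"{name} never became healthy; halting reset")
            time.sleep(poll_s)
        print(f"{name}: reset complete and healthy")

if __name__ == "__main__":
    rolling_reset()
```

The one-at-a-time constraint is the design point: a real rollout would gate each step on load-balancer health checks rather than the toy is_healthy above.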
QCon: What’s the motivation for your talk?
James: It amazes me that Spotify has struck such a good balance in embracing the Ops-in-Squads model while limiting how much infrastructure and operations context squads actually need. The Ops-in-Squads model at Spotify means individual teams take on all the operational and on-call responsibilities for their own services.
The impetus for this was that a single dedicated ops team, or even several, was not scaling well with the tens of teams and hundreds of services at Spotify. It's difficult for even the best ops engineers to handle incidents for hundreds of services they have little context on. That premise seems to imply that feature teams must take on and retain a huge amount of operational context and knowledge. In practice, however, tooling from the Developer Platform Alliance, and more specifically its Infrastructure and Operations (IO) tribe, lets feature teams maintain their services without requiring too much additional context or time.
I want to share how Spotify maintains this balance and which infrastructure and ops concerns IO has removed from feature developers' responsibilities.
QCon: What do you feel is the most important thing/practice/tech/technique for a developer/leader in your space to be focused on today?
James: My initial instinct is to say: embrace ephemeral, immutable infrastructure. But the real answer is to find out, ideally through quantitative means based on operational data, which development and operational pain points developers are hitting, and to write tooling that fixes or alleviates those pain points. Ephemeral, immutable infrastructure may not be for everyone. An early-stage startup with 5 engineers probably doesn't need to dedicate an engineer or other resources to ensuring that its 2 servers use immutable deployments and can auto-scale to 100. Then again, if the data shows that stateful deployments or servers are indeed a huge operational pain point, immutable infrastructure might be the answer.
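As a sketch of what "quantitative means based on operational data" can look like, the snippet below ranks infrastructure areas by the toil they consume. The record format and area names are invented for illustration; a real version would pull from an incident tracker or pager history.

```python
from collections import Counter

# Hypothetical operational records: (infrastructure area, minutes of
# incident/toil time). Real data would come from your incident tracker.
records = [
    ("provisioning", 240), ("deploys", 30), ("provisioning", 180),
    ("dns", 20), ("deploys", 45), ("provisioning", 300),
]

toil = Counter()
for area, minutes in records:
    toil[area] += minutes

# The areas at the top of this list are where tooling investment
# (possibly, but not necessarily, immutable infrastructure) pays off first.
for area, minutes in toil.most_common():
    print(f"{area}: {minutes} min of operational pain")
```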