Presentation: Scaling Event Sourcing for Netflix Downloads
What You’ll Learn
- Understand how Netflix used event sourcing to solve a use case around media downloads.
- Learn more about how event sourcing can be used when projects are moving quickly and have evolving requirements.
- Hear lessons (and gotchas) around testing, scalability, and optimization for an event sourcing solution.
Abstract
In November of 2016 Netflix successfully launched its new Download feature, allowing users to download and play content offline on their mobile devices. This feature required us to change our previously stateless distributed licensing service to be real time and stateful. In a matter of months we needed to create a brand new stateful service that could evolve with rapid feature requirement iterations, while also being able to scale to millions of members using the feature across the globe.
This talk describes how we achieved these goals with the use of a Cassandra-backed event sourcing architecture. We describe our event store implementation, including the use of data versioning and snapshotting to provide flexibility and scale. We will cover what we learned along the way, and what we could have done better. Finally, we will review some improvements and extensions that we are planning to address going forward. Attendees will take home some compelling reasons to consider event sourcing for their architectures: its flexibility to adapt to changing business requirements, its relevance to distributed and scalable microservice architectures, and the means to replay a timeline of events and determine current or potential state.
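To make the ideas of replay and snapshotting concrete, here is a minimal in-memory sketch. It is not the Netflix implementation: the abstract describes a Cassandra-backed store, whereas this uses plain Python dictionaries. The event names, `DownloadState`, `EventStore`, and the `snapshot_every` parameter are all hypothetical and chosen for illustration only. Data versioning is left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Event:
    # Hypothetical event kinds: "license_acquired" and "license_released".
    kind: str
    title_id: str


@dataclass
class DownloadState:
    # Current state derived from events: titles a member holds a license for.
    active_titles: Set[str] = field(default_factory=set)

    def apply(self, event: Event) -> None:
        if event.kind == "license_acquired":
            self.active_titles.add(event.title_id)
        elif event.kind == "license_released":
            self.active_titles.discard(event.title_id)


class EventStore:
    """Append-only event log per member, with periodic snapshots (illustrative)."""

    def __init__(self, snapshot_every: int = 3) -> None:
        self._events: Dict[str, List[Event]] = {}
        # member_id -> (number of events covered, state at that point)
        self._snapshots: Dict[str, Tuple[int, frozenset]] = {}
        self._snapshot_every = snapshot_every

    def append(self, member_id: str, event: Event) -> None:
        log = self._events.setdefault(member_id, [])
        log.append(event)
        # Take a snapshot every N events so later loads replay fewer events.
        if len(log) % self._snapshot_every == 0:
            state = self.load(member_id)
            self._snapshots[member_id] = (len(log), frozenset(state.active_titles))

    def load(self, member_id: str, upto: Optional[int] = None) -> DownloadState:
        """Rebuild state by replaying events; `upto` replays a prefix of the timeline."""
        log = self._events.get(member_id, [])
        end = len(log) if upto is None else min(upto, len(log))
        offset, titles = self._snapshots.get(member_id, (0, frozenset()))
        if offset > end:
            # The snapshot is newer than the requested point: replay from the start.
            offset, titles = 0, frozenset()
        state = DownloadState(set(titles))
        for event in log[offset:end]:
            state.apply(event)
        return state


if __name__ == "__main__":
    store = EventStore(snapshot_every=2)
    store.append("member-1", Event("license_acquired", "title-a"))
    store.append("member-1", Event("license_acquired", "title-b"))
    store.append("member-1", Event("license_released", "title-a"))
    print(store.load("member-1").active_titles)          # current state
    print(store.load("member-1", upto=1).active_titles)  # state after the first event
```

Because the log is never mutated, the same events can be replayed to any point in the timeline, and a snapshot is only an optimization that can be discarded and rebuilt.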
QCon: How do you plan to structure your talk?
Phillipa: We're focusing the talk around the story of downloads. Early last year, we started work on a whole new service that needed to be completed by the end of the year, but the requirements for the service were going to change. So we had to design and implement a solution with these changing requirements in mind.
To build this service, we first looked at a typical SQL (and NoSQL) solution, but what we found was that an event sourcing approach was an exciting concept that gave us the flexibility to implement what we needed in a short time and still change it as needed. We're going to discuss that, discuss event sourcing full stop, and then spend a fair amount of time talking about our specific implementation.
Robert: We don't necessarily think of the approach we took as THE approach to take in this situation, but it is AN approach. We want to share our experience and what we learned. Looking back, there are definitely things we would've done differently, but, for the most part, we want to discuss how we used event sourcing to tackle a project that was moving quickly in a direction that wasn’t entirely known.
QCon: What do you want someone who comes to your talk to walk away with?
Phillipa: A better understanding of event sourcing, and how it can help them. If you have a data model that could constantly be changing (or could evolve over time) and you don't know where it’s eventually going to go, event sourcing could help. Event sourcing is this grand concept, but we want to offer concrete details on how it can assist and what the first steps might be.
Robert: In this talk, we'll give you a good starting point for event sourcing with a lot of options to choose from. What I want people to be able to take away is the ability to go back to their teams and say, "Hey, this can work at scale. There was a team at Netflix that did it in six months, and it's out there now running."