Presentation: Personalizing Netflix with Streaming Datasets
What You’ll Learn
- Assess whether a streaming solution is fit for your problem.
- Learn how to design and architect a solution for replacing a batch system with a streaming one.
- Discuss how Spark compares to Flink and how to decide which engine is best for your problem.
Abstract
Streaming applications have historically been complex to design and implement because of the significant infrastructure investment. However, recent active developments in various streaming platforms provide an easy transition to stream processing, and enable analytics applications/experiments to consume near real-time data without massive development cycles.
This talk will cover the experiences Netflix's Personalization Data team had in stream processing unbounded datasets. The datasets consisted of - but were not limited to - the stream of playback events (all of Netflix's plays worldwide) that are used as feedback for recommendation algorithms. These datasets, when ultimately consumed by the team's machine learning models, directly affect the customer's personalized experience. As such, the impact is high and the tolerance for failure is low.
This talk will provide guidance on how to determine whether a streaming solution is the right fit for your problem. It will compare microbatch versus event-based approaches, with Apache Spark and Apache Flink as examples. Finally, the talk will discuss the impact moving to stream processing had on the team's customers, and (most importantly) the challenges faced.
QCon: What are you doing at Netflix?
Shriya: I work on the data engineering team for Personalization, which, among other things, delivers the recommendations made for each user. We are responsible for the data that goes into training and scoring the various machine learning models that power the Netflix homepage.
QCon: What's the focus of your talk?
Shriya: Today, the training of our machine learning models happens offline, at most once a day. As Netflix's user base, and consequently the volume of data being collected, grows rapidly, and as researchers innovate with newer models, we are exploring whether we can train these models on a more frequently updated dataset. Going streaming also has technical advantages. As with most cloud solutions, our storage costs are much higher than our compute costs. If, instead of storing these large amounts of raw data and waiting for batch processes to pick them up, we process, aggregate, and discard them as they come in, we make far more efficient use of our cluster resources.
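The process-aggregate-discard idea can be sketched in a few lines. This is a minimal, hypothetical illustration (the class and field names are mine, not Netflix's): only a running aggregate is retained per key, so the raw events never need to be stored for a later batch job.

```python
from collections import defaultdict

class StreamingAggregator:
    """Illustrative sketch: aggregate play events incrementally as they
    arrive, keeping only the rolled-up state and discarding raw events."""

    def __init__(self):
        # Running total of watched seconds per title; this is the only state.
        self.play_seconds_by_title = defaultdict(int)

    def on_event(self, event):
        # Fold the event into the aggregate; the raw event is then dropped.
        self.play_seconds_by_title[event["title_id"]] += event["duration_sec"]

agg = StreamingAggregator()
for e in [{"title_id": "t1", "duration_sec": 120},
          {"title_id": "t2", "duration_sec": 60},
          {"title_id": "t1", "duration_sec": 30}]:
    agg.on_event(e)

print(agg.play_seconds_by_title["t1"])  # 150
```

The storage saving is the point: state grows with the number of distinct keys, not with the number of raw events.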
QCon: What are some of the considerations that you have to take into account when attempting to get to real time?
Shriya: One big thing is the accuracy of the data. Streaming data has an important temporal advantage: it's ready for access sooner. But is it as accurate and as reliable as the batched data? Batch systems tend to be very accurate because you have all the data you need to process and all your sources have reconciled. Batch systems also deal with data recovery and repair far more easily than streaming systems. These are things you need to tackle in your streaming design.
QCon: Can you give me a taste of what you might go into regarding dealing with late arriving data?
Shriya: At Netflix, all day long we receive data on what a user played and where (on the homepage) they found that content. It's possible that one of the upstream services sending this play information had a delay, and it is sending data from a play that happened four hours ago. I can't store that information based on the time it arrived; we have to determine when it was actually played. This is an example of late-arriving data. We can handle it either through the partitioning scheme of the output data or by maintaining windows in the streaming app.
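The windowing approach can be sketched as follows. This is a hedged, dependency-free illustration of the general event-time technique (as found in Spark and Flink), not Netflix's actual pipeline: events are bucketed by when the play happened rather than when the record arrived, and a watermark bounds how late an event may be before it is dropped. The names, window size, and four-hour lateness bound are assumptions taken from the example above.

```python
from collections import defaultdict

WINDOW_SEC = 3600                 # one-hour event-time windows
ALLOWED_LATENESS_SEC = 4 * 3600   # accept plays up to four hours late

class EventTimeWindows:
    """Illustrative sketch: count plays per event-time window,
    dropping events that arrive behind the watermark."""

    def __init__(self):
        self.counts = defaultdict(int)  # window start time -> play count
        self.max_event_time = 0         # highest event time seen so far

    def on_play(self, event_time_sec):
        # The watermark trails the newest event time by the allowed lateness.
        self.max_event_time = max(self.max_event_time, event_time_sec)
        watermark = self.max_event_time - ALLOWED_LATENESS_SEC
        if event_time_sec < watermark:
            return False  # too late: behind the watermark, so dropped
        # Bucket by event time (when the play happened), not arrival time.
        window_start = event_time_sec - event_time_sec % WINDOW_SEC
        self.counts[window_start] += 1
        return True

w = EventTimeWindows()
w.on_play(3 * 3600)   # play at hour 3 arrives first
w.on_play(100)        # older play arrives late, still within 4h: accepted
print(w.counts[0])    # 1
```

A play that shows up more than four hours behind the newest event seen (`w.on_play(100)` after an event at hour 20, say) would return `False` and be excluded, which is the accuracy/latency trade-off the watermark encodes.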
QCon: How do you decide between using Spark or Flink to solve these problems you have at Netflix?
Shriya: Well, different teams at Netflix use different streaming technologies, choosing the one that best fits their problem. In personalization, we care a lot about the feature-richness of the engine. The data we produce today is produced once a day; unlike many fully online systems, we are not sensitive to sub-second SLAs. Streaming data pipelines serve a variety of purposes: some are pure event routing, without much business logic baked into the pipeline; others, like ours, have the majority of the data manipulation written natively in the pipeline. That plays into our decision about which engine to use.
QCon: Have you chosen Flink over Spark?
Shriya: We are moving forward with a proof of concept that solves one of our problems in Flink; its success in production will determine future use cases.
QCon: What is the level of the talk, is it intermediate or advanced?
Shriya: It is intermediate, not advanced, because it does not go super deep into the technical details of any one streaming engine. But it's not beginner either, as I am assuming the audience has already started thinking about this problem set. It covers how to design and architect a solution for replacing a batch system with a streaming one.
QCon: What is the persona you are addressing with this talk?
Shriya: I'm talking to that person who has a batch system and is trying to do streaming. Since there are so many options out there, I'm trying to help people make an informed decision.
Similar Talks
Scaling DB Access for Billions of Queries Per Day @PayPal
Software Engineer @PayPal
Petrica Voicu
Psychologically Safe Process Evolution in a Flat Structure
Director of Software Development @Hunter_Ind
Christopher Lucian
Not Sold Yet, GraphQL: A Humble Tale From Skeptic to Enthusiast
Software Engineer @Netflix
Garrett Heinlen
Let's talk locks!
Software Engineer @Samsara
Kavya Joshi
PID Loops and the Art of Keeping Systems Stable
Senior Principal Engineer @awscloud
Colm MacCárthaigh
Are We Really Cloud-Native?
Director of Technology @Luminis_eu
Bert Ertman
The Trouble With Learning in Complex Systems
Senior Cloud Advocate @Microsoft
Jason Hand
How Did Things Go Right? Learning More From Incidents
Site Reliability Engineering @Netflix
Ryan Kitchens
Graceful Degradation as a Feature
Director of Product @GremlinInc