QCon New York June 15-19, 2020 | How Machines Help Humans Root Cause Issues @Netflix

What You’ll Learn

Learn tips and techniques for building successful tools for developers.
Understand how Netflix uses statistics and machine learning to enhance some of the tooling.
Learn approaches to pairing automation with human feedback to reduce the mean time to detect and resolve issues.

Abstract

Automated root cause detection represents a holy grail goal for many systems monitoring tools. It’s also an extremely challenging domain. A successful approach must generalize well from limited examples, handle highly dimensional data, understand the application domain, and perform well in a real-time environment where baseline behavior changes over time. Those are all problems that humans are good at but that state-of-the-art machine learning approaches often struggle with.

In this talk, we’ll discuss ways to build tools designed to enhance the cognitive ability of humans through automated analysis to speed root cause detection in distributed systems. We’ll focus on examples from large scale systems at Netflix. In particular, we’ll focus on the systems directly involved in browsing and playing Netflix movies, and how pairing automation with human feedback reduces mean time to detect and resolve issues.

Question:

You're at Netflix. What team are you working on, and what's the focus of the work that you do?

Answer:

I work on a team called Operational Insights.

Netflix basically has a structure where there are the teams that are directly in the line of fire. These are teams that build systems where you can't play a movie unless their servers are working correctly.

Then there are infrastructure teams that build a whole bunch of stuff that supports everyone else. The infrastructure teams' servers are very mission critical, so there is a need for a sort of application-specific bridge: a team that is not directly in the line of fire and so has the opportunity to build better tools.

An example of why this is useful: maybe someone in the line of fire writes a quick script that solves an immediate problem they're having, but they probably don't have the bandwidth to take that script (if it's useful) and turn it into a broader, more reusable tool. So that's kind of where we come in. We swoop in and turn these nuggets of ideas into usable, broadly accessible tools for all the teams that are directly in the line of fire.

Question:

Your talk is about helping with root cause using operational data. Is this a machine learning team?

Answer:

What's interesting about our team is we didn't initially start with machine learning. Initially, we were a team that built straightforward tools for analyzing problems, making data lookup tools, tools for dashboards, and that sort of thing. Initially, that was successful, but very quickly it became unwieldy for people to do this without help from statistics and machine learning.

There are a couple of reasons for that. Netflix is built as a bunch of microservices. It's a fairly large company, so if there are three teams involved in an interaction, just getting to the point where you can figure out which team is responsible for an interaction failing takes a decent amount of work. There's just a lot you have to look at to figure that out. So having machine learning and analytics to help surface those things became essential. That's sort of how we worked into it.
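To make the idea of surfacing the likely responsible service more concrete, here is a minimal sketch in Python. It is an illustration only, not a description of Netflix's tooling: the service names, the error-rate numbers, and the `rank_suspects` helper are all hypothetical, and a simple z-score against each service's own history stands in for whatever statistics a real system would use.

```python
from statistics import mean, stdev

def rank_suspects(baseline, current, min_sigma=1e-6):
    """Rank services by how far their current error rate deviates from history.

    baseline: dict of service name -> list of historical error rates
    current:  dict of service name -> latest observed error rate
    Returns a list of (service, z_score) pairs, most anomalous first.
    """
    scores = {}
    for service, history in baseline.items():
        mu = mean(history)
        # stdev needs at least two points; floor it so we never divide by zero.
        sigma = stdev(history) if len(history) > 1 else 0.0
        scores[service] = (current[service] - mu) / max(sigma, min_sigma)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical services taking part in one interaction (assumed data).
baseline = {
    "playback-api": [0.010, 0.012, 0.011, 0.009],
    "license-service": [0.020, 0.021, 0.019, 0.020],
    "device-auth": [0.005, 0.006, 0.004, 0.005],
}
current = {
    "playback-api": 0.011,
    "license-service": 0.090,
    "device-auth": 0.006,
}

ranked = rank_suspects(baseline, current)
for service, score in ranked:
    print(f"{service}: z = {score:.1f}")
```

The ranking does not decide anything on its own. It only narrows down where a person should look first, which is the kind of pairing of automation and human judgment described above.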

One takeaway from our experience that might be interesting to people in the room is that you don't have to start over and build AI/ML tooling from scratch. What you can do is take your existing tools and then apply machine learning on top of them. So build ML into existing tools. That's probably a better approach to getting adoption and success.

Question:

In your abstract you say this talk discusses ways to build tools designed to enhance cognitive ability through automated analysis. Can you give me an example of what you might discuss?

Answer:

One of them has to do with providing context when troubleshooting an individual trace. Let's say you are a device manufacturer, and you're trying to certify your Netflix device. What if something goes wrong in that trace? It's in the middle of a sea of traffic if this is in production, right? So how can we actually get a full sense of everything that the person did and why it might have gone wrong?

One thing people talk a lot about is capturing trace data. Zipkin is a popular tool. Dapper was a very popular paper a few years back. Those tools are useful. But what we found is that the most important thing is actually the context around the trace: presenting the data in a way that's digestible. If you think about it, even playing one movie is a stateful interaction that lasts half an hour. So if it's just a list of traces, it would still overwhelm you. Being able to basically make sense of your data and build a tool around that I think is the first step.
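As a rough illustration of what context around a trace could mean, the Python sketch below groups raw trace events into per-session timelines and, for each failure, returns the events that led up to it. Everything in it is an assumption made for the example: the `TraceEvent` fields, the session and service names, and the helper functions are hypothetical and are not taken from Zipkin, Dapper, or Netflix's tools.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TraceEvent:
    session_id: str   # hypothetical key tying events to one user session
    timestamp: float  # seconds since the start of the capture
    service: str
    operation: str
    ok: bool

def session_timelines(events):
    """Group events by session and sort each session chronologically."""
    by_session = defaultdict(list)
    for event in events:
        by_session[event.session_id].append(event)
    for timeline in by_session.values():
        timeline.sort(key=lambda e: e.timestamp)
    return dict(by_session)

def context_for_failures(timeline, before=2):
    """For each failed event, return it with up to `before` preceding events."""
    results = []
    for i, event in enumerate(timeline):
        if not event.ok:
            results.append((event, timeline[max(0, i - before):i]))
    return results

# Assumed sample data: one device session fails at playback, another is fine.
events = [
    TraceEvent("s1", 10.0, "playback-api", "play", ok=False),
    TraceEvent("s2", 1.0, "device-auth", "login", ok=True),
    TraceEvent("s1", 0.0, "device-auth", "login", ok=True),
    TraceEvent("s1", 5.0, "browse-api", "browse", ok=True),
]

timelines = session_timelines(events)
failures = context_for_failures(timelines["s1"])
for failed, preceding in failures:
    steps = " -> ".join(e.operation for e in preceding)
    print(f"{failed.operation} failed after: {steps}")
```

Filtering to one session first is what pulls a single interaction out of the sea of production traffic. Showing only the steps before the failure is one simple way to keep a half-hour interaction digestible.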

Question:

What do you feel is the most important trend in software today?

Answer:

I think it's that this is the golden age of machine learning. I think that the advances in deep learning will eventually be important even in this space. The reason deep learning doesn't work for us right now is that it requires large datasets. There's been a lot of very promising work in the last few years on one-shot deep learning (inferring rules or building reasoning maps) and that sort of thing. I think once that stuff is at a more proven stage, this whole space will go through a radical revolution.

Speaker: Seth Katz

Senior Software Engineer, Operational Insights @Netflix

Seth Katz has been responsible for building the insights tooling around Netflix servers for the past 5 years. He has specifically focused on the systems that ensure people can browse and play movies. During that time, he pioneered Netflix's streaming visualization data platforms, its contextual system tracing tools and analytics, and its anomaly systems for detecting and troubleshooting problems on Netflix’s most mission critical servers.

Prior to that he worked at Microsoft and Yahoo on scalable transaction systems.