Presentation: Conquering Microservices Complexity @Uber With Distributed Tracing
What You’ll Learn
- Find out how Uber is using distributed tracing to make sense of a large number of microservices and the interaction among them.
- Hear how Uber is analyzing streams of tracing data to pinpoint root causes of outages in distributed systems.
Abstract
Microservices bestow many benefits on the organizations adopting them, but they come with a steep price: the complexity of the resulting architecture. Distributed tracing is a recognized way of dealing with that complexity and regaining visibility into our systems. At Uber we discovered that this visibility alone is not enough to understand the system's behavior. A trace for a single request from our mobile app may consist of several thousand data points. That’s too much information for engineers to wade through, especially in a high-stress outage situation when every minute counts. Complexity strikes again.
In this talk we present a methodology that uses data mining to learn the typical behavior of the system from massive amounts of distributed traces, compares it with pathological behavior during outages, and uses complexity reduction and intuitive visualizations to guide the user towards actionable insights about the root cause of the outages. The technique has proven highly effective in drastically reducing time to mitigation for high-severity outages. The visualizations are built using open source modules that are part of Jaeger, Uber's distributed tracing platform and an incubating project at the Cloud Native Computing Foundation.
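As a rough sketch of what that methodology implies (hypothetical span tuples and function names, not Jaeger's actual data model or Uber's production code), one could learn per-service baselines from a large set of healthy traces and then rank services by how far outage-time traces deviate from them:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical, simplified span record: (service, operation, duration_ms, is_error).
# Real tracing data carries far more, but this is enough to show the diffing idea.

def summarize(traces):
    """Aggregate per-service mean latency and error rate over a set of traces."""
    durations = defaultdict(list)
    errors = defaultdict(int)
    counts = defaultdict(int)
    for trace in traces:
        for service, _operation, duration_ms, is_error in trace:
            durations[service].append(duration_ms)
            counts[service] += 1
            errors[service] += int(is_error)
    return {
        svc: {
            "mean_ms": mean(durations[svc]),
            "error_rate": errors[svc] / counts[svc],
        }
        for svc in counts
    }

def rank_deviations(baseline, incident):
    """Rank services by how far the incident summary deviates from the baseline."""
    scored = []
    for svc, stats in incident.items():
        base = baseline.get(svc, {"mean_ms": 0.0, "error_rate": 0.0})
        error_delta = stats["error_rate"] - base["error_rate"]
        latency_delta = stats["mean_ms"] - base["mean_ms"]
        scored.append((svc, error_delta, latency_delta))
    # Surface the services whose error rate or latency jumped the most.
    return sorted(scored, key=lambda s: (s[1], s[2]), reverse=True)
```

Feeding a few thousand healthy traces into summarize() and diffing the outage traces against the result is the complexity-reduction step: instead of thousands of raw spans, the responder sees a short ranked list of suspect services.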
What is the focus of your work today?
I mostly work on distributed tracing, in the larger scope of overall observability.
What's the motivation for the talk?
We've been doing distributed tracing at Uber for quite a while, about four years. So I think we have some fairly interesting tales to tell about how specifically we use it. And from what I see, we have a slightly different take on what we use distributed tracing for. Many people focus on things like performance optimizations with tracing tools, whereas Uber’s biggest use case so far has been understanding the complexity of the microservices architecture and using tracing to do root cause analysis during outages, to figure out where the issue with a specific outage is so that people can dive in deeper. That's what I'm trying to present. We've built relatively novel visualization techniques for analyzing problematic traces, which I want to share.
How would you describe the persona and the level of the target audience?
It's primarily targeted at developers working with distributed systems. There is also something for SREs and DevOps, because if your organization has a tracing team, they might do something similar and provide these tools to a broader audience in the company. In our case it was first responders and SREs who watch the business metrics at Uber and dive into the high-level outages. Those are the major beneficiaries of the tool that I'm going to describe.
What do you want attendees coming to your talk to walk away with?
The understanding that tracing data provides a goldmine of information about the behavior of your distributed system. We built tools that are based on statistical data mining, and those techniques are highly effective and very useful for any size of organization. I want people to keep in mind that data mining is a possible future for all of the tracing work that's going to happen in the next few years.
Presumably that's because of the huge amounts of data that large-scale, complex distributed systems generate, which is very hard to make sense of, right?
Correct. Tracing data is in a unique position compared to many other observability signals that we get about our systems, because it's the only tool that gives us both views at once: a macro view of the system, where you can see how many services and components participate in one single request across the board, and at the same time a very micro view. You pick one single service instance and you see what that instance is doing exactly for this particular request versus the hundreds of other requests that were running concurrently in that same instance. That's the power of tracing. When you start collecting that kind of data, if you are not doing data mining on it, you are wasting its potential for improving your observability ecosystem.
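As a rough illustration of that dual view (hypothetical span fields, not Jaeger's actual schema), the same stream of spans can be rolled up both ways: per trace, to see every service a request touched, and per service instance, to see what one process was doing for each request:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Span:
    # Hypothetical, trimmed-down span; real tracing data carries many more fields.
    trace_id: str
    span_id: str
    service: str
    instance: str       # host or container the span was recorded on
    operation: str
    duration_ms: float

def macro_view(spans):
    """Macro view: for each request (trace), which services participated."""
    services_by_trace = defaultdict(set)
    for s in spans:
        services_by_trace[s.trace_id].add(s.service)
    return services_by_trace

def micro_view(spans, instance):
    """Micro view: everything one service instance did, grouped by request."""
    work_by_trace = defaultdict(list)
    for s in spans:
        if s.instance == instance:
            work_by_trace[s.trace_id].append((s.operation, s.duration_ms))
    return work_by_trace
```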
Are you able to give a specific example, maybe an outage or some kind of system slowdown, where you applied this data mining to your distributed traces in order to understand what was going on in the system?
I can't speak about the details of specific outages, but there have been a lot of very high-level outages where, in the past, several people would spend 30 minutes looking at various logs and metrics across different systems to figure out what exactly was going wrong. And then when this new tool became available they were able to do that within a couple of minutes, because it was very precise in pointing to the exact place where the issue was happening and where you should dig deeper. We've seen these examples. In one example, Cassandra was throwing a quorum error. If you are a first responder, you might see it as a business metric: “my trips are not getting fulfilled on the Uber marketplace”. And so, going down from that level all the way to some storage component and saying, “this is the issue,” is very hard. Tracking this down from the top-level outage signal all the way to the storage layer, that was very powerful.
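A crude sketch of that drill-down (hypothetical structures and field names; the real tooling involves aggregation over many traces and purpose-built visualizations): given the spans of one failed request, follow parent/child links from the root span to the deepest span that reported an error, for instance a Cassandra quorum failure several hops below the business-level endpoint:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    # Hypothetical minimal span; field names are illustrative, not Jaeger's schema.
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    is_error: bool

def deepest_error(spans):
    """Return the error span furthest from the root, a likely place to dig deeper."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    best, best_depth = None, -1

    def walk(span, depth):
        nonlocal best, best_depth
        if span.is_error and depth > best_depth:
            best, best_depth = span, depth
        for child in children.get(span.span_id, []):
            walk(child, depth + 1)

    for root in children.get(None, []):   # root spans have no parent
        walk(root, 0)
    return best

# Example: a trip request failing because Cassandra could not reach quorum.
trace = [
    Span("1", None, "api-gateway", "POST /trips", True),
    Span("2", "1", "trip-service", "createTrip", True),
    Span("3", "2", "storage-gateway", "write", True),
    Span("4", "3", "cassandra", "quorum write", True),
]
print(deepest_error(trace).service)   # -> cassandra
```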