Presentation: Debugging Microservices: How Google SREs Resolve Outages
This presentation is now available to view on InfoQ.com
What You’ll Learn
- Learn techniques and approaches Google SREs take to debug and troubleshoot their systems.
- Understand how to make more effective use of recent advances in tooling for debugging distributed systems.
- Hear war stories of large-scale distributed systems and how tooling has helped engineers reason about them.
Abstract
When tens or hundreds of microservices provide an application's critical functionality, diagnosing which interaction between components is causing an outage can be challenging. Engineers spend a lot of time building dashboards to improve monitoring, yet when they get paged they still struggle to figure out what’s going on and how to fix it. Building more dashboards isn’t the solution; using dynamic query evaluation and integrating tracing is. Learn how SREs discover and debug problems at Google during outages, and hear real stories about our experiences.
What is the work that you do today as a Google SRE?
Adam: I work for a Google DevOps team that takes care of Monarch. Monarch is a very large time series database used for metrics collection and querying. Monarch is roughly the internal equivalent of combining Prometheus, Grafana, and Graphite from the open source world. Monarch also adds all of Stackdriver to that stack and provides the backend for a lot of our cloud signals products. My role is SRE-SWE, which means I'm involved in the software engineering side as well. So a lot of my time is spent taking apart Monarch and putting it back together more durably and more reliably. Durability is especially important because Monarch is a globally distributed system (it runs in every single availability zone).
Liz: I’ve changed roles. I used to be the manager of the Bigtable Site Reliability Engineering Team. That role encompassed both cloud Bigtable and the internal-facing Bigtable API. Over the past year, I've shifted from managing that team to advising Google Cloud Platform users on how to better engineer and structure their systems to effectively scale up in the cloud. That's what my current team (Google Cloud Customer Reliability Engineering Team) does.
Can you give me an idea of the scope and size we’re talking about with Monarch?
Adam: I can’t be specific, but it’s very large in terms of both QPS and resources. The quantity of data per stream varies enormously, from an occasional single byte to a constant stream of high-cardinality data. The same applies to the query side: some queries need only fetch a single stream, and some need to fetch and aggregate a lot of them. Some consumers are doing ad hoc queries, and other teams are doing a tremendous number of queries per second to inform their actual customer-facing products. Without Monarch, we have no monitoring or alerting, so it’s a critical system.
What’s the motivation for this talk?
Adam: It used to be the case that microservices and distributed systems were the exception rather than the rule. I think that pattern has inverted in the past few years. Five years ago, the majority of the apps that I worked on were Rails monoliths, for example, and where they weren’t, people were building systems that looked like Rails monoliths. But it seems to me that the default these days is embracing a distributed microservices architecture and all of the challenges that come along with that. I feel the vast majority of research, information, and documentation is from that old world, and a lot of the strategies just don’t really apply anymore. There's a shocking lack of real-world ‘war stories’, especially about the consequences of dealing with complex distributed systems. I think we need to do more of that because we are a company that sells a platform on which those systems are supposed to be built. I think we need to do a better job as an industry of publicizing those best practices. That's at least what I'm seeking to do.
Liz: I broadly agree with that. We don't necessarily talk about specific products in this talk. Instead, we talk about the design and debugging principles. We try to answer questions like: ‘how do you set up your system so that you can be confident that it's going to work?’ My motivation is that I want people to be successful at using cloud services, at using microservice architectures, and at scaling up without scaling up their operations team (or having their operations team get woken up more at night). We see a lot of people struggling with this, and we want to help them develop the tools to make their lives better.
Netflix talks about this paved road: certain infrastructure pieces are in place to make it easier to deploy, debug, and troubleshoot. How do you respond to someone who says, “I don't have Google's infrastructure; I can't do the things that you do because I don't have the resources”?
Liz: There are tools that exist which make what we're talking about possible at many different organizational levels. The tools just have to be put together in the right ways. For example, a lot of people don't see the value of distributed tracing because it's been unwieldy and clumsy. So we're trying to demonstrate to people that it's actually worth setting up, and, yes, that you can get concrete gains out of it relatively quickly. The premise is that we're at a point where the adoption curve is taking off and people need to know what the incentives are for using some of these tools. I think that applies whether you're at Google scale or not.
When you talk about that distributed tracing, are you talking about a specific tool or a Google tool or something that's available to anybody?
Adam: We show examples from a Google internal tool called ‘Dapper’. I deliberately focused on an internal tool that the majority of people do not have access to, specifically so it was clear that we are focusing on the principles. It's clearly not a sales pitch, since we’re not selling this tool. This is about the things that you can do if you have a really great monitoring system that can visualize heat maps and can do tracing. All of the things that we're talking about are available in products, but we focus on the principles.
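To make that principle concrete, here is a minimal sketch of what such instrumentation can look like, using the open-source OpenTelemetry Python SDK as a stand-in for an internal tool like Dapper; the service, span, and attribute names are invented for illustration, not taken from the talk:

```python
# A minimal sketch of distributed-tracing instrumentation with OpenTelemetry.
# Assumes the opentelemetry-api and opentelemetry-sdk packages are installed;
# all names below are hypothetical.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to stdout; in a real system you
# would export spans to a tracing backend instead of the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-frontend")  # hypothetical service name

def handle_request(user_id: str, cart_size: int) -> None:
    # Each request becomes a trace; each downstream call becomes a child span,
    # so an outage can be narrowed to the slow or failing hop.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("cart.size", cart_size)
        with tracer.start_as_current_span("call_inventory_service"):
            pass  # RPC to a downstream microservice would go here
        with tracer.start_as_current_span("call_payment_service"):
            pass  # another downstream hop

handle_request("user-123", 3)
```

The point is not the particular SDK: any tracing system that records per-request spans with attributes gives you the raw material for the kind of debugging the talk describes.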
Liz: We’re talking about the principles. We think people can start laying the groundwork now, even if the specific tooling we’re talking about isn’t in the software they’re using yet; we expect it will be widely available in the near future.
Who's the primary persona that you're kind of envisioning sitting in the audience during the talk?
Adam: For me, it's myself about 18 months ago. I don't feel like I have any magical powers now that I didn't back then. It was more just ignorance. I didn't know that a lot of the techniques that Google uses on a daily basis are techniques I could have used back then to make my life so much easier. If I look back to how I used to debug distributed systems back then, it was positively Neolithic. I’d like other people to walk away from this talk with a similar impression like “Wow! My tools kind of suck. Why am I living like this?”
What do you want someone who comes to your talk to walk away with?
Adam: I’d really like them to come away with a more critical eye towards the tools that they are using to debug. I'd like them to walk away with confidence that, if they do things in a different way, they'll still be able to get at their data and solve problems better.
Liz: It is possible to find a solution to the curse of cardinality. You can actually see both the forest and the trees at the same time. There are ways of doing it, and you can start accelerating your adoption by doing tracing with an understanding of the context around the trace.
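As a rough illustration of that ‘forest and trees’ idea (this is not any particular tracing product's API; the trace summaries and field names below are hypothetical), you can keep high-cardinality context on each trace, aggregate across traces to see where a problem concentrates, and then drill into individual exemplar traces:

```python
# Hypothetical trace summaries, as they might be exported from a tracing
# backend, each carrying high-cardinality context (endpoint, customer).
from collections import defaultdict
from statistics import median

traces = [
    {"trace_id": "a1", "endpoint": "/checkout", "customer": "acme",   "latency_ms": 950},
    {"trace_id": "b2", "endpoint": "/checkout", "customer": "acme",   "latency_ms": 40},
    {"trace_id": "c3", "endpoint": "/search",   "customer": "globex", "latency_ms": 35},
]

def slowest_group(traces, key):
    """The forest: median latency per value of a high-cardinality key."""
    groups = defaultdict(list)
    for t in traces:
        groups[t[key]].append(t["latency_ms"])
    return max(groups.items(), key=lambda kv: median(kv[1]))

group, latencies = slowest_group(traces, "customer")
print("Slowest group:", group, "median latency:", median(latencies), "ms")

# The trees: pull exemplar traces for that group to inspect span by span.
exemplars = [t["trace_id"] for t in traces if t["customer"] == group]
print("Exemplar traces to inspect:", exemplars)
```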