Presentation: Datadog: A Real Time Metrics Database for One Quadrillion Points/Day
What You’ll Learn
- Hear about metrics databases and what they are good for.
- Find out about some of the lessons learned architecting a horizontally scalable metrics database to handle a large number of data points per day.
- Learn about the plans in motion to handle even more input data points in the future, in a way that scales sublinearly with input load, using a technology called sketches.
Abstract
In the course of its 8 years of existence, Datadog has grown its real time metrics systems that collect, process, and visualize data to the point where they now handle trillions of points per day. This has been based on an architecture combining open source technologies such as Apache Cassandra, Kafka, and PostgreSQL with a lot of in-house software, particularly for in-memory data storage and querying, and especially for efficiently computing distribution metrics. In this talk, Datadog's VP of Metrics and Alerts Ian Nowland and Director of Distribution Metrics Joel Barciauskas will speak to the challenges we face, how the architecture has evolved to cope with them, and what we are looking to in the future as we architect for a quadrillion points per day.
Can you tell me a bit about the challenges with metrics collected by Datadog?
Our primary challenge is scaling to meet our customers’ demand for fast queries on multi-dimensionally increasing data, in a cost-efficient manner. As our customers’ needs become more complex, they want to send us many more metric points, with many more tags to query over, to get the insight they want into their systems. We are focused on finding ways to efficiently handle that scale, leveraging both traditional techniques like optimizing and horizontally scaling data indexes, and modern approaches like approximations (sketches) that allow us to scale sub-linearly with input load.
The motivation for the talk?
It's definitely not a vendor pitch. We're an engineering-driven culture, and that leads us to want to share the problems we have and the ways we have solved them. We think the challenges we're tackling in scaling across multiple dimensions of data growth mirror what many in the industry are seeing, so we thought this would be a good time to share how we're approaching them.
Can you tell me a bit about these time-series challenges?
The first challenge we face is how to build a metrics database that scales well horizontally for both point load and tag load. Our lessons there all come from embracing the distinct customer use cases that free us from needing to build a general time series database, and then applying good scalable systems design principles. One of our newer challenges is providing accurate percentiles over streams of data. As our customers become more interested in SLAs and SLOs, rather than querying the average, sum, or other traditional aggregations, they want things like the 99th percentile of their request latency or request size. The naive way to do that is to store every single point, but we're using “approximate” data structures to provide accurate and fast answers without having to scale linearly with the number of points and values that customers are sending us.
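To make that idea concrete, here is a minimal, illustrative quantile sketch in Python that keeps logarithmic buckets so every stored value stays within a fixed relative error of its bucket's representative. It is only a sketch of the general approach described above, not Datadog's actual implementation; the class name, parameters, and example values are hypothetical.

```python
import math
from collections import Counter

class ToyQuantileSketch:
    """Illustrative relative-error quantile sketch (hypothetical, simplified).

    Values are mapped to logarithmic buckets, so any stored value is within
    a relative error of `relative_accuracy` of its bucket's representative.
    Assumes strictly positive values and omits the memory-bounding logic
    (e.g. collapsing rarely used buckets) a production sketch would need.
    """

    def __init__(self, relative_accuracy=0.01):
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.buckets = Counter()  # bucket index -> count of values
        self.count = 0

    def add(self, value):
        # ceil(log_gamma(value)) places the value into a logarithmic bucket.
        self.buckets[math.ceil(math.log(value, self.gamma))] += 1
        self.count += 1

    def quantile(self, q):
        # Walk the buckets in order until the running count passes the
        # target rank, then return that bucket's representative value.
        target = q * (self.count - 1)
        running = 0
        for index in sorted(self.buckets):
            running += self.buckets[index]
            if running > target:
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise ValueError("empty sketch")

# One million latency points, but the sketch only stores a few hundred
# bucket counters; the p99 estimate is within ~1% of the true value.
sketch = ToyQuantileSketch(relative_accuracy=0.01)
for ms in range(1, 1_000_001):
    sketch.add(ms / 1000.0)
print(sketch.quantile(0.99))
```

The key trade-off is that memory grows with the number of buckets, which depends on the value range and the accuracy target rather than on the number of points ingested.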
When you talk about approximate data structures, are these proprietary data structures or in the wild?
These are data structures that are in the wild. The community calls this approach “sketches”. We have open sourced our implementation as part of the Datadog agent, which is Apache-licensed, so the code is there for anyone to go and take a look at. We also have a paper that was accepted at the 2019 VLDB conference that our data science team will be talking about in August, and we'll be open-sourcing standalone versions in several languages at that time as well.
What's the difference between approximation and probabilistic?
An approximation gives you something close to the value you ask for. For example, it is impossible to compute the median of a sequence of numbers exactly in one pass without holding onto most of that data. Instead you need to use something like a sketch, which is guaranteed to give you an approximation while holding on to much less data.
Some sketches are probabilistic; for example, they might need to choose a random hash function. Probabilistic sketches still have approximation guarantees, but with a failure probability: there is a small chance that the approximation can be very bad.
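As an illustration of that distinction, here is a toy count-min sketch in Python, a well-known probabilistic frequency sketch, used purely as an example of the category rather than as anything Datadog is described as using. It picks a family of random, salted hash functions; its estimate never undercounts, and it stays close to the true count except with a small failure probability. All names and parameters are hypothetical.

```python
import hashlib
import random

class ToyCountMinSketch:
    """Illustrative count-min sketch (hypothetical, simplified).

    Estimates how many times each item was seen. Width controls the size of
    the approximation error; depth (the number of random hash functions)
    controls the probability that the error bound is exceeded.
    """

    def __init__(self, width=1000, depth=5, seed=42):
        self.width = width
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, salt):
        digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, salt in enumerate(self.salts):
            self.table[row][self._bucket(item, salt)] += count

    def estimate(self, item):
        # Each row overestimates because of hash collisions; taking the
        # minimum across rows keeps the error small, except with a small
        # failure probability that shrinks as depth grows.
        return min(self.table[row][self._bucket(item, salt)]
                   for row, salt in enumerate(self.salts))

# Estimates never undercount, and with high probability they are close to
# the true counts even though the table is far smaller than the stream.
sketch = ToyCountMinSketch()
sketch.add("GET /home", 10_000)
sketch.add("GET /login", 250)
print(sketch.estimate("GET /home"), sketch.estimate("GET /login"))
```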
Who are you talking to?
We are talking to people who have similar needs to ours in terms of scaling systems that handle metrics and/or time series data. Our customers need answers about what's happening in their systems in as close to real time as possible. Building a scalable storage architecture and using approximations are how we minimize that latency and get answers to people as quickly as possible.