Presentation: Datadog: A Real Time Metrics Database for One Quadrillion Points/Day

Track: Data Engineering for the Bold

Location: Empire Complex, 7th fl.

Duration: 5:25pm - 6:15pm

This presentation is now available to view on InfoQ.com

What You’ll Learn

  1. Hear about metrics databases and what they are good for.
  2. Learn some of the lessons from architecting a horizontally scalable metrics database that handles a large number of data points per day.
  3. Hear about the plans in motion to handle even more input data points in the future, in a way that scales sublinearly with input load, using a technology called sketches.

Abstract

In the course of its 8 years of existence, Datadog has grown its real-time metrics systems that collect, process, and visualize data to the point that they now handle trillions of points per day. This is based on an architecture that combines open source technologies such as Apache Cassandra, Kafka, and PostgreSQL with a lot of in-house software, particularly for in-memory data storage and querying, and especially for efficiently computing distribution metrics. In this talk, Datadog's VP of Metrics and Alerts, Ian Nowland, and Director of Distribution Metrics, Joel Barciauskas, speak to the challenges they face, how the architecture has evolved to cope with them, and what they are looking toward as they architect for a quadrillion points per day.

Question: 

Can you tell me a bit about the challenges with metrics collected by Datadog?

Answer: 

Our primary challenge is scaling to meet our customers' demand for fast queries on data that is growing along multiple dimensions, in a cost-efficient manner. As our customers' needs become more complex, they want to send us many more metric points, with many more tags to query over, to get the insight into their systems that they need. We are focused on finding ways to handle that scale efficiently, leveraging both traditional techniques like optimizing and horizontally scaling data indexes, and modern approaches like approximations (sketches) that allow us to scale sub-linearly with input load.

Question: 

The motivation for the talk?

Answer: 

It's definitely not a vendor pitch. We're an engineering-driven culture, and that leads us to want to share the problems we have and the ways we have solved them. We think the complexity of the challenges we're tackling, scaling around multiple dimensions of data growth, mirrors what many in the industry are seeing, so we thought this would be a good time to share how we're approaching that challenge.

Question: 

Can you tell me a bit about these time-series challenges?

Answer: 

The first challenge we face is how to build a metrics database that scales well horizontally for both point load and tag load. Our lessons there all come from embracing the distinct customer use cases that free us from needing to build a general time series database, and then just applying good scalable systems design principles. One of our newer challenges, though, is providing accurate percentiles over streams of data. As our customers become more interested in SLAs and SLOs, they want to query, rather than averages, sums, or other traditional aggregations, things like the 99th percentile of their request latency. Naively, the way you do that is by storing every single point, but we're using “approximate” data structures to provide accurate and fast answers without having to scale linearly with the number of points and values that customers are sending us.
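The interview doesn't walk through an implementation, but the trade described here, exactness for sublinear storage, can be sketched in a few lines. Below is a minimal, illustrative relative-error quantile sketch (in the spirit of the approach described, not Datadog's actual code): values are mapped into logarithmic buckets so that any quantile estimate is within a chosen relative error `alpha` of the true value, while storage grows with the number of distinct buckets rather than the number of points.

```python
import math
from collections import defaultdict

class QuantileSketch:
    """Minimal relative-error quantile sketch (illustrative only).

    Positive values are mapped to logarithmic buckets, so any
    quantile estimate is within relative error `alpha` of the true
    value, and storage grows with the number of buckets, not points.
    """

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = defaultdict(int)  # bucket index -> count
        self.count = 0

    def add(self, value):
        # Bucket i covers the interval (gamma^(i-1), gamma^i].
        i = math.ceil(math.log(value, self.gamma))
        self.buckets[i] += 1
        self.count += 1

    def quantile(self, q):
        # Walk buckets in value order until we pass the target rank.
        rank = q * (self.count - 1)
        seen = 0
        for i in sorted(self.buckets):
            seen += self.buckets[i]
            if seen > rank:
                # Representative value whose relative error to any
                # point in the bucket is at most alpha.
                return 2 * self.gamma ** i / (self.gamma + 1)
```

Feeding in the values 1 through 1000 and asking for the 0.99 quantile returns an estimate within about 1% of the true p99, while only a few hundred buckets are stored instead of a thousand points.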

Question: 

When you talk about approximate data structures, are these proprietary data structures or in the wild?

Answer: 

These are data structures that are in the wild. The community calls this approach “sketches”. We have open sourced our implementation as part of the Datadog agent, which is Apache-licensed, so the code is there for anyone to go and take a look at. We also have a paper that was accepted at the 2019 VLDB conference that our data science team will be talking about in August, and we'll be open-sourcing standalone versions in several languages at that time as well.

Question: 

What's the difference between approximation and probabilistic?

Answer: 

An approximation gives you something close to the value you asked for. For example, it is impossible to compute the median of a sequence of numbers exactly in one pass without holding onto most of that data. Instead, you need to use something like a sketch, which is guaranteed to give you an approximation while holding on to much less data.

Some sketches are probabilistic; for example, they might need to choose a random hash function. Probabilistic sketches also have approximation guarantees, but they carry a failure probability: there's a small chance that the approximation can be very bad.
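As an illustration of the probabilistic kind (not an example from the talk), here is a minimal Count-Min sketch, a classic frequency estimator built on randomly chosen hash functions. It never undercounts an item, and it overcounts badly only with small probability, when an item happens to collide with heavy items in every row.

```python
import random

class CountMinSketch:
    """Tiny Count-Min sketch: a probabilistic frequency estimator.

    Uses `depth` randomly salted hash functions over `width` counters
    each. Estimates never undercount; large overcounts happen only
    with small probability, when hashes collide in every row.
    """

    def __init__(self, width=256, depth=4, seed=42):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        # Random salts make each row behave like an independent hash.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += count

    def estimate(self, item):
        # The minimum across rows is the cell least inflated by collisions.
        return min(self.table[r][self._index(r, item)]
                   for r in range(self.depth))
```

After adding "a" 100 times and "b" 5 times, `estimate("a")` is at least 100 and at most 105: the only possible error is overcounting from collisions, never undercounting.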

Question: 

Who are you talking to?

Answer: 

We are talking to people who have similar needs to us in terms of scaling systems handling metrics and/or time series data. Our customers need answers about what's happening in their systems in as close to real time as possible. Building a scalable storage architecture and using approximations has been our method to minimize that latency and get answers to people as quickly as possible.

Speaker: Ian Nowland

VP Engineering Metrics and Alerting @datadoghq

Ian Nowland is the VP Engineering Metrics and Alerting at Datadog. Before that he was SVP Engineering Manager of the Compute Platform at Two Sigma, and he spent 8 years in AWS where his major achievement was building the team that shipped the first three generations of the EC2 Nitro platform.

Speaker: Joel Barciauskas

Director of Engineering, Distribution Metrics

Joel is an experienced lead engineer and technical manager with an extensive engineering and technical consulting background. He currently leads Datadog's distribution metrics team, providing accurate, low latency percentile measures for customers across their infrastructure.
