Track: Data Engineering for the Bold

Location: Majestic Complex, 6th fl.


Data engineering is the practice of delivering high-fidelity, custom access to data in order to serve the varied needs of a business. The rich, engaging experiences many of us expect online today (e.g. personalized news feeds, highly relevant search engines and recommender systems, smart home assistants) are powered by the modern data pipelines and architectures that form the foundation of data engineering. The tools a data engineer can deploy today occupy a vast landscape. The field may have started out as “put all of your data in that RDBMS over there,” but it has evolved into a multitude of specialty data solutions: databases (RDBMS, NoSQL, NewSQL, OLAP), messaging systems (Kafka, Kinesis, Pulsar), data compute frameworks (Spark, Flink, Ray, graph compute), storage systems (distributed file systems, block storage, object storage), search engines, real-time OLAP engines, graph databases, machine learning frameworks (Petastorm, Michelangelo), and more. As the volume and speed of data grow, we continue to discover new patterns and frameworks for squeezing more out of it. What are some of the new entrants in this space, and what interesting problems are being solved with them? Come to this track to find out!

Track Host: Sid Anand

Hacker at Large, Co-chair @QCon & Data Council, PMC & Committer @ApacheAirflow

Sid Anand recently served as PayPal's Chief Data Engineer, focusing on ways to realize the value of data. Prior to joining PayPal, he held several positions, including Data Architect @Agari, Technical Lead in Search & Data Analytics @LinkedIn, Cloud Data Architect @Netflix, VP of Engineering @Etsy, and several technical roles @eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. In his spare time, he is a maintainer/committer on Apache Airflow, a co-chair for QCon, and a frequent speaker at conferences. When not working, Sid enjoys spending time with family and friends.

10:35am - 11:25am

Scaling DB Access for Billions of Queries Per Day @PayPal

As microservices scale and proliferate, they place increasing load on databases in terms of connections and resource usage. Hera (High Efficiency Reliable Access to data stores), open-sourced and written in Go, scales thousands of PayPal's applications through connection multiplexing, read-write split, and sharding. This talk covers the approaches taken over the years to handle large growth in application connections and OLTP database utilization. Beyond pure connection and query scaling, Hera adds functionality for manageability: automatic SQL eviction and DBA maintenance controls make it easier to operate hundreds of databases.

Petrica Voicu, Software Engineer @PayPal
Kenneth Kang, Software Engineer @PayPal
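Connection multiplexing, the core of the scaling story above, can be illustrated with a minimal Go sketch (Go being Hera's own implementation language). Everything here — `pool`, `conn`, `serve` — is hypothetical and not Hera's API: the point is only that many logical clients share a small, fixed set of physical connections.

```go
package main

import (
	"fmt"
	"sync"
)

// conn stands in for a physical database connection. In a real proxy,
// holding one of these would mean holding a session to the database;
// here it is just an ID.
type conn struct{ id int }

// pool multiplexes many clients onto a small, fixed set of connections
// by handing them out through a buffered channel.
type pool struct{ conns chan *conn }

func newPool(size int) *pool {
	p := &pool{conns: make(chan *conn, size)}
	for i := 0; i < size; i++ {
		p.conns <- &conn{id: i}
	}
	return p
}

func (p *pool) acquire() *conn  { return <-p.conns } // blocks while all connections are busy
func (p *pool) release(c *conn) { p.conns <- c }

// serve runs `clients` concurrent requests over a pool of `size`
// connections and reports the peak number of connections in use,
// which can never exceed the pool size.
func serve(clients, size int) int {
	p := newPool(size)
	var wg sync.WaitGroup
	var mu sync.Mutex
	inUse, peak := 0, 0
	for i := 0; i < clients; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c := p.acquire()
			mu.Lock()
			inUse++
			if inUse > peak {
				peak = inUse
			}
			mu.Unlock()
			// ... a query would run over c here ...
			mu.Lock()
			inUse--
			mu.Unlock()
			p.release(c)
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	peak := serve(100, 4)
	fmt.Printf("served 100 clients over a pool of 4; peak connections in use: %d\n", peak)
}
```

From the database's point of view there are only ever 4 connections, however many application instances sit in front of it — which is exactly the pressure multiplexing relieves.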

11:50am - 12:40pm

A Dive Into Streams @LinkedIn With Brooklin

Although LinkedIn's data has continued to grow rapidly over the years, scaling up to handle the increasing data volume has not been the only challenge in streaming data in near real time. Supporting the proliferation of new data systems has become another huge endeavor for data streaming infrastructure at LinkedIn. Building separate, specialized solutions to move data across heterogeneous systems is not sustainable: it slows down development and makes the infrastructure unmanageable. This called for a centralized, managed, and extensible solution that can continuously deliver data to nearline applications.

We built Brooklin as a managed data streaming service that supports multiple pluggable sources and destinations, which can be data stores or messaging systems. Since 2016, Brooklin has been running in production as a critical piece of LinkedIn’s streaming infrastructure, supporting a variety of data movement use cases, such as change data capture (CDC) and data propagation between different systems and environments. We have also leveraged Brooklin for mirroring Kafka data, replacing Kafka MirrorMaker at LinkedIn. In this talk, we will dive deeper into Brooklin’s architecture and use cases, as well as our future plans.

Celia Kung, Data Infrastructure @LinkedIn
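The pluggable source/destination idea above can be sketched in a few lines of Go. Brooklin itself is written in Java and its real connector APIs differ; the `Source` and `Destination` interfaces below are purely illustrative. The generic pipeline knows nothing about the concrete systems on either end, which is what makes connectors pluggable.

```go
package main

import "fmt"

// Source and Destination are hypothetical interfaces, not Brooklin's API.
type Source interface {
	Poll() []string // fetch the next batch of records; empty when drained
}

type Destination interface {
	Send(record string)
}

// sliceSource replays an in-memory log, standing in for a CDC feed
// or a Kafka topic.
type sliceSource struct{ records []string }

func (s *sliceSource) Poll() []string {
	batch := s.records
	s.records = nil
	return batch
}

// memDest collects records, standing in for a downstream store or topic.
type memDest struct{ got []string }

func (d *memDest) Send(r string) { d.got = append(d.got, r) }

// stream is the generic pipeline: swap in a different Source or
// Destination and the data-movement logic stays the same.
func stream(src Source, dst Destination) int {
	n := 0
	for batch := src.Poll(); len(batch) > 0; batch = src.Poll() {
		for _, r := range batch {
			dst.Send(r)
			n++
		}
	}
	return n
}

func main() {
	src := &sliceSource{records: []string{"user:1 updated", "user:2 created"}}
	dst := &memDest{}
	fmt.Println("records moved:", stream(src, dst))
}
```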

1:40pm - 2:30pm

CockroachDB: Architecture of a Geo-Distributed SQL Database

In this talk, Cockroach Labs' CTO and co-founder, Peter Mattis, will speak to the architecture of an open-source, geo-distributed SQL database. The talk will be a whirlwind tour of CockroachDB's internals, covering the usage of Raft for consensus, the challenges of data distribution, distributed transactions, distributed SQL execution, and distributed SQL optimizations.

Peter Mattis, CockroachDB maintainer, Co-founder & CTO @CockroachDB
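At the heart of Raft's commit decision is a majority-quorum rule, sketched below as a minimal illustration. This is deliberately simplified: real Raft (and CockroachDB's use of it) also tracks terms and log indices, and only commits entries from the leader's current term.

```go
package main

import "fmt"

// committed reports whether a log entry acknowledged by `acks` of
// `replicas` nodes (leader included) has reached a majority quorum.
// With a majority required, any two quorums overlap in at least one
// node, which is what lets a new leader learn of every committed entry.
func committed(acks, replicas int) bool {
	return acks >= replicas/2+1
}

func main() {
	// 5-way replication tolerates 2 failures: 3 acks commit an entry,
	// 2 do not.
	fmt.Println(committed(3, 5), committed(2, 5))
}
```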

2:55pm - 3:45pm

Data Engineering Open Space

Details to follow.

4:10pm - 5:00pm

Peloton - Uber's Webscale Unified Scheduler on Mesos & Kubernetes

With the increasing scale of Uber's business, efficient use of cluster resources is important to reducing the cost per trip. As we have learned operating Mesos clusters in production, it is challenging to overcommit resources for latency-sensitive services due to the large spread of their resource usage patterns. Uber also has significant demand for running large-scale batch jobs for marketplace intelligence, fraud detection, maps, self-driving vehicles, and more.

In this talk, we will present Peloton, a Unified Resource Scheduler for co-locating heterogeneous workloads in shared Mesos clusters. The goal of Peloton is to manage compute resources more efficiently while providing hierarchical max-min fairness guarantees for different teams. Peloton schedules large-scale batch jobs with millions of tasks and also supports distributed TensorFlow jobs with thousands of GPUs.

Mayank Bansal, Staff Engineer @Uber
Apoorva Jindal, Senior Software Engineer @Uber
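The hierarchical guarantee mentioned above builds on basic max-min fairness, which can be sketched as the classic water-filling allocation below. This is an illustration of the fairness idea only, under the assumption of a single flat pool; `maxMinFair` is a hypothetical name, not Peloton's API, and Peloton applies the principle hierarchically across resource pools.

```go
package main

import (
	"fmt"
	"sort"
)

// maxMinFair splits `capacity` across demands so that no team can gain
// without taking from a team that already has less: small demands are
// fully satisfied, and unsatisfied demands share the remainder equally.
func maxMinFair(capacity float64, demands []float64) []float64 {
	// Visit demands smallest-first without reordering the result.
	idx := make([]int, len(demands))
	for i := range idx {
		idx[i] = i
	}
	sort.Slice(idx, func(a, b int) bool { return demands[idx[a]] < demands[idx[b]] })

	alloc := make([]float64, len(demands))
	remaining := capacity
	for pos, i := range idx {
		share := remaining / float64(len(demands)-pos) // equal split of what's left
		if demands[i] < share {
			share = demands[i] // a small demand is fully satisfied
		}
		alloc[i] = share
		remaining -= share
	}
	return alloc
}

func main() {
	// Capacity 10 across demands {2, 8, 8}: the small demand gets its 2,
	// and the two large demands split the remaining 8 equally.
	fmt.Println(maxMinFair(10, []float64{2, 8, 8})) // [2 4 4]
}
```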

5:25pm - 6:15pm

Datadog: A Real Time Metrics Database for One Quadrillion Points/Day

In the course of its 8 years of existence, Datadog has grown the real-time metrics systems that collect, process, and visualize data to the point where they now handle trillions of points per day. The architecture combines open-source technologies such as Apache Cassandra, Kafka, and PostgreSQL with a lot of in-house software, particularly for in-memory data storage and querying, and especially for efficiently computing distribution metrics. In this talk, Datadog's VP of Metrics and Alerts, Ian Nowland, and Director of Distribution Metrics, Joel Barciauskas, will speak to the challenges we face, how the architecture has evolved to cope with them, and what we are looking to in the future as we architect for a quadrillion points per day.

Ian Nowland, VP Engineering Metrics and Alerting @datadoghq
Joel Barciauskas, Director of Engineering, Distribution Metrics @datadoghq
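Distribution metrics at this scale rely on summarizing a stream of points in constant space rather than storing them. A fixed-boundary histogram, sketched below, is the simplest such structure; it is an illustrative stand-in, not Datadog's implementation (their production sketches use relative-error bucketing for accuracy guarantees).

```go
package main

import (
	"fmt"
	"sort"
)

// histogram approximates a latency distribution with fixed bucket
// boundaries: memory stays O(buckets) no matter how many points arrive.
type histogram struct {
	bounds []float64 // upper bound of each bucket, ascending
	counts []int64   // one extra slot for the overflow bucket
	total  int64
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]int64, len(bounds)+1)}
}

func (h *histogram) add(v float64) {
	i := sort.SearchFloat64s(h.bounds, v) // first bucket whose bound >= v
	h.counts[i]++
	h.total++
}

// quantile returns the upper bound of the bucket containing the q-th
// quantile (nearest-rank), i.e. an answer accurate to bucket width.
func (h *histogram) quantile(q float64) float64 {
	rank := int64(q * float64(h.total-1))
	var seen int64
	for i, c := range h.counts {
		seen += c
		if seen > rank {
			if i < len(h.bounds) {
				return h.bounds[i]
			}
			break // overflow bucket: report the largest known bound
		}
	}
	return h.bounds[len(h.bounds)-1]
}

func main() {
	h := newHistogram([]float64{10, 25, 50, 75, 100}) // ms boundaries
	for i := 1; i <= 100; i++ {
		h.add(float64(i)) // latencies 1..100 ms
	}
	fmt.Println("approx p50:", h.quantile(0.5), "approx p95:", h.quantile(0.95))
}
```

The trade-off is visible in the API: adds are O(log buckets), quantile queries scan a handful of counters, and accuracy is bounded by how the bucket boundaries are chosen — which is exactly the design dimension that more sophisticated distribution sketches optimize.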

