Presentation: MLflow: An Open Platform to Simplify the Machine Learning Lifecycle
This presentation is now available to view on InfoQ.com
What You’ll Learn
- Learn the main stages in the ML lifecycle - data ingest, data preparation, model training, and deployment - and some of the challenges associated with each.
- Hear what MLflow is and where it fits in the ML workflow.
- Find out how MLflow differs from other ML lifecycle solutions.
Abstract
Developing applications that successfully leverage machine learning is difficult. Building and deploying a machine learning model is challenging to do once. Enabling other data scientists (or even yourself, one month later) to reproduce your pipeline, compare the results of different versions, track what’s running where, and redeploy and rollback updated models is much harder.
Corey Zumar offers an overview of MLflow, a new open source project from Databricks that simplifies this process. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment and for managing the deployment of models to production. Moreover, MLflow is designed to be an open, modular platform—you can use it with any existing ML library and incorporate it incrementally into an existing ML development process.
QCon: What is the focus of your work today?
The focus of my work is MLflow: an open source platform for the complete machine learning lifecycle. The MLflow platform provides solutions for data collection, data preparation, model training, and model productionization.
What is the motivation for this talk?
We observe that many businesses and research organizations are trying to leverage machine learning at increasingly large scales, and it's challenging for many of them to implement standardized ML pipelines with limited engineering resources. MLflow provides standardized components for each ML lifecycle stage, easing the development of ML applications.
What would you describe as the persona level of the target audience?
The target audience has basic familiarity with machine learning. We're targeting data scientists and machine learning developers who have experienced or are interested in the challenges of getting up and running with ML, especially when working collaboratively with large teams or attempting to scale up machine learning within an organization.
What do you want this persona to walk away from your talk with?
The audience should walk away with an understanding of the four-stage machine learning lifecycle. They should also understand some of the critical challenges in each of those stages. For example, it is challenging to support the wide variety of ML frameworks used by data scientists to develop models: data scientists may want to train models in TensorFlow and then deploy them to real-time serving platforms like Kubernetes or SageMaker. Alternatively, they might build classical models with tools like scikit-learn and score in batch on Spark. This complex ecosystem of tools motivates the importance of developing powerful abstractions that enable ML developers to move from one lifecycle stage to the next. The big takeaway is the design philosophy for machine learning platforms in a world where ML development is becoming increasingly complex.
There are quite a few other open source offerings that seem to try to solve similar problems - Michelangelo from Uber or FBLearner from Facebook, for example. What would you see as the advantages MLflow has over them?
That's a great question. We've derived a lot of inspiration from Uber's Michelangelo, Google's TFX and Facebook's FBLearner - limited parts of which are open source. These platforms provide standardization for this complex machine learning lifecycle. However, they often restrict the types of ML and data ingest tools, as well as deployment environments, that practitioners can leverage. The defining factor of MLflow is that it is fully open source and is deliberately built around an open, extensible interface design. This design has empowered open source contributors to advance the project; the project has merged code from over 70 third-party developers. Finally, MLflow integrates with the most popular tools that data scientists are leveraging; this makes it very easy to get started with the platform.
What are you offering in terms of scalability in deployment?
We provide a generic model format that represents any model produced by the MLflow platform as a filesystem directory. MLflow also includes utilities for serializing models from popular frameworks in the MLflow format. These MLflow Models can then be deployed to a variety of existing inference tools, such as Microsoft’s Azure ML, Amazon SageMaker, or Kubernetes. MLflow provides deployment APIs for these services, each of which offers its own solutions for scalability and deployment management. At this point, MLflow does not offer its own model serving solution.