Automating machine learning workflow with DVC


What data scientists and ML engineers want to do while software engineers are busy with CI/CD.

Just as software engineers set up a CI/CD process as soon as they start a new project, data scientists and ML engineers define a pipeline for data as it flows through a typical workflow. Each step of the pipeline is fed data processed by the preceding step, much as a CI/CD process is triggered by code changes.

The phrase "pipelining an ML project" can be misleading: it suggests a large project in which a group of engineers works on large systems, so it is often considered too hard for an individual and unnecessary for a small project. Regardless of project size, well-organized pipelines are essential for an ML project to succeed, and with a proper tool they are easy to build.

In this talk, we will go through a machine learning workflow divided into a few steps that compose an ML pipeline, from data ingestion to model deployment. Each step depends on data produced by the previous step, and that data is controlled by DVC. DVC is an open-source version control system for data scientists and ML engineers that helps them organize the data, models, and experiments of their ML projects. The presentation will not only introduce how to use the tool but also show, with examples, how to organize an ML pipeline.
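To make the idea of steps that depend on the previous step's data concrete, here is a minimal Python sketch. It is not DVC's implementation or API. The names `Stage`, `run_pipeline`, and `file_hash`, and the toy `prepare` and `train` steps, are all hypothetical and chosen only for illustration. The sketch assumes a stage needs to re-run only when the content hash of one of its input files has changed, which is broadly the kind of bookkeeping a pipeline tool takes off your hands.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


class Stage:
    """Hypothetical pipeline stage: reads dependency files, writes one output."""

    def __init__(self, name, deps, out, func):
        self.name = name
        self.deps = deps  # list of input Paths
        self.out = out    # single output Path
        self.func = func  # callable(deps, out) that produces the output

    def run(self, lock: dict) -> bool:
        """Run the stage only if its inputs changed; return True if it ran."""
        current = {str(dep): file_hash(dep) for dep in self.deps}
        if lock.get(self.name) == current and self.out.exists():
            return False  # inputs unchanged and output present: skip
        self.func(self.deps, self.out)
        lock[self.name] = current  # remember the input hashes we ran with
        return True


def prepare(deps, out):
    """Toy 'data preparation' step: normalize the raw text."""
    out.write_text(deps[0].read_text().strip().lower())


def train(deps, out):
    """Toy 'training' step: count words and save the counts as a 'model'."""
    counts = {}
    for word in deps[0].read_text().split():
        counts[word] = counts.get(word, 0) + 1
    out.write_text(json.dumps(counts, sort_keys=True))


def run_pipeline(stages, lock):
    """Run stages in order; return the names of the stages that executed."""
    return [stage.name for stage in stages if stage.run(lock)]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        raw = root / "raw.txt"
        clean = root / "clean.txt"
        model = root / "model.json"
        raw.write_text("Spam spam EGGS\n")

        stages = [
            Stage("prepare", [raw], clean, prepare),
            Stage("train", [clean], model, train),
        ]
        lock = {}
        first = run_pipeline(stages, lock)   # nothing cached yet
        second = run_pipeline(stages, lock)  # inputs unchanged
        raw.write_text("Spam eggs eggs\n")   # change the raw data
        third = run_pipeline(stages, lock)   # change propagates downstream
        return first, second, third, json.loads(model.read_text())


if __name__ == "__main__":
    print(demo())
```

In this sketch the second run skips both stages because no input changed, while editing the raw data causes the preparation step, and then the training step that consumes its output, to run again.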

The goal of this talk is to motivate data scientists and ML engineers to start building machine learning pipelines with DVC. The audience can expect a guide to using DVC to automate the pipeline. I will also explain the machine learning concepts and techniques needed to understand the pipeline.

This session is designed to be accessible to everyone at the beginner level. An understanding of the basic concepts of machine learning and of version control systems (preferably Git) may be helpful but is not required.
