Datasets and machine learning models versioning using open source tools

Description

AI and ML are becoming an essential part of software engineering. Open-source tools like Git, Git-LFS, and MLflow can increase ML teams' productivity by introducing best practices. However, these tools do not cover the management and versioning of large datasets. We will show how to overcome these limitations by using DVC (DVC.org), an open-source project for versioning ML models and datasets.

AI and ML are becoming an essential part of software engineering. The traditional engineering toolset does not fully cover the needs of machine learning teams. These teams need new tools for data versioning, ML pipeline versioning, ML model versioning, experiment metrics tracking, and more.

The ML workflow is data-centric, while the software engineering workflow is centered on source code. We will discuss current practices for organizing ML projects with open-source tools like Git, Git-LFS, and MLflow, as well as their limitations. These limitations motivate the development of new, ML-specific data versioning systems.

Data Version Control, or DVC (DVC.org), is an open-source command-line tool. We will show how to version ML models and multi-gigabyte datasets, how to use your preferred storage (S3, Google Cloud Storage, or a bare-metal server over SSH) as a backend for data files, how to apply engineering best practices to your ML projects, and how to combine different tools in the same project.
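To make the idea of versioning multi-gigabyte files alongside source code more concrete, here is a minimal Python sketch of one common approach: the large file is copied into a content-addressed cache, and only a small pointer file holding its checksum is committed to Git. This is an illustration under stated assumptions, not DVC's actual implementation; the function names, the `.pointer.json` suffix, and the cache layout are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir: Path, checksum: str) -> Path:
    """Hypothetical cache layout: <cache>/<first 2 hex chars>/<remaining chars>."""
    return cache_dir / checksum[:2] / checksum[2:]


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy the data into the cache and write a small pointer file next to it.

    Only the pointer file would be committed to Git; the data itself stays
    in the cache (which could be synced to S3, GCS, or an SSH server).
    """
    checksum = file_md5(data_file)
    cached = cache_path(cache_dir, checksum)
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": checksum, "path": data_file.name}, indent=2))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file that a given pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_path(cache_dir, meta["md5"]), target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "train.csv"

        data.write_text("x,y\n1,2\n")
        pointer = track(data, cache)
        old_pointer_text = pointer.read_text()  # stands in for an old Git commit

        data.write_text("x,y\n3,4\n")
        track(data, cache)  # a new version; the old one remains in the cache

        pointer.write_text(old_pointer_text)  # stands in for `git checkout`
        restore(pointer, cache)
        print(data.read_text())
```

Because the pointer file is a few lines of text, Git can diff, branch, and merge it like any source file, while the heavy data never enters the repository. MD5 is used here only as a content fingerprint, not for security.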
