Description
In this talk I will present a new solution to automatically scale Jupyter notebooks to complex and reproducible pipelines based on Kubernetes and Kubeflow.
Nowadays, most High Performance Computing (HPC) tasks are carried out in the Cloud, in industry as much as in research.
The main advantages of adopting Cloud services include (a) constantly up-to-date hardware resources; (b) automated infrastructure setup; (c) simplified resource management. As a result, new solutions have recently been released to the community (e.g. Kubernetes by Google) that provide custom integrations to support the migration of existing Machine/Deep Learning pipelines to the Cloud.
However, a shift towards a fully Cloud-based computational paradigm poses new challenges in terms of data and model reproducibility, privacy, accountability, and (efficient) resource configuration and monitoring. Moreover, adopting these technologies still brings additional work that requires significant software and system engineering expertise (e.g. setting up containerised environments, storage volumes, and cluster nodes).
In this talk, I will present Kale (/ˈkeɪliː/), a new Python solution to ease and support ML workloads for HPC in the Cloud.
Kale leverages the combination of Jupyter notebooks and Kubernetes/Kubeflow Pipelines (KFP) as core components in order to:
- (R1) automate setup and deployment by creating (distributed) computation environments in the Cloud;
- (R2) democratise the execution of machine learning models at scale through instrumented and reusable environments;
- (R3) provide a simple interface (UI and SDK) to enable researchers to deploy ML models without requiring extensive engineering expertise.
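As a rough illustration of the notebook-to-pipeline idea behind these goals, the sketch below groups the code cells of a notebook into named steps and orders them by dependency. It is not Kale's implementation or API: the `step:<name>` and `prev:<name>` cell tags, the `notebook_to_steps` helper, and the toy notebook are all hypothetical, and only the Python standard library is used.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A toy notebook in the shape of the Jupyter JSON format (cells with a type,
# source, and metadata tags). The "step:" and "prev:" tags are assumptions
# made for this sketch, not a documented convention.
notebook = {
    "cells": [
        {"cell_type": "code", "source": "data = load()",
         "metadata": {"tags": ["step:load"]}},
        {"cell_type": "code", "source": "data = clean(data)",
         "metadata": {"tags": []}},
        {"cell_type": "markdown", "source": "## Training", "metadata": {}},
        {"cell_type": "code", "source": "model = fit(data)",
         "metadata": {"tags": ["step:train", "prev:load"]}},
        {"cell_type": "code", "source": "score = evaluate(model, data)",
         "metadata": {"tags": ["step:evaluate", "prev:train"]}},
    ]
}


def notebook_to_steps(nb):
    """Group code cells into named steps and collect step dependencies."""
    sources = {}  # step name -> list of cell sources belonging to it
    deps = {}     # step name -> set of upstream step names
    current = None
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue  # markdown cells carry no computation
        tags = cell.get("metadata", {}).get("tags", [])
        # A "step:" tag opens a new step; untagged cells join the current one.
        for tag in tags:
            if tag.startswith("step:"):
                current = tag.split(":", 1)[1]
                sources.setdefault(current, [])
                deps.setdefault(current, set())
        if current is None:
            continue  # code before the first tagged step is skipped
        for tag in tags:
            if tag.startswith("prev:"):
                deps[current].add(tag.split(":", 1)[1])
        sources[current].append(cell["source"])
    return sources, deps


sources, deps = notebook_to_steps(notebook)
# Execution order that respects the declared dependencies.
order = list(TopologicalSorter(deps).static_order())
for name in order:
    print(name, "<-", sorted(deps[name]))
```

In a real deployment each step would be packaged as a containerised pipeline component and submitted to the cluster; the sketch stops at the ordered step list, which is the part that can be shown without any Cloud infrastructure.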
Technical features of Kale, as well as open challenges and future developments, will be presented, along with working examples that integrate Kale into complete ML/DL workflows for pipeline reproducibility.
Domains:
- Jupyter
- Machine Learning
- DevOps
- Parallel Computing/HPC