Description
Open source and open science come together when the software is accessible, transparent, and owned by all. For data analysis pipelines that grow beyond a single Jupyter notebook, this becomes a challenge as the number of steps and software dependencies increases. In this talk, Nicholas Del Grosso will review a variety of tools for packaging and managing a data analysis pipeline, showing how they fit together and how they benefit the development, testing, deployment, and publication processes, as well as the scientific community. In particular, this talk will cover:
- Workflow managers (e.g. Snakemake, PyDoit, Luigi) to compose complex, multi-step pipelines into single applications (see the PyDoit sketch after this list).
- Container solutions (e.g. Docker and Singularity) to package the software and deploy it on others' computers, including high-performance computing clusters.
- The Scientific Filesystem (SCIF) to build explorable and multi-purpose applications.
- Testing frameworks (e.g. PyTest, Hypothesis) to declare and confirm the assumptions and functionality of the analysis pipeline (see the property-based test sketch after this list).
- Ease-of-use utilities to share the pipeline online and make it accessible to non-programmers.
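To make the workflow-manager portion concrete, here is a minimal sketch of what a PyDoit pipeline can look like. The script names and file paths (`clean.py`, `summarize.py`, `raw/data.csv`, and so on) are illustrative assumptions, not material from the talk:

```python
# dodo.py -- a minimal, hypothetical PyDoit pipeline (file names are illustrative).
# Running `doit` in this directory executes only the tasks whose
# declared dependencies (file_dep) have changed since the last run.

def task_clean_data():
    """Turn the raw recording into a tidy CSV."""
    return {
        "actions": ["python clean.py raw/data.csv cleaned/data.csv"],
        "file_dep": ["clean.py", "raw/data.csv"],
        "targets": ["cleaned/data.csv"],
    }


def task_summarize():
    """Compute summary statistics from the cleaned data."""
    return {
        "actions": ["python summarize.py cleaned/data.csv results/summary.csv"],
        "file_dep": ["summarize.py", "cleaned/data.csv"],
        "targets": ["results/summary.csv"],
    }
```

Because each task declares its inputs and outputs, the workflow manager can rebuild only what is out of date and document the whole pipeline in one place.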
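In the same spirit, a property-based test written with PyTest and Hypothesis might look like the sketch below. The `zscore` function is a hypothetical stand-in for a real normalization step in a pipeline, not code from the talk:

```python
# test_pipeline.py -- a sketch of a property-based test.
# The zscore function is a stand-in for a real pipeline step.
import numpy as np
from hypothesis import given, strategies as st


def zscore(values):
    """Normalize a sequence to zero mean and unit (population) variance."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.mean()) / arr.std()


# Integer inputs with at least two distinct values keep the arithmetic
# well-conditioned, so the tolerances below are safe.
@given(st.lists(st.integers(-1000, 1000), min_size=2)
         .filter(lambda xs: len(set(xs)) > 1))
def test_zscore_is_centered_and_scaled(values):
    normalized = zscore(values)
    assert abs(normalized.mean()) < 1e-9
    assert abs(normalized.std() - 1.0) < 1e-9
```

Hypothesis generates many input lists automatically, so a test like this records an assumption of the pipeline rather than a single hand-picked case.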
By writing software that stays manageable, reproducible, and deployable continuously throughout the development cycle, we can better fulfill the goals of open science and good scientific practice in a digital era.
A review of DevOps tools as applied to data analysis pipelines, including workflow managers, software containers, testing frameworks, and online repositories, for performing reproducible science that scales.