Description
Dask Array implements the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays. This lets us compute on arrays larger than memory using all of our cores.
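A minimal sketch of what this looks like in practice (the shapes and chunk sizes below are arbitrary illustrative choices):

    import dask.array as da

    # Build a 10000x10000 array of ones, split into 1000x1000 blocks.
    # Nothing is computed yet; operations only build a task graph.
    x = da.ones((10000, 10000), chunks=(1000, 1000))

    # Familiar NumPy-style expressions compose lazily, block by block.
    result = (x + x.T).mean()

    # .compute() executes the graph, streaming blocks through memory
    # across multiple cores.
    print(result.compute())  # 2.0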
We describe dask, dask.array, and dask.dataframe, as well as task scheduling generally.
NumPy and Pandas provide excellent in-memory containers and computation for the Scientific Python ecosystem. As we extend to larger-than-memory datasets, these containers fail, leaving scientists with less productive options that mesh less well with the existing ecosystem.
A common solution to this problem is blocked algorithms and task scheduling. Blocked algorithms define macro-scale operations on the full dataset as a network of smaller operations on in-memory blocks of the dataset. Task scheduling allows many parallel workers to execute these tasks in a way consistent with their data dependencies.
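To make this concrete, here is a hedged sketch of the dask graph representation: a graph is a plain Python dict mapping keys to tasks, where a task is a tuple of a callable followed by its arguments, and arguments may name other keys, which express the data dependencies.

    from operator import add, mul
    import dask

    # Keys name results; tuples are tasks; string arguments that match
    # other keys are data dependencies.
    dsk = {
        'a': 1,
        'b': 2,
        'c': (add, 'a', 'b'),   # c depends on a and b
        'd': (mul, 'c', 10),    # d depends on c
    }

    # dask.get is the simple synchronous scheduler; the threaded and
    # multiprocessing schedulers execute the same graph in parallel.
    print(dask.get(dsk, 'd'))  # 30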
We introduce dask, a task scheduling specification, and dask.array, a high-level abstraction that implements a large subset of the NumPy API with blocked algorithms. In many cases dask.array provides a drop-in replacement for NumPy on out-of-core datasets with parallel execution. We discuss the design choices behind dask, dask.array, and related projects, and show their performance quantitatively with benchmarks and their usability by demonstrating integration into the larger ecosystem.
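As a sketch of that drop-in pattern (the file name and dataset path here are hypothetical), an on-disk dataset can be wrapped and then treated like a NumPy array:

    import h5py
    import dask.array as da

    # Hypothetical HDF5 file and dataset, for illustration only.
    f = h5py.File('myfile.hdf5', mode='r')
    d = f['/data/temperature']                 # on-disk, larger than RAM

    # Wrap the dataset in a dask array of 1000x1000 blocks.
    x = da.from_array(d, chunks=(1000, 1000))

    # NumPy-style code now runs out-of-core and in parallel.
    result = x[::2, :].mean(axis=0).compute()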
Slides available here: https://github.com/ContinuumIO/dask-tutorial