Summary
Authors: Bekolay, Trevor, University of Waterloo
Track: Reproducible Science
Every scientist should be able to regenerate the figures in a paper. However, all too often the correct version of a script goes missing, or the original raw data is filtered by hand with the filtering steps left undocumented, or the student who has the data or code has switched labs.
In this talk, I will describe a workflow for a complete end-to-end analysis pipeline, going from raw data to analysis to plotting. The workflow uses existing tools to make each step of the pipeline reproducible, documented, and efficient, while demanding little of a scientist's time and effort.
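As a minimal sketch of that shape (the stage names, file paths, and statistics here are illustrative assumptions, not details from the talk), the pipeline can be written as three decoupled Python functions:

    import numpy as np
    import matplotlib.pyplot as plt

    def process(raw_path):
        """Turn raw data into a clean array; every filtering step is recorded here."""
        raw = np.genfromtxt(raw_path, delimiter=",")
        return raw[~np.isnan(raw).any(axis=1)]  # e.g., drop incomplete rows

    def analyze(data):
        """Compute the quantities that the figures will display."""
        return {"mean": data.mean(axis=0), "std": data.std(axis=0)}

    def plot(results):
        """Render one figure from precomputed results; no analysis happens here."""
        fig, ax = plt.subplots()
        x = np.arange(len(results["mean"]))
        ax.errorbar(x, results["mean"], yerr=results["std"])
        fig.savefig("figure1.pdf")

    if __name__ == "__main__":
        plot(analyze(process("data/raw.csv")))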
The key insight is to decouple the analysis steps from the plotting steps, which makes it possible to run several analyses or plots in parallel. Each costly step can be cached, with the code that produces the cached data serving as the documentation for how it was produced.
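One way to implement this kind of caching (a sketch only; the decorator name and the use of pickle files are my assumptions, not the talk's actual tooling) is a small decorator that stores a step's result on disk:

    import os
    import pickle
    from functools import wraps

    def cached(path):
        """Cache a costly step's return value on disk. The decorated function
        is itself the documentation for how the cached file was produced."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                if os.path.exists(path):
                    with open(path, "rb") as f:
                        return pickle.load(f)
                result = func(*args, **kwargs)
                os.makedirs(os.path.dirname(path), exist_ok=True)
                with open(path, "wb") as f:
                    pickle.dump(result, f)
                return result
            return wrapper
        return decorator

    @cached("cache/analysis.pkl")
    def analyze(data):
        # ... expensive computation here ...
        return {"mean": data.mean(axis=0)}

Deleting the cache file forces the step to rerun, so the cache never obscures how a result was made.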
I will discuss a way to organize code that makes analyzing and plotting large data sets efficient, parallelizable, and cacheable. Once complete, source code can be uploaded to a hosting service like GitHub or Bitbucket, and data can be uploaded to a data store like Amazon S3 or figshare. The end result is that readers can completely regenerate the figures in your paper at little or no cost to you.
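Because decoupled steps share no state, running them in parallel can be as simple as the following sketch (the per-run file names and the choice of concurrent.futures are illustrative assumptions):

    from concurrent.futures import ProcessPoolExecutor

    import numpy as np

    def analyze_file(path):
        """One independent analysis step; it shares no state, so it is safe
        to run in a separate process."""
        data = np.genfromtxt(path, delimiter=",")
        return data.mean(axis=0)

    if __name__ == "__main__":
        # Hypothetical per-run data files; each one is analyzed in parallel.
        paths = ["data/run1.csv", "data/run2.csv", "data/run3.csv"]
        with ProcessPoolExecutor() as executor:
            results = list(executor.map(analyze_file, paths))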