
Scalable Pipelines with Luigi or: I’ll have the Data Engineering, hold the Java!

Description

In this workshop you will see how (and why) to leverage the PyData ecosystem to build a robust data pipeline. More specifically, you will learn how to use the Luigi framework to integrate multiple stages of a model-building pipeline (collection, processing, vectorization, training of multiple models, and validation), all in Python!
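
For a flavor of what that looks like, here is a minimal sketch of a single-stage Luigi task (the task and file names are hypothetical illustrations, not code from the workshop materials):

```python
import luigi


class FetchData(luigi.Task):
    """Collect raw data and write it to a local file."""

    def output(self):
        # Luigi checks whether this target exists to decide if the
        # task still needs to run, which makes re-runs idempotent.
        return luigi.LocalTarget('raw.txt')

    def run(self):
        # open('w') writes atomically via a temp file, so a failed
        # run never leaves a partial output behind.
        with self.output().open('w') as f:
            f.write('raw records go here\n')


if __name__ == '__main__':
    # Run with the in-process scheduler; no luigid daemon required.
    luigi.build([FetchData()], local_scheduler=True)
```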

As companies scale prototypes and ad hoc analyses into production systems, it is critical to build automated (and repeatable) systems for data collection/processing and model training/evaluation that are fault tolerant enough to adapt to changing constraints. Sustainable software development is often an afterthought for data scientists, especially since the tools for analysis (R, scientific Python, etc.) do not naturally lend themselves to building scalable and extensible software abstractions. But now we can have our cake and eat it too... all with Python!

Outline:
- The basic components of a data pipeline (5 min)
- What and Why Luigi (10 min)
- Lab: The smallest (1-stage) pipeline (15 min)
- Managing dependencies in a pipeline (10 min; see the sketch after this outline)
- Lab: Multi-stage pipeline and introduction to the Luigi Visualizer (15 min)
- Serialization in a Data Pipeline (10 min)
- Lab: Integrating your pipeline with HDFS and Postgres (20 min)
- Scheduling (10 min)
- Lab: Parallelism and recurring jobs with Luigi (20 min)
- Wrap up and next steps (5 min)
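
To make the dependency-management step concrete, here is a minimal sketch of a two-stage, date-parameterized pipeline (again, task and file names are hypothetical, not taken from the workshop repo):

```python
import datetime

import luigi


class Collect(luigi.Task):
    """Stage 1: pull the raw data for a given day."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(self.date.strftime('collected-%Y-%m-%d.txt'))

    def run(self):
        with self.output().open('w') as f:
            f.write('one raw record per line\n')


class Process(luigi.Task):
    """Stage 2: clean the raw data."""
    date = luigi.DateParameter()

    def requires(self):
        # Declaring the upstream task is all Luigi needs: it builds
        # the dependency graph and runs Collect first if its output
        # is missing.
        return Collect(date=self.date)

    def output(self):
        return luigi.LocalTarget(self.date.strftime('processed-%Y-%m-%d.txt'))

    def run(self):
        # self.input() is the output() of the required task.
        with self.input().open('r') as fin, self.output().open('w') as fout:
            for line in fin:
                fout.write(line.upper())


if __name__ == '__main__':
    luigi.build([Process(date=datetime.date(2015, 3, 28))],
                local_scheduler=True)
```

The same structure extends to the later outline topics: swapping luigi.LocalTarget for luigi.contrib.hdfs.HdfsTarget (or subclassing luigi.contrib.postgres.CopyToTable) changes where a stage's output lives, and passing --workers N on the command line, or pointing at a running luigid central scheduler, lets independent tasks execute in parallel.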

Materials available here:
- GitHub repo: https://github.com/Jay-Oh-eN/data-engineering-101
- Slides: http://www.slideshare.net/jonathandinu/presentation-45784222
