Description
PyData SF 2016
Airflow is a pipeline orchestration tool for Python that allows users to configure multi-system workflows that are executed in parallel across workers. I’ll cover the basics of Airflow so you can start your Airflow journey on the right foot. This talk aims to answer questions such as: What is Airflow useful for? How do I get started? What do I need to know that’s not in the docs?
Airflow is a popular pipeline orchestration tool for Python that allows users to configure complex (or simple!) multi-system workflows that are executed in parallel across any number of workers. A single pipeline might contain bash, Python, and SQL operations. With dependencies specified between tasks, Airflow knows which ones it can run in parallel and which ones must run after others. Airflow is written in Python, and users can add their own operators with custom functionality: an operator can do anything Python can do.
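To make that concrete, here is a minimal sketch of an Airflow DAG that mixes a bash step and a Python step and wires up their dependencies. The task names, commands, and schedule are illustrative placeholders, and import paths vary slightly between Airflow versions:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator

    def transform():
        # Placeholder for a Python transformation step.
        print("transforming data")

    dag = DAG(
        dag_id="example_pipeline",
        start_date=datetime(2016, 1, 1),
        schedule_interval="@daily",
    )

    # A bash step and a Python step in the same pipeline.
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pulling raw data'",
        dag=dag,
    )

    transform_task = PythonOperator(
        task_id="transform",
        python_callable=transform,
        dag=dag,
    )

    load = BashOperator(
        task_id="load",
        bash_command="echo 'loading results'",
        dag=dag,
    )

    # Dependencies: extract runs before transform, transform before load.
    extract.set_downstream(transform_task)
    transform_task.set_downstream(load)

Because only the dependencies are declared, the scheduler is free to run any tasks whose upstream dependencies are complete in parallel across workers.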
Moving data through transformations and from one place to another is a big part of data science/engineering, but there are only two widely used orchestration systems for doing so that are written in Python: Luigi and Airflow. We’ve been using Airflow (http://pythonhosted.org/airflow/) for several months at Clover Health and have learned a lot about its strengths and weaknesses. We use it to run several pipelines multiple times per day. One includes over 450 heavily linked tasks!