Description
Two trends in data analysis are the ever-increasing size of data sets and the drive for lower-latency results. In this talk, we present Apache Beam, a parallel programming model for implementing batch and streaming data processing jobs that run on a variety of scalable execution engines such as Spark and Dataflow, and its new Python SDK. We discuss some of the interesting challenges in providing a Pythonic API and execution environment for distributed processing, and show how Beam lets the user write a pipeline once in Python and run it in both batch and streaming mode. We walk through a few example data processing pipelines in Beam for use cases such as real-time data analytics and feature engineering with TensorFlow for machine learning pipelines.
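To give a flavor of the Python SDK discussed in the talk, the following is a minimal word-count sketch, assuming apache_beam is installed and the pipeline runs on the local DirectRunner; the input and output paths are hypothetical. The same transforms can be submitted to a different runner, or fed by a streaming source, simply by changing the pipeline options.

```python
import re

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Default options use the local DirectRunner; a different runner
# (e.g. Dataflow or Spark) can be selected via these options.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('input.txt')    # hypothetical input path
     | 'Split' >> beam.FlatMap(lambda line: re.findall(r'\w+', line))
     | 'Count' >> beam.combiners.Count.PerElement()   # (word, count) pairs
     | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
     | 'Write' >> beam.io.WriteToText('counts'))      # hypothetical output prefix
```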