
Scaling your data workloads with pandas and PySpark

Description

While pandas is widely used for data preprocessing and analysis tasks, it is not designed for large-scale data processing. This leaves data analysts with a dilemma: whether to downsample the data and lose information, or to scale out the data workload using a distributed processing framework.

PySpark is one of the representative distributed processing tools for such cases. However, to use it, data analysts have to learn a new framework, PySpark, from scratch. To address this, Apache Spark provides the pandas API on Spark: existing pandas users can simply replace their pandas import with pyspark.pandas to distribute their existing workloads.
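
As a minimal sketch of that switch, the example below swaps the pandas import for pyspark.pandas and runs a familiar groupby aggregation; the CSV path and column names are hypothetical placeholders.

```python
# Sketch: migrating a pandas workload to the pandas API on Spark.
# The file path and column names ("region", "revenue") are illustrative.
import pyspark.pandas as ps  # instead of: import pandas as pd

# Reads the file into a distributed DataFrame that mirrors the pandas API.
pdf = ps.read_csv("/data/sales.csv")

# Familiar pandas-style operations now run on Spark under the hood.
summary = pdf.groupby("region")["revenue"].sum().sort_values(ascending=False)
print(summary.head(10))
```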

Alternatively, you can write your own user-defined functions (UDFs) for logic that the existing PySpark API does not cover. Pandas Function APIs, introduced in Spark 3.0, let users apply arbitrary native Python functions to PySpark DataFrames, with the inputs and outputs handled as pandas instances. This allows data analysts to train ML models on each group of data using the pandas functions they already know, as the sketch below illustrates.
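
The following is a hedged sketch of the grouped Pandas Function API (applyInPandas): it fits one model per group and returns each group's coefficient. The toy data, column names, and the use of scikit-learn are illustrative assumptions, not part of the talk.

```python
# Sketch: per-group model training with applyInPandas (Spark 3.0+).
import pandas as pd
from sklearn.linear_model import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data with one feature "x" and target "y" per group.
df = spark.createDataFrame(
    [("a", 1.0, 2.0), ("a", 2.0, 4.1), ("b", 1.0, 3.0), ("b", 2.0, 5.9)],
    ["group", "x", "y"],
)

def fit_model(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf is a plain pandas DataFrame holding one group's rows.
    model = LinearRegression().fit(pdf[["x"]], pdf["y"])
    return pd.DataFrame(
        {"group": [pdf["group"].iloc[0]], "coef": [float(model.coef_[0])]}
    )

# Each group is handed to fit_model as a pandas DataFrame; the outputs are
# combined into a single Spark DataFrame with the declared schema.
result = df.groupBy("group").applyInPandas(
    fit_model, schema="group string, coef double"
)
result.show()
```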

In this session, we will cover how to perform distributed processing from both of the perspectives above: that of pandas users and that of PySpark users.

Hyukjin Kwon is a Staff Software Engineer at Databricks, the tech lead of the open source PySpark team, and an Apache Spark PMC member and committer. He works on various areas of Apache Spark such as PySpark, Spark SQL, SparkR, and infrastructure, and has made the most commits in Apache Spark. He also leads projects such as Project Zen, the pandas API on Spark, and Python Spark Connect.
