Description
One of the ever-present banes of a data scientist’s life is the constant wait for data processing code to finish executing. Slow code affects almost every step of a typical data pipeline: data collection, data pre-processing/parsing, feature engineering, etc. Often, lengthy execution times force data scientists to work with only a subset of the data, depriving them of the insights and performance improvements that a larger dataset could provide. One of the tools that can mitigate this problem and speed up data science pipelines (and CPU-bound programs in general) is parallelization.
Parallelization is a useful way to work around the limitations of the Global Interpreter Lock (GIL), a mechanism in CPython (the reference Python implementation) that allows only one thread to execute Python bytecode at a time. As a result, a multithreaded CPU-bound program cannot make full use of multiple processor cores. In this session, we’ll walk through several ways to parallelize Python code, depending on the specific needs of your program and the type of parallelism you want to achieve.
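As a rough illustration of the idea, the sketch below spreads a CPU-bound task across worker processes using the standard library's concurrent.futures.ProcessPoolExecutor. Each worker process runs its own interpreter with its own GIL, so the tasks can run on separate cores. The function names (count_primes, count_serial, count_parallel) and the workload sizes are hypothetical, chosen only for this example; this is one possible approach, not necessarily one covered in the session.

```python
import math
import time
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit: int) -> int:
    """Count primes below `limit` by trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(2, limit):
        # n is prime if no d in [2, sqrt(n)] divides it evenly
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count


def count_serial(limits):
    """Run every task one after another in the current process."""
    return [count_primes(limit) for limit in limits]


def count_parallel(limits, max_workers=None):
    """Run the tasks in separate worker processes, each with its own GIL."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    # Hypothetical workload: four identical CPU-bound tasks
    limits = [30_000] * 4

    start = time.perf_counter()
    serial = count_serial(limits)
    serial_seconds = time.perf_counter() - start

    start = time.perf_counter()
    parallel = count_parallel(limits)
    parallel_seconds = time.perf_counter() - start

    print(f"serial:   {serial_seconds:.2f}s")
    print(f"parallel: {parallel_seconds:.2f}s")
    print("same results:", serial == parallel)
```

Any speedup depends on the number of available cores and on how expensive each task is relative to the cost of starting processes and passing data between them. For very small tasks, the parallel version can be slower than the serial one.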