Description
Coping with the growing volume and rate of data sources is becoming a big challenge, not only for storing the data efficiently, but also (and especially) for doing more general computations with it. Compressing your data can help in many (and sometimes unexpected) ways with this task. This talk will introduce several ways in which you can benefit from highly efficient compression libraries.
Abstract
Nowadays CPUs are fast and come with more and more cores, but memory speed is not keeping pace. This widening gap is what makes compression a valuable technique: not only for storing the same data in less space, but also for accelerating data handling operations in a growing number of cases.
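As a quick illustration of this trade-off, here is a minimal sketch (assuming the python-blosc package and NumPy are installed; the array contents are made up for the example) that compresses a NumPy array in memory, reports the compression ratio, and round-trips the data:

```python
import numpy as np
import blosc  # python-blosc bindings for the Blosc meta-compressor

# Hypothetical sample data: a large, fairly regular float64 array.
a = np.linspace(0, 100, num=10_000_000)

# Compress the raw buffer; shuffling bytes by typesize usually helps
# numerical data compress much better.
compressed = blosc.compress(a.tobytes(), typesize=a.itemsize,
                            cname='lz4', clevel=5, shuffle=blosc.SHUFFLE)
print(f"compression ratio: {a.nbytes / len(compressed):.1f}x")

# Decompress and check that the round-trip is lossless.
a2 = np.frombuffer(blosc.decompress(compressed), dtype=a.dtype)
assert np.array_equal(a, a2)
```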
My talk will start by introducing the technological reasons behind the increasing benefits of using compression in data science, and will then show some practical cases where data compression leads to much more efficient data pipelines. For this, I will use well-proven compression libraries like Blosc, Zstandard and LZ4, either in combination with data handling libraries (like PyTables, bcolz or zarr) or for handling high-speed data streams (transmitted e.g. via gRPC).
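To give a flavor of the kind of pipeline discussed, the sketch below (assuming zarr and numcodecs are installed; sizes and parameters are illustrative, and the API shown follows the zarr v2 style) creates a chunked array that is transparently compressed with Zstandard via Blosc on every write:

```python
import numpy as np
import zarr
from numcodecs import Blosc

# Illustrative compressor: Zstandard inside the Blosc framework,
# with byte shuffling enabled for numeric data.
compressor = Blosc(cname='zstd', clevel=5, shuffle=Blosc.SHUFFLE)

# A chunked array whose 1000x1000 chunks are compressed on write.
z = zarr.zeros((4000, 4000), chunks=(1000, 1000), dtype='f8',
               compressor=compressor)
z[:] = np.arange(4000 * 4000, dtype='f8').reshape(4000, 4000)

# zarr reports both the logical and the stored (compressed) sizes.
print(z.info)
```

Because compression happens per chunk, computations that touch only part of the array decompress only the chunks they need, which is a large part of why compressed pipelines can outperform uncompressed ones.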