Description
We at Blue Yonder use Pandas quite a lot during our daily data science and engineering work. This choice, together with Python as the underlying programming language, gives us flexibility, a feature-rich interface, and access to a large community and ecosystem. When it comes to preserving the data and exchanging it with different software stacks, we rely on Parquet Datasets / Hive Tables. During the write process, there is a shift from a rather weakly typed world to a strongly typed one. For example, Pandas may convert integers to floats in many operations without asking, but Parquet files and the schema information stored alongside them dictate very precise types. The type situation may get even more "colorful" when datasets are written by multiple code versions or by different software solutions over time. This raises important questions regarding type compatibility.
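The silent integer-to-float conversion mentioned above can be reproduced in a few lines. This is only a minimal sketch with made-up data, not code from our pipelines: the variable names are hypothetical, and the integer dtype is set explicitly so the result does not depend on the platform default.

```python
import pandas as pd

# Hypothetical example data: a column of whole numbers, explicitly int64.
s = pd.Series([1, 2, 3], dtype="int64")
dtype_before = s.dtype

# Reindexing adds a label (3) that has no value. Pandas fills the gap
# with NaN, and because NaN is a float, the whole column becomes float64.
s_reindexed = s.reindex([0, 1, 2, 3])
dtype_after = s_reindexed.dtype

print(dtype_before, "->", dtype_after)
```

Nothing in this snippet asks for a float column, yet a writer that derives the Parquet schema from the DataFrame would now record a floating-point type where an earlier version of the same dataset may have recorded an integer.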
This talk will first present an overview of types at different layers (like NumPy, Pandas, Arrow and Parquet) and the transitions between these layers. The second part of the talk will present examples of type compatibility questions we have encountered, and why and how we think they should be handled. At the end there will be a Q&A, which can be seen as the start of a potentially longer RFC process to align different software stacks (like Hive and Dask) so that they handle types in a similar way.