Description
Tabular data is ubiquitous, and pandas has been the de facto tool in Python for analyzing it. However, as data size scales, analysis using pandas may become untenable. Luckily, modern analytical databases (like DuckDB) can analyze this same tabular data, often orders of magnitude faster than pandas and with less memory. Many of these systems provide only a SQL interface, though, which is far different from pandas’ dataframe interface and requires a rewrite of your analysis code.
This is where Ibis comes in. Ibis provides a common dataframe interface to many popular databases and analytics tools (BigQuery, Snowflake, Spark, DuckDB, …). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL (but you can use SQL if you want to!). No more painful rewrites of pandas code when you run into performance issues; write your code once using Ibis and run it on any supported backend.
In this tutorial we’ll cover:
- The basic operations of Ibis (select, filter, group_by, order_by, join, and aggregate), and how these operations may be composed to form more complicated queries.
- How to run the same Ibis queries on a number of different local and remote backend engines.
- The tradeoffs of different database engines, and recommendations for how to choose the best tool for the job.
- How Ibis integrates into the larger Python data ecosystem, including tools like scikit-learn, Matplotlib, PyArrow, pandas, Altair, and VegaFusion.
This is a hands-on tutorial, with numerous examples to get your hands dirty. Participants should ideally have some experience using Python and pandas, but no SQL experience is necessary.
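As a small taste of the basic operations listed above, here is a minimal sketch of a composed Ibis query. It assumes Ibis is installed with its DuckDB backend (for example via `pip install 'ibis-framework[duckdb]'`). The `orders` table, its columns, and the helper names `revenue_by_region` and `run_example` are made up for illustration and are not taken from the tutorial materials.

```python
# Sketch only: assumes Ibis with the DuckDB backend is installed.
# All table, column, and function names here are hypothetical.
try:
    import ibis
    from ibis import _  # deferred reference to "the current table"
except ImportError:
    ibis = None  # lets this file load cleanly where Ibis is unavailable


def revenue_by_region(orders):
    """Compose filter -> group_by -> aggregate -> order_by into one expression.

    Nothing is executed here; the function only builds a query expression.
    """
    return (
        orders.filter(_.amount > 0)  # drop refunds (negative amounts)
        .group_by("region")
        .aggregate(total=_.amount.sum(), n_orders=_.count())
        .order_by("region")
    )


def run_example():
    """Build a tiny in-memory table, run the query, return a pandas DataFrame."""
    if ibis is None:
        return None
    orders = ibis.memtable(
        {
            "region": ["east", "east", "west", "west"],
            "amount": [10.0, 20.0, 5.0, -3.0],
        }
    )
    expr = revenue_by_region(orders)
    # Execution happens only here, on Ibis's default local backend.
    return expr.to_pandas()


if __name__ == "__main__":
    print(run_example())
```

Because `revenue_by_region` only builds an expression, the same function should work unchanged on a table obtained from a connection to another supported backend. If you are curious about the SQL that Ibis generates, `ibis.to_sql(expr)` can be used to inspect it.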