Run local queries in DuckDB

PYICEBERG

For DuckDB there are currently two paths for Iceberg integration in PyIceberg.

  1. Directly, using the Iceberg extension in DuckDB, or
  2. Within a PyIceberg Python process, where you use DuckDB to process the data.

This recipe demonstrates how to use DuckDB through PyIceberg because that option supports catalogs to make it easy to connect to the data and partition pruning with predicate pushdown.

Reading with PyIceberg

The recommended way to read from an Iceberg table is to connect to an Iceberg catalog. Refer to the PyIceberg configuration recipe to configure the catalogs in your PyIceberg environment.

The first step is to load a catalog.

from pyiceberg.catalog import load_catalog
catalog = load_catalog('sandbox')

Using that catalog, call load_table to get a table instance and then use the scan method to configure what to read from the table. In this example, both row_filter and selected_fields are used to filter down to just the desired rows and project columns. Finally, call to_duckdb to run the scan and register the result as a table in DuckDB.

table = catalog.load_table('examples.nyc_taxi_yellow')

con = table.scan(
    row_filter="pickup_time >= '2021-01-01T00:00:00+00:00' and pickup_time < '2021-02-01T00:00:00+00:00'",
    selected_fields=("pickup_time", "trip_distance", "tip_amount"),
).to_duckdb(table_name="taxi_trips_tips")

PyIceberg makes sure that it only fetches the files that are relevant to the query, minimizing I/O and speeding up the query. By calling to_duckdb, the table gets registered in DuckDB under the given name.

Once PyIceberg returns a connection, you can use it to run DuckDB SQL queries.

con.execute(
    """
        SELECT 
            dayofweek(pickup_time) as day_of_week, 
            AVG(tip_amount / trip_distance) AS tip_per_mile
        FROM taxi_trips_tips
        GROUP BY 1
        ORDER BY 1
    """
).fetchall()

The example query above computes the average tip per mile:

[
    (0, 1.0910165927887745), 
    (1, 1.1429818315208613), 
    (2, 1.1490640264618843), 
    (3, 1.1257297065284793), 
    (4, 1.1384737465986514), 
    (5, 1.1774549535849888), 
    (6, 1.1548896444517271)
]

The average tip on the weekend is higher than throughout the week.