PYICEBERG
For DuckDB there are currently two paths for Iceberg integration in PyIceberg.
- Directly, using the Iceberg extension in DuckDB, or
- Within a PyIceberg Python process, where you use DuckDB to process the data.
This recipe demonstrates how to use DuckDB through PyIceberg because that option supports catalogs to make it easy to connect to the data and partition pruning with predicate pushdown.
Reading with PyIceberg
The recommended way to read from an Iceberg table is to connect to an Iceberg catalog. Refer to the PyIceberg configuration recipe to configure the catalogs in your PyIceberg environment.
The first step is to load a catalog.
from pyiceberg.catalog import load_catalog
catalog = load_catalog('sandbox')
Using that catalog, call load_table
to get a table instance and then use the scan
method to configure what to read from the table. In this example, both row_filter
and selected_fields
are used to filter down to just the desired rows and project columns. Finally, call to_duckdb
to run the scan and register the result as a table in DuckDB.
table = catalog.load_table('examples.nyc_taxi_yellow')
con = table.scan(
row_filter="pickup_time >= '2021-01-01T00:00:00+00:00' and pickup_time < '2021-02-01T00:00:00+00:00'",
selected_fields=("pickup_time", "trip_distance", "tip_amount"),
).to_duckdb(table_name="taxi_trips_tips")
PyIceberg makes sure that it only fetches the files that are relevant to the query, minimizing I/O and speeding up the query. By calling to_duckdb
, the table gets registered in DuckDB under the given name.
Once PyIceberg returns a connection, you can use it to run DuckDB SQL queries.
con.execute(
"""
SELECT
dayofweek(pickup_time) as day_of_week,
AVG(tip_amount / trip_distance) AS tip_per_mile
FROM taxi_trips_tips
GROUP BY 1
ORDER BY 1
"""
).fetchall()
The example query above computes the average tip per mile:
[
(0, 1.0910165927887745),
(1, 1.1429818315208613),
(2, 1.1490640264618843),
(3, 1.1257297065284793),
(4, 1.1384737465986514),
(5, 1.1774549535849888),
(6, 1.1548896444517271)
]
The average tip on the weekend is higher than throughout the week.