PYICEBERG
This recipe introduces the PyIceberg API.
Before running code examples, refer to the PyIceberg catalog configuration recipe for how to configure catalogs.
This recipe focuses on metadata operations. If you’re interested in reading data, check the DuckDB recipe or Pandas recipe in this cookbook, or jump to the Apache Arrow, Pandas, or DuckDB section of the PyIceberg docs.
Loading a catalog
The primary starting point for working with the PyIceberg API is the load_catalog method, which connects to an Iceberg catalog. To create a catalog instance, pass the catalog’s name from your YAML configuration:
from pyiceberg.catalog import load_catalog
catalog = load_catalog('default')
Or use a different name and pass the configuration as keyword arguments:
from pyiceberg.catalog import load_catalog
catalog = load_catalog('sandbox', **{'uri': 'http://rest-catalog:8181'})
You can check that your catalog is working by running methods to explore its contents:
catalog.list_namespaces()
# [('default',), ('examples',)]
catalog.list_tables('examples')
# [
# ('examples', 'backblaze_drive_stats'),
# ('examples', 'nyc_taxi_locations'),
# ('examples', 'nyc_taxi_yellow')
# ]
Inspecting a table
The most basic use of PyIceberg is to load and inspect a table. Use load_table to load a table from the catalog. Methods such as schema and spec show the table’s schema and partitioning configuration:
table = catalog.load_table('examples.nyc_taxi_yellow')
table.schema() # Returns the schema
table.properties # Returns the table properties
table.location # Returns the location where the table is stored
table.current_snapshot() # Returns the current snapshot of the table
table.spec() # Returns the partition-spec
Updating a table’s schema
The API can also be used to make changes to the table, such as updating properties or evolving the table schema.
Changes to a table follow a common pattern. The table exposes high-level operations through methods like update_schema(). These methods return or yield an object that you use to configure the change. Once you’ve configured the operation, returning from the with block or calling commit() validates the changes and commits them to the catalog.
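The commit-on-exit behavior is just Python’s standard context-manager protocol. A toy stand-in (plain Python, not PyIceberg code) shows the shape of the pattern:

```python
class UpdateSketch:
    """Toy stand-in for an update object: buffer changes, commit when done."""

    def __init__(self):
        self.pending = []
        self.committed = False

    def add_column(self, name):
        self.pending.append(name)

    def commit(self):
        # In PyIceberg, this is where changes are validated and sent to the catalog
        self.committed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the with block without an error triggers the commit
        if exc_type is None:
            self.commit()


with UpdateSketch() as update:
    update.add_column("retries")

print(update.committed)  # True
```

Either style works: configure the update inside a with block, or hold the update object and call commit() explicitly.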
Consider a table with the following schema:
table {
1: datetime: required timestamp
2: symbol: required string
3: bid: optional float
4: ask: optional double
5: details: optional struct<6: created_by: optional string>
}
This example updates the table schema using methods like add_column on the update object passed to a with block.
from pyiceberg.types import DoubleType, IntegerType, StringType

with table.update_schema() as update:
    # Add columns
    update.add_column("retries", IntegerType(), "Number of retries to place the bid")
    # In a struct
    update.add_column("details.confirmed_by", StringType(), "Name of the exchange")

    # Rename columns
    update.rename_column("retries", "num_retries")
    # This will rename `confirmed_by` to `exchange`
    update.rename_column("details.confirmed_by", "exchange")

    # Move columns
    update.move_first("symbol")
    update.move_after("bid", "ask")
    # This will move `created_by` before `exchange`
    update.move_before("details.created_by", "details.exchange")

    # Type promotion, from a float to a double
    update.update_column("bid", field_type=DoubleType())
    # Make a field optional
    update.update_column("symbol", required=False)
    # Update the documentation
    update.update_column("symbol", doc="Name of the share on the exchange")
At the end of the with block, the changes are applied and committed. Other updates in the Iceberg API work the same way.