PYICEBERG
This recipe introduces the PyIceberg API.
Before running code examples, refer to the PyIceberg catalog configuration recipe for how to configure catalogs.
This recipe focuses on metadata operations. If you’re interested in reading data, check the DuckDB recipe or Pandas recipe in this cookbook, or jump to the Apache Arrow, Pandas, or DuckDB section of the PyIceberg docs.
Loading a catalog
The primary starting point for working with the PyIceberg API is the load_catalog method, which connects to an Iceberg catalog. To create a catalog instance, pass the catalog’s name from your YAML configuration:
from pyiceberg.catalog import load_catalog
catalog = load_catalog('default')
Or use a different name and pass the configuration as keyword arguments:
from pyiceberg.catalog import load_catalog
catalog = load_catalog('sandbox', **{'uri': 'http://rest-catalog:8181'})
You can check that your catalog is working by running methods to explore its contents:
catalog.list_namespaces()
# [('default',), ('examples',)]
catalog.list_tables('examples')
# [
# ('examples', 'backblaze_drive_stats'),
# ('examples', 'nyc_taxi_locations'),
# ('examples', 'nyc_taxi_yellow')
# ]
Inspecting a table
The most basic use of PyIceberg is to load and inspect a table. Use load_table to load a table from the catalog. Methods such as schema and spec show the table’s schema and partitioning configuration:
table = catalog.load_table('examples.nyc_taxi_yellow')
table.schema() # Returns the schema
table.properties # Returns the table properties
table.location # Returns the location where the table is stored
table.current_snapshot() # Returns the current snapshot of the table
table.spec() # Returns the partition-spec
Updating a table’s schema
The API can also be used to make changes to the table, such as updating properties or evolving the table schema.
Changes to a table follow a common pattern. The table exposes high-level operations through methods like update_schema(). These methods return or yield an object that you use to configure the change. Once you’ve configured the operation, returning from the with block or calling commit() validates the changes and commits them to the catalog.
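The commit-on-exit behavior is just Python’s standard context-manager protocol. A toy stand-in (plain Python, not PyIceberg code) shows the shape of the pattern:

```python
class UpdateSketch:
    """Toy stand-in for an update object: buffer changes, commit when done."""

    def __init__(self):
        self.pending = []
        self.committed = False

    def add_column(self, name):
        self.pending.append(name)

    def commit(self):
        # In PyIceberg, this is where changes are validated and sent to the catalog
        self.committed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the with block without an error triggers the commit
        if exc_type is None:
            self.commit()


with UpdateSketch() as update:
    update.add_column("retries")

print(update.committed)  # True
```

Either style works: configure the update inside a with block, or hold the update object and call commit() explicitly.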
Consider a table with the following schema:
table {
1: datetime: required timestamp
2: symbol: required string
3: bid: optional float
4: ask: optional double
5: details: optional struct<6: created_by: optional string>
}
This example updates the table schema using methods like add_column on the update object passed to a with block.
from pyiceberg.types import DoubleType, IntegerType, StringType

with table.update_schema() as update:
    # Add columns
    update.add_column("retries", IntegerType(), "Number of retries to place the bid")
    # In a struct
    update.add_column("details.confirmed_by", StringType(), "Name of the exchange")

    # Rename columns
    update.rename_column("retries", "num_retries")
    # This will rename `confirmed_by` to `exchange`
    update.rename_column("details.confirmed_by", "exchange")

    # Move columns
    update.move_first("symbol")
    update.move_after("bid", "ask")
    # This will move `created_by` before `exchange`
    update.move_before("details.created_by", "details.exchange")

    # Type promotion, from a float to a double
    update.update_column("bid", field_type=DoubleType())
    # Make a field optional
    update.update_column("symbol", required=False)
    # Update the documentation
    update.update_column("symbol", doc="Name of the share on the exchange")
At the end of the with block, the changes are applied and committed. Other updates in the Iceberg API work the same way.