PyIceberg Writes

Write support was added to PyIceberg in the 0.6.0 release. Because it is new, there may be rough edges. If you run into a problem, check the online documentation, and feel free to raise an issue.

In this recipe, you learn how to write to an Apache Iceberg table from Python. Iceberg write support uses Apache Arrow, which is a widely used in-memory data format for Python and other languages.

PyIceberg supports both appending to an Iceberg table and overwriting the existing table data. Both options accept an Arrow “table”, which is Arrow’s term for a dataframe.

The first step is to load a catalog and then create or load the table you want to write into. For more background on catalogs and how to configure them, see the Python Configuration recipe.

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog('default')

df = pa.Table.from_pylist(
    [
        {"lat": 52.371807, "long": 4.896029},
        {"lat": 52.387386, "long": 4.646219},
        {"lat": 52.078663, "long": 4.288788},
    ],
)

Appending to a new table

If you are creating a table, there’s no need to convert the data’s schema to its Iceberg equivalent. You can pass an Arrow schema and PyIceberg will handle the conversion automatically. Just be aware that the converted schema may be stricter than you need. For example, Arrow may not mark fields as optional when the data contains no null values, and that could cause later writes that do include nulls to fail.

# create a new table
table = catalog.create_table(
    'default.coordinates',
    schema=df.schema
)

To write to your table, pass the dataframe to append:

table.append(df)
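To sanity-check the write, you can read the rows back as an Arrow table. This is just a quick verification sketch; a full scan like this is only practical for small tables.

# Read the table back into Arrow to confirm the rows landed
result = table.scan().to_arrow()
print(result.num_rows)  # expect 3 rows after the append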

Overwriting data in an existing table

To work with a table that already exists, you can get a table object by calling load_table.

table = catalog.load_table('default.coordinates')

For existing tables, there are two write options: you can append new data using append, as in the example above, or use overwrite to replace the table’s contents with the records from a dataframe.

# Careful, this will replace the table's data
table.overwrite(df)
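Every append or overwrite commits a new snapshot to the table. If you want to confirm what a write did, a quick sketch is to look at the snapshot metadata, assuming the current_snapshot and history accessors on the table object:

# The latest snapshot summarizes the most recent operation
print(table.current_snapshot().summary)

# The snapshot log has one entry per commit
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)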

Limitations

The first PyIceberg release with write support is 0.6.0, and in that version write support is limited: you can only write to unpartitioned tables, and you can only append or overwrite as shown above.

Schema Evolution

In Iceberg, the table schema is a contract with consumers so schema evolution is an explicit operation. Writes do not automatically update types or add columns — those changes might break downstream consumers. Instead, there is a strict check on the schema that will fail if you try to write something incompatible.

Here is an example that will raise an exception.

df = pa.Table.from_pylist(
    [
        {"x": 52.371807, "y": 4.896029},
        {"x": 52.387386, "y": 4.646219},
        {"x": 52.078663, "y": 4.288788},
    ],
)

table.overwrite(df)

The new data does not match the schema of the table:
Got:
x: double
y: double
Table:
lat: double
long: double

If you encounter this, you need to align the Arrow table to the Iceberg table’s schema, or update the table schema. The PyIceberg API recipe includes examples of schema evolution.
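One way to align the data is to rename the Arrow columns to match the table before writing. This sketch uses pyarrow’s rename_columns and assumes the columns are already in the same order as the table schema:

# Rename x/y to the lat/long names the table expects
aligned = df.rename_columns(["lat", "long"])
table.overwrite(aligned)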

To update the table schema to match an Arrow dataframe, PyIceberg supports union_by_name, which merges the incoming schema into the table schema by field name.

with table.update_schema() as update:
    update.union_by_name(df.schema)

If the schemas cannot be merged, the above operation will raise an exception.
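Putting it together, here is a sketch that evolves the table schema to accept a new column and then appends the data. The altitude column is hypothetical and stands in for whatever field you are adding:

# A dataframe with the existing columns plus a new altitude column
df = pa.Table.from_pylist(
    [
        {"lat": 52.371807, "long": 4.896029, "altitude": -2.0},
        {"lat": 52.387386, "long": 4.646219, "altitude": 1.5},
    ],
)

# Merge the new column into the table schema, then write
with table.update_schema() as update:
    update.union_by_name(df.schema)

table.append(df)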