Retain and expire snapshots

DATA OPERATIONS

In Apache Iceberg, every change to the data in a table creates a new version, called a snapshot. Iceberg metadata keeps track of multiple snapshots at the same time so that readers using older snapshots have time to complete, to enable incremental consumption of changes, and to support time travel queries.

Of course, keeping all table data indefinitely isn’t practical. Part of basic Iceberg table maintenance is to expire old snapshots to keep table metadata small and avoid high storage costs from data files that aren’t needed. Snapshots accumulate until they are expired by the expireSnapshots operation in the API or by calling stored procedures in Apache Spark or Trino. The best practice is to run snapshot expiration on a daily basis.

This recipe shows how to expire snapshots in a table that are no longer needed.

Snapshot expiration and retention

Expiration is configured with two settings:

  1. Maximum snapshot age: A time window beyond which snapshots are eligible to be discarded.
  2. Minimum number of snapshots to keep: A floor on how many snapshots remain in history. As new snapshots are added, only the oldest ones beyond this count are discarded.

These are set as table properties:

  • history.expire.min-snapshots-to-keep
  • history.expire.max-snapshot-age-ms (in milliseconds)

By default, the maximum snapshot age is 5 days and the minimum number of snapshots to keep is 1, so only the latest snapshot is guaranteed to be retained.

Note: The minimum number of snapshots to keep takes precedence over age-based expiration. A snapshot older than the maximum age is eligible for deletion, but it will be retained if removing it would leave fewer snapshots than the minimum.

There is also a third setting, history.expire.max-ref-age-ms, which tells Iceberg how long to keep branches and tags (refs). Snapshot expiration applies the two retention settings above to the historical snapshots within table branches.
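For example, these properties can be set with a Spark SQL ALTER TABLE statement. This is a minimal sketch using the example table from this recipe; the values (10 snapshots, 5 days, 30 days) are illustrative, not recommendations:

ALTER TABLE examples.nyc_taxi_yellow SET TBLPROPERTIES (
    'history.expire.min-snapshots-to-keep' = '10',
    'history.expire.max-snapshot-age-ms' = '432000000',  -- 5 days
    'history.expire.max-ref-age-ms' = '2592000000'       -- 30 days
)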

Expiring snapshots from Spark

To expire snapshots from Spark, use the expire_snapshots procedure. This will also clean up data and metadata files that are no longer referenced after the expired snapshots are removed.

CALL system.expire_snapshots(table => 'examples.nyc_taxi_yellow')

The Spark procedure will parallelize data cleanup in your Spark cluster. This command applies the retention settings from the table properties, which is the recommended approach.

If needed, you can override the table defaults — for example, to clean up a data spill by removing old versions immediately. In this case, use the older_than argument to specify the desired timestamp before which snapshots will be removed:

CALL system.expire_snapshots(
    table => 'examples.nyc_taxi_yellow',
    older_than => TIMESTAMP '2021-06-30 00:00:00.000')
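When overriding the table settings, you can also pass a retain_last argument to keep a minimum number of recent snapshots regardless of the older_than timestamp, mirroring the min-snapshots-to-keep property. Here is a sketch with an illustrative value of 10; check the Iceberg Spark procedure docs for the arguments supported by your version:

CALL system.expire_snapshots(
    table => 'examples.nyc_taxi_yellow',
    older_than => TIMESTAMP '2021-06-30 00:00:00.000',
    retain_last => 10)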

Expiring snapshots with the Iceberg API

You can also use the Java API to expire snapshots. This method directly calls the same expiration logic as the Spark stored procedure. To use the Java API, you’ll need to create a catalog and use it to load the table. Once you have loaded a table, call the expireSnapshots operation:

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;

// load the table from a previously created catalog
TableIdentifier ident = TableIdentifier.of("examples", "nyc_taxi_yellow");
Table table = catalog.loadTable(ident);

// expire snapshots using the table's retention settings
table.expireSnapshots().commit();

You can also customize the retention period by passing a timestamp to expireOlderThan. Snapshots older than the timestamp will be removed:

// expire snapshots using a custom expiration threshold
long ONE_DAY_MS = 24 * 60 * 60 * 1000L; // one day in milliseconds
table.expireSnapshots()
    .expireOlderThan(System.currentTimeMillis() - ONE_DAY_MS)
    .commit();

For more details, refer to the API docs for ExpireSnapshots.