BASICS
This recipe shows how to inspect Apache Iceberg table metadata with SQL queries.
Iceberg metadata
Iceberg table metadata is available through special tables called, unsurprisingly, metadata tables. The most commonly used metadata tables show the levels in Iceberg’s metadata tree. These include:
snapshots – known versions of the table, with summary metadata
manifests – manifest files in the current snapshot; these track data and delete files
data_files – data files in the current snapshot
delete_files – delete files in the current snapshot; these store deleted keys or file offsets
You can use these metadata tables to inspect all metadata in an Iceberg table, including partition values, lower bounds, upper bounds, and counts used for scan planning.
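As a sketch of that kind of inspection, the query below selects a few of the scan-planning columns from the data_files metadata table, using the examples.nyc_taxi_yellow table that appears later in this recipe:

```sql
-- Inspect per-file metadata used for scan planning
SELECT
  file_path,
  partition,
  record_count,
  file_size_in_bytes,
  lower_bounds,
  upper_bounds
FROM examples.nyc_taxi_yellow.data_files
```

The lower_bounds and upper_bounds columns are maps from column ID to the serialized bound value, which is what Iceberg uses to skip files during scan planning.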
To query a metadata table in Apache Spark, add the metadata table name as another part in the table identifier. For example, to show the list of known table versions for the nyc_taxi_yellow table, run:
SELECT snapshot_id, committed_at, operation, summary['spark.app.id']
FROM examples.nyc_taxi_yellow.snapshots
Note that you need to include the full namespace because the table identifier has multiple parts.
Another useful metadata table is the history table. It shows a log of when the current table state changed and which snapshot (by ID) became the current version. It also shows whether each historical version is an ancestor of the table's current snapshot (in the is_current_ancestor column) or whether it was rolled back.
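A minimal query against the history table might look like this (again assuming the examples.nyc_taxi_yellow table from this recipe):

```sql
-- Show the log of current-state changes for the table
SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
FROM examples.nyc_taxi_yellow.history
```

A row with is_current_ancestor = false indicates a version that was later rolled back and is no longer part of the current table lineage.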
Metadata tables are just like normal tables. For example, you can run an aggregation on the data_files table to see average file sizes by partition, or you can join the history and snapshots tables to see when the current state changed, along with summary information like which Spark app created each snapshot:
SELECT
h.made_current_at,
s.operation,
h.snapshot_id,
h.is_current_ancestor,
s.summary['spark.app.id']
FROM examples.nyc_taxi_yellow.history h
JOIN examples.nyc_taxi_yellow.snapshots s
ON h.snapshot_id = s.snapshot_id
ORDER BY made_current_at
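The data_files aggregation mentioned earlier, average file size by partition, could be sketched like this:

```sql
-- Average file size per partition, useful for spotting small-file problems
SELECT
  partition,
  COUNT(*) AS file_count,
  AVG(file_size_in_bytes) AS avg_file_size_bytes
FROM examples.nyc_taxi_yellow.data_files
GROUP BY partition
```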
Engines other than Spark may use different naming conventions for metadata tables. For instance, Trino uses $ to add the metadata table to a table name, and the resulting identifier must be quoted: examples."nyc_taxi_yellow$history"
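For example, the history query shown earlier for Spark would look roughly like this in Trino (assuming the same examples schema):

```sql
-- Trino syntax: the metadata table is appended with $ and the name is quoted
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM examples."nyc_taxi_yellow$history"
```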