BASICS
This recipe shows how to inspect Apache Iceberg table metadata with SQL queries.
Iceberg metadata
Iceberg table metadata is available through special tables called, unsurprisingly, metadata tables. The most commonly used metadata tables show the levels in Iceberg’s metadata tree. These include:
snapshots – known versions of the table, with summary metadata
manifests – manifest files in the current snapshot; these track data and delete files
data_files – data files in the current snapshot
delete_files – delete files in the current snapshot; these store deleted keys or file offsets
You can use these metadata tables to inspect all metadata in an Iceberg table, including partition values, lower bounds, upper bounds, and counts used for scan planning.
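As a sketch of that kind of inspection, the query below selects a few of the scan-planning columns from the data_files metadata table, using the examples.nyc_taxi_yellow table that appears later in this recipe:

```sql
-- Inspect per-file metadata used for scan planning
SELECT
  file_path,
  partition,
  record_count,
  file_size_in_bytes,
  lower_bounds,
  upper_bounds
FROM examples.nyc_taxi_yellow.data_files
```

The lower_bounds and upper_bounds columns are maps from column ID to the serialized bound value, which is what Iceberg uses to skip files during scan planning.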
To query a metadata table in Apache Spark, add the metadata table name as another part in the table identifier. For example, to show the list of known table versions for the nyc_taxi_yellow table, run:
SELECT snapshot_id, committed_at, operation, summary['spark.app.id']
FROM examples.nyc_taxi_yellow.snapshots
Note that you need to include the full namespace because the table identifier has multiple parts.
Another useful metadata table is the history table. It shows a log of when the current table state changed and which snapshot (by ID) became the current version. It also shows whether each historical version is an ancestor of the table's current snapshot (in the is_current_ancestor column) or whether it was rolled back.
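A minimal query against the history table might look like this (again assuming the examples.nyc_taxi_yellow table from this recipe):

```sql
-- Show the log of current-state changes for the table
SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
FROM examples.nyc_taxi_yellow.history
```

A row with is_current_ancestor = false indicates a version that was later rolled back and is no longer part of the current table lineage.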
Metadata tables are just like normal tables. For example, you can run an aggregation on the data_files table to see average file sizes by partition, or you can join the history and snapshots tables to see when the current state changed, along with summary information like which Spark app created each snapshot:
SELECT
h.made_current_at,
s.operation,
h.snapshot_id,
h.is_current_ancestor,
s.summary['spark.app.id']
FROM examples.nyc_taxi_yellow.history h
JOIN examples.nyc_taxi_yellow.snapshots s
ON h.snapshot_id = s.snapshot_id
ORDER BY made_current_at
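The data_files aggregation mentioned earlier, average file size by partition, could be sketched like this:

```sql
-- Average file size per partition, useful for spotting small-file problems
SELECT
  partition,
  COUNT(*) AS file_count,
  AVG(file_size_in_bytes) AS avg_file_size_bytes
FROM examples.nyc_taxi_yellow.data_files
GROUP BY partition
```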
Engines other than Spark may use different naming conventions for metadata tables. For instance, Trino uses $ to add the metadata table to a table name, and the resulting identifier must be quoted: examples."nyc_taxi_yellow$history"
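For example, the history query shown earlier for Spark would look roughly like this in Trino (assuming the same examples schema):

```sql
-- Trino syntax: the metadata table is appended with $ and the name is quoted
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM examples."nyc_taxi_yellow$history"
```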