BASICS
This recipe demonstrates ways to query historical snapshots of Apache Iceberg tables.
Time travel to query historical snapshots in Iceberg tables
Every change to an Iceberg table creates an independent version of the metadata tree, called a snapshot. As they are replaced, historical snapshots aren’t deleted immediately because there could be long-running jobs still reading from them. Snapshots are kept for 5 days by default. To learn how to customize snapshot retention, see the recipe on snapshot expiration.
Before snapshots expire, you can still read them using time travel queries. Time travel queries are supported in Apache Spark and Trino using the SQL clauses FOR VERSION AS OF
and FOR TIMESTAMP AS OF
.
SELECT count(*) FROM examples.nyc_taxi_yellow
-- 2,853,020
SELECT count(*) FROM examples.nyc_taxi_yellow
FOR VERSION AS OF 2188465307835585443
-- 2,798,371
When you use FOR VERSION AS OF
, Iceberg loads the snapshot specified by the snapshot ID. To see what snapshots are available, query the snapshots
metadata table. You can also use this syntax to read from a tag or a branch by name.
SELECT count(*) FROM examples.nyc_taxi_yellow
FOR TIMESTAMP AS OF TIMESTAMP '2022-01-01 00:00:00.000000 Z'
-- 2,798,371
When you use FOR TIMESTAMP AS OF
, Iceberg loads the snapshot that was current at that time, based on the table history. You can inspect table history by querying the history
metadata table.
Time travel to a specific time is convenient, but isn’t guaranteed to use the same snapshot that was read by a given job. Because of clock skew, the timestamp of, say, a log message from one node might be inconsistent with the timestamp another node used to create the commit. When using time travel to reproduce a read, it is best to probe the reader’s log to identify the snapshot ID that was used.