Time Travel Queries

MENU – Apache Iceberg Cookbook

Apache Iceberg Cookbook

Introduction

Getting Started

Basics

Data Engineering

Pyiceberg

Data Operations

Migrating to Iceberg

Time Travel Queries

BASICS

This recipe demonstrates ways to query historical snapshots of Apache Iceberg tables.

Time travel to query historical snapshots in Iceberg tables

Every change to an Iceberg table creates an independent version of the metadata tree, called a snapshot. As they are replaced, historical snapshots aren’t deleted immediately because there could be long-running jobs still reading from them. Snapshots are kept for 5 days by default. To learn how to customize snapshot retention, see the recipe on snapshot expiration.

Before snapshots expire, you can still read them using time travel queries. Time travel queries are supported in Apache Spark and Trino using the SQL clauses FOR VERSION AS OF and FOR TIMESTAMP AS OF.

SELECT count(*) FROM examples.nyc_taxi_yellow
-- 2,853,020
SELECT count(*) FROM examples.nyc_taxi_yellow
FOR VERSION AS OF 2188465307835585443
-- 2,798,371

When you use FOR VERSION AS OF, Iceberg loads the snapshot specified by the snapshot ID. To see what snapshots are available, query the snapshots metadata table. You can also use this syntax to read from a tag or a branch by name.

SELECT count(*) FROM examples.nyc_taxi_yellow
FOR TIMESTAMP AS OF TIMESTAMP '2022-01-01 00:00:00.000000 Z'
-- 2,798,371

When you use FOR TIMESTAMP AS OF, Iceberg loads the snapshot that was current at that time, based on the table history. You can inspect table history by querying the history metadata table.

Time travel to a specific time is convenient, but isn’t guaranteed to use the same snapshot that was read by a given job. Because of clock skew, the timestamp of, say, a log message from one node might be inconsistent with the timestamp another node used to create the commit. When using time travel to reproduce a read, it is best to probe the reader’s log to identify the snapshot ID that was used.