This recipe shows how to create and manage tags and branches in an Apache Iceberg table.
What are tags and branches?
In Iceberg, every change to table data creates a new snapshot (version) of the table. Iceberg metadata tracks multiple snapshots at the same time. This lets readers finish their work against the snapshot that was current when they started, enables incremental consumption, and allows time travel queries.
Originally, there was just one named reference (ref), the current snapshot. But tags and branches were added to track more snapshots and unlock new use cases. Tags and branches are named references to those snapshots in the table’s metadata. By convention, tags are read-only, but branches can be updated just like the table itself — in fact, the table’s current state now has its own branch, main.
If this sounds suspiciously familiar, you’ve probably used the nearly identical features in git. The snapshots in Iceberg tables are much like the commits in a git repository. And tracking tags and branches works almost exactly the same way. They’re lightweight named references that point to an ID.
In addition, Iceberg also uses a persistent tree structure similar to git that efficiently stores snapshot data and metadata. Only changed files are rewritten to produce a new snapshot. The majority of the existing data and metadata is reused across snapshots to greatly reduce write amplification.
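The ideas in this section can be sketched with a small in-memory model. This is a toy illustration only, not the Iceberg API: the `ToyTable` and `Snapshot` classes, their methods, and the file names are all hypothetical. It shows refs as a plain mapping from name to snapshot ID, a commit moving only the branch it targets, and a new snapshot reusing files from its parent.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    files: FrozenSet[str]  # data files visible in this snapshot


@dataclass
class ToyTable:
    """Hypothetical stand-in for table metadata (not the Iceberg API)."""
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    refs: Dict[str, int] = field(default_factory=dict)  # ref name -> snapshot ID
    next_id: int = 1

    def commit(self, branch: str, added=(), removed=()) -> int:
        """Create a snapshot on top of `branch` and move that branch ref to it."""
        parent_id = self.refs.get(branch)
        parent_files = (
            self.snapshots[parent_id].files if parent_id is not None else frozenset()
        )
        # Unchanged files are carried over by reference, not rewritten.
        files = (parent_files - frozenset(removed)) | frozenset(added)
        snap = Snapshot(self.next_id, parent_id, files)
        self.snapshots[snap.snapshot_id] = snap
        self.refs[branch] = snap.snapshot_id
        self.next_id += 1
        return snap.snapshot_id

    def create_ref(self, name: str, snapshot_id: Optional[int] = None) -> None:
        """Point a new ref at a snapshot; default to the current state of main."""
        self.refs[name] = self.refs["main"] if snapshot_id is None else snapshot_id


table = ToyTable()
table.commit("main", added={"a.parquet", "b.parquet"})
table.create_ref("q3_2023")         # used as a tag: no commit ever moves it
table.create_ref("v2_development")  # used as a branch: commits to it move it
table.commit("v2_development", added={"c.parquet"}, removed={"b.parquet"})

main_files = table.snapshots[table.refs["main"]].files
dev_files = table.snapshots[table.refs["v2_development"]].files
shared = main_files & dev_files  # files reused by both snapshots
```

After the branch commit, `main` and `q3_2023` still point at the first snapshot, while `v2_development` points at the second one, which shares `a.parquet` with its parent.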
Creating tags and branches
You can create a tag or branch using the Iceberg Table API, or using extended table DDL in Spark. Creating a ref without a specific snapshot will point the ref to the table’s current snapshot.
ALTER TABLE logs CREATE BRANCH v2_development
ALTER TABLE logs CREATE TAG q3_2023 AS OF VERSION 9823578123490876
The create command optionally supports OR REPLACE and IF NOT EXISTS. Note that although tags are read-only, you can replace a tag to alter its state (like in git).
You can query tags and branches by passing the reference name through time travel syntax:
SELECT count(*) FROM logs FOR VERSION AS OF 'q3_2023'
Writing to branches
It can be difficult to test pipelines built with MERGE because the command’s behavior depends on both the incoming source data and the current table state. Branches are a great way to solve that problem because they start with a lightweight copy of the table’s data at the time the branch was created.
To write to a branch in Spark, update the table name by adding another identifier, branch_<name>. For example, to write to the v2_development branch, use examples.accounts.branch_v2_development. (Remember that identifiers with more than one part need to include the namespace.)
MERGE INTO examples.accounts.branch_v2_development AS t
-- Filter the source first: a source row that fails the ON condition
-- counts as NOT MATCHED and would be inserted.
USING (SELECT * FROM transfers WHERE ts > ${last_processed_ts}) AS s
ON t.account_id = s.account_id
WHEN MATCHED AND s.amount IS NULL THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET t.balance = t.balance + s.amount
WHEN NOT MATCHED THEN
  INSERT (t.account_id, t.balance) VALUES (s.account_id, s.amount)
Branches also support more comprehensive data engineering patterns, such as write-audit-publish.
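To make the clause order of the MERGE above concrete, here is a minimal Python sketch of the same decision logic applied to a copy of the data. It is a toy model rather than anything Spark or Iceberg runs: `merge_transfers` and the sample accounts are hypothetical. It assumes the source contains only new transfers and at most one transfer per account.

```python
def merge_transfers(accounts, transfers):
    """Apply the MERGE clauses in order and return a new balance mapping.

    accounts:  dict of account_id -> balance (the target)
    transfers: iterable of (account_id, amount) pairs (the source);
               assumed to hold at most one row per account
    """
    result = dict(accounts)  # work on a copy, so the input stays untouched
    for account_id, amount in transfers:
        if account_id in result:
            if amount is None:
                del result[account_id]        # WHEN MATCHED AND amount IS NULL
            else:
                result[account_id] += amount  # WHEN MATCHED
        else:
            result[account_id] = amount       # WHEN NOT MATCHED
    return result


main = {"acct-1": 100.0, "acct-2": 50.0}
new_transfers = [("acct-1", 25.0), ("acct-2", None), ("acct-3", 10.0)]

# Playing the role of the branch: the result can be inspected without changing main.
branch = merge_transfers(main, new_transfers)
```

The outcome depends on both inputs: the same transfers produce a different result against a different starting state, which is why running the pipeline against a branch of the real table is a useful test.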
Retention settings
By default, refs are kept indefinitely. Refs will prevent Iceberg from garbage collecting the data files in the referenced snapshot, so it’s often a good idea to set a ref retention when creating a tag or branch.
ALTER TABLE logs CREATE TAG q3_2023 RETAIN 730 DAYS
When snapshot expiration runs, Iceberg removes a ref once its snapshot is older than the retention age limit, and also cleans up any snapshots that are no longer referenced and can be removed.
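The retention rule can be sketched as a small function. This is a simplified illustration under stated assumptions, not Iceberg's implementation: `retained_refs`, the sample dates, and the snapshot IDs are hypothetical. The sketch assumes that a ref's age is measured from the commit time of the snapshot it points to and that the main branch is never removed.

```python
from datetime import datetime, timedelta, timezone


def retained_refs(refs, snapshot_times, max_ref_age, now):
    """Return the refs that survive expiration.

    refs:           dict of ref name -> snapshot ID
    snapshot_times: dict of snapshot ID -> commit time
    max_ref_age:    dict of ref name -> timedelta; a missing entry means keep indefinitely
    """
    kept = {}
    for name, snapshot_id in refs.items():
        limit = max_ref_age.get(name)
        age = now - snapshot_times[snapshot_id]
        if name == "main" or limit is None or age <= limit:
            kept[name] = snapshot_id
    return kept


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
snapshot_times = {
    1: datetime(2023, 10, 1, tzinfo=timezone.utc),
    2: datetime(2025, 12, 31, tzinfo=timezone.utc),
}
refs = {"main": 2, "q3_2023": 1, "keep_forever": 1}
max_ref_age = {"q3_2023": timedelta(days=730)}  # RETAIN 730 DAYS

kept = retained_refs(refs, snapshot_times, max_ref_age, now)
# Snapshots still reachable directly from a surviving ref.
still_referenced = set(kept.values())
```

Here the q3_2023 tag is dropped because its snapshot is more than 730 days old, but snapshot 1 stays referenced because the hypothetical keep_forever tag has no retention limit.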
Branch history retention settings
Branches can be configured with two additional settings that control how Iceberg retains the snapshots that are ancestors of the branch’s current state: the maximum snapshot age and the minimum number of snapshots to keep.
ALTER TABLE logs CREATE BRANCH v2_development
WITH SNAPSHOT RETENTION 10 SNAPSHOTS 7 DAYS
Both settings default to the table-level retention settings described in the snapshot expiration recipe.
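How the two settings interact can be sketched as follows. This is a simplified model, not Iceberg's code: `snapshots_to_retain` and the sample history are hypothetical. It assumes an ancestor is kept while it is either within the maximum age or needed to reach the minimum count, and that the walk stops at the first ancestor that satisfies neither.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_retain(history, now, max_snapshot_age, min_snapshots_to_keep):
    """Return the snapshot IDs kept on a branch.

    history: list of (snapshot_id, committed_at), ordered from the branch's
             current snapshot back through its ancestors
    """
    retained = []
    for snapshot_id, committed_at in history:
        young_enough = now - committed_at <= max_snapshot_age
        if young_enough or len(retained) < min_snapshots_to_keep:
            retained.append(snapshot_id)
        else:
            break  # older ancestors are eligible for expiration
    return retained


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
# Twelve snapshots, one per day, newest first: IDs 100 down to 89.
history = [(100 - i, now - timedelta(days=i)) for i in range(12)]

# A maximum age of 7 days with a minimum of 10 snapshots.
kept = snapshots_to_retain(history, now, timedelta(days=7), 10)

# A smaller minimum: the age limit decides.
kept_by_age = snapshots_to_retain(history, now, timedelta(days=7), 3)
```

With a minimum of 10, the two snapshots older than 7 days are still kept to reach the count. With a minimum of 3, only the eight snapshots within the age limit remain.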
Removing tags and branches
You can explicitly remove a ref using DROP.
ALTER TABLE logs DROP TAG q3_2023
ALTER TABLE logs DROP BRANCH v2_development