My talk at Subsurface this year covered my favorite Iceberg developments of the past year. The one I’m most excited about is the addition of tags and branches to the Iceberg spec.
What are tags and branches? They are named references to versions of the data in an Iceberg table — which Iceberg calls “snapshots”. Each snapshot is the result of a change to the state of a table. Before, there was just one reference, the current snapshot, but tags and branches were added to track more snapshots and unlock new use cases.
If this sounds suspiciously familiar, you’ve probably used the nearly identical features in git. The snapshots in Iceberg tables are much like the commits in a git repository, and tracking tags and branches works almost exactly the same way. They’re lightweight named references that point to an ID; tags can’t be changed, but branches allow new commits to update the reference.
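To make the analogy concrete, here is a minimal sketch of the reference model in plain Python. It is a toy, not the Iceberg API: the `Table` class and its `commit` and `create_tag` methods are hypothetical names invented for illustration, and snapshot IDs are just counters.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy model of named references; hypothetical, not the Iceberg API."""
    snapshots: list = field(default_factory=list)  # snapshot IDs in commit order
    branches: dict = field(default_factory=dict)   # branch name -> snapshot ID
    tags: dict = field(default_factory=dict)       # tag name -> snapshot ID

    def commit(self, branch="main"):
        """Create a new snapshot and move the branch reference to it."""
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append(snapshot_id)
        self.branches[branch] = snapshot_id
        return snapshot_id

    def create_tag(self, name, snapshot_id):
        """Attach a fixed name to a snapshot; in this model a tag never moves."""
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = snapshot_id


table = Table()
table.commit()                                       # snapshot 1
table.create_tag("q4_2022", table.branches["main"])  # tag points at snapshot 1
table.commit()                                       # snapshot 2: main moves, the tag stays
```

After the second commit, `main` points at the newest snapshot while `q4_2022` still points at the one it was created on, which is the whole difference between the two kinds of reference.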
Iceberg departs from the git model in that its snapshots expire and are removed to allow cleaning up the large underlying data files. But what if you don’t want a particularly important version to expire? Tagging to the rescue! The simplest use is tagging a version to keep it around longer. For example, if you want to keep the data used for Q4 earnings reports for 2 years, that’s easy:
ALTER TABLE accounts CREATE TAG q4_2022 RETAIN 730 DAYS
Tags are great for labeling important versions for reproducible results, like tagging the version of a training set used to build a particular model version.
Tags are helpful, but branches enable new use cases. A common Iceberg pattern is to integrate audits into a pipeline so that data isn’t published to downstream consumers until it has been validated. It’s much better to prevent bad data from leaking than to clean up a mess.
Branches improve integrated audits by making it much easier to work with staged changes. Branches can be written to multiple times before fast-forwarding the current table state, and it’s much easier to audit versions that can be referenced by name with time-travel syntax:
SELECT ... FROM target VERSION AS OF 'branch_name'
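The publish step can be sketched as a fast-forward over a toy snapshot graph. This is an illustration in plain Python, not Iceberg code: `fast_forward`, the `refs` and `parents` dictionaries, and the `audit_passed` flag are all hypothetical stand-ins.

```python
def fast_forward(refs, parents, target, source):
    """Point `target` at `source`'s snapshot, but only if no history would be lost."""
    snapshot = refs[source]
    # Walk back from the source snapshot looking for the target's snapshot.
    while snapshot is not None:
        if snapshot == refs[target]:
            refs[target] = refs[source]
            return refs[target]
        snapshot = parents.get(snapshot)
    raise ValueError(f"{target!r} is not an ancestor of {source!r}")


# Snapshot 1 is published; snapshots 2 and 3 were staged on the audit branch.
parents = {1: None, 2: 1, 3: 2}  # snapshot ID -> parent snapshot ID
refs = {"main": 1, "audit": 3}   # reference name -> snapshot ID

audit_passed = True              # stand-in for real validation queries
if audit_passed:
    fast_forward(refs, parents, "main", "audit")
```

Because the move is refused unless the published snapshot is an ancestor of the staged one, consumers reading `main` see either the old state or the fully audited new state, never something in between.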
Branches also make testing easier because a “test” branch will include all of the table’s older data. It’s easier to validate changes when downstream jobs can be run directly rather than trying to reconstruct partial results from a test table.
Like tags, branches have a retention option, so a branch that is no longer receiving commits can be removed automatically once it ages out.
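Retention for both tags and branches can be pictured with a small sketch. It assumes a reference's age is measured from the timestamp of the snapshot it points to and that the main branch is never dropped. The `expire_refs` function and the dictionary layout are hypothetical, and the dates are made up for illustration.

```python
from datetime import datetime, timedelta


def expire_refs(refs, now):
    """Return only the references that are still inside their retention window."""
    kept = {}
    for name, ref in refs.items():
        age = now - ref["snapshot_time"]
        # The main branch is always kept; other refs are dropped once too old.
        if name == "main" or age <= ref["retain"]:
            kept[name] = ref
    return kept


now = datetime(2023, 3, 1)
refs = {
    "main": {"snapshot_time": datetime(2023, 2, 28), "retain": None},
    "q4_2022": {"snapshot_time": datetime(2023, 1, 5), "retain": timedelta(days=730)},
    "test": {"snapshot_time": datetime(2023, 1, 10), "retain": timedelta(days=7)},
}
remaining = expire_refs(refs, now)
```

Here the long-lived `q4_2022` tag survives, while the `test` branch, whose latest snapshot is well past its 7-day window, is dropped.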
The DDL for creating tags and branches in Spark is landing in the upcoming 1.2.0 release, as is support for writing to branches in both Spark and Flink. I’m excited to see what other patterns are built around these powerful new building blocks!