The Apache Iceberg community just released a new version, 1.1.0. In this post, we’ll explore some of the recent highlights:
- API stability
- Puffin Stats Files
- Branching and Tagging
- Table Scan Metrics
- Changelog Scans
- Spark FunctionCatalog
API stability
Iceberg 1.1 comes on the heels of the 1.0 release that added API stability guarantees. While API stability isn’t brand new, it’s still worth noting! From 1.0 onward, the community will maintain binary compatibility within major release versions for the iceberg-api module.
Other modules also have documented guarantees: core and the other modules intended for query engine integration will deprecate APIs and continue to support them for at least one minor release before removal. Spark modules, for example, can change more rapidly because their purpose is to plug capabilities into Spark’s API rather than to provide an API directly.
Puffin Stats Files
Puffin is a format for storing statistics, indexes, and data sketches. In this release, Puffin files have been added to table metadata to track statistics used by cost-based optimizers, like column distinct value counts (NDVs). Puffin is used both to share these stats across query engines and to maintain the data sketches that produce those stats incrementally.
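To make the sketch idea concrete, here’s a rough illustration using the Apache DataSketches Python bindings (Puffin’s spec defines a blob type for this family of theta sketches). The datasketches package and the sample values are assumptions for this example; this is not Iceberg’s Puffin API:

# Illustration of how sketch-based NDV stats can be maintained
# incrementally; requires `pip install datasketches`.
from datasketches import update_theta_sketch, theta_union

# Build one sketch per data file's column values (hypothetical values).
file_a = update_theta_sketch()
for v in ["NYC", "SFO", "LAX"]:
    file_a.update(v)

file_b = update_theta_sketch()
for v in ["LAX", "ORD"]:
    file_b.update(v)

# Sketches merge cheaply, so the NDV estimate can be updated as new
# files are added instead of rescanning the whole table.
union = theta_union()
union.update(file_a)
union.update(file_b)
print(union.get_result().get_estimate())  # ~4 distinct values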
Branching and Tagging
The 0.14 release added the metadata API to manage table branches and tags, but the features weren’t yet exposed to engines. The 1.1 release adds the ability to append or delete data in a branch, and the ability for Spark to read from a named branch or tag:
df = (spark.read
    .option("tag", "q4_2022")
    .table("accounting.transactions"))
Table Scan Metrics
An important trick for query performance is to make sure job planning can take advantage of Iceberg’s metadata indexing. This is now much easier because 1.1 collects and logs scan metrics, including the number of manifests used for planning and the total number of manifests in the table. That makes it simple to spot cases where queries can be sped up by rewriting metadata, as in the example below.
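For example, if the metrics show planning reading far more manifests than it should, the rewrite_manifests Spark procedure can compact them; the catalog and table names below are placeholders:

# Compact small manifests so job planning touches fewer metadata files;
# `my_catalog` and the table name are placeholders for this sketch.
spark.sql("CALL my_catalog.system.rewrite_manifests('accounting.transactions')")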
Changelog Scans
Before 1.1, it was possible to incrementally read data appended to a table, but not data that was deleted. While this worked for fact tables, not all tables change only by inserting rows.
Iceberg 1.1 introduces a new scan type for reading tables incrementally that produces all inserted or deleted rows, along with metadata columns that signal whether the row was added or deleted and when the change happened.
Using the new scan type from Spark is as easy as reading from the changes metadata table:
df = (spark.read
    .option("start-snapshot-id", "5186366032052790134")
    .table("taxi.nyc_taxi_yellow.changes"))
Spark FunctionCatalog
Iceberg’s internal partition functions are now available from Iceberg catalogs in Spark SQL. This makes it significantly easier to provide a custom sort order in a job that writes to a table partitioned by the bucket or truncate transforms. The functions are exposed in the system database:
SELECT system.bucket(128, 'Thanks, Kyle!') AS bucket_val
There’s also a function to return the current Iceberg version:
SELECT system.iceberg_version() AS version
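Because the functions resolve in SQL, they can be used directly in an ORDER BY to cluster a write by the table’s bucket partitioning; the table and column names here are hypothetical:

# Cluster incoming rows by the same bucket transform the target table
# is partitioned by; table and column names are placeholders.
spark.sql("""
    INSERT INTO accounting.transactions
    SELECT * FROM transactions_staging
    ORDER BY system.bucket(128, txn_id)
""")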