Yesterday the Apache Iceberg community released version 1.4.0, packed with improvements. In this post, I’ll highlight a few of those improvements that will make life easier for data practitioners.
Core Iceberg improvements
Let’s start with the core library. First, a few defaults were changed in this release and the most significant is the default table format. Tables will now use format v2 by default.
The v2 format spec was adopted a little more than 2 years ago — in early August 2021 — and is now widely supported. Although older clients could, in theory, lack support, there aren’t many deployments with versions 2+ years old. Further, v2 is already the default for some query engines like Trino. You can still use v1 if you choose by creating tables with the format-version property set to 1, and this doesn’t affect existing tables.
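If you want to keep a particular table on v1, the format version is just a table property set at creation time. Here's a minimal Spark SQL sketch (the catalog, table, and schema are made up for illustration):

```sql
-- New tables now default to format v2; set the property explicitly to stay on v1.
CREATE TABLE my_catalog.db.events (
    id      BIGINT,
    ts      TIMESTAMP,
    payload STRING)
USING iceberg
PARTITIONED BY (days(ts))
TBLPROPERTIES ('format-version' = '1');
```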
There are a couple of motivations to update to v2. A few metadata problems were fixed in v2 that no longer need workarounds, like needing placeholders to drop partition fields. In addition, v2 added support for row-level delete files that can be used to reduce write amplification. These delete files are used by the merge-on-read strategy in Spark’s row-level MERGE and UPDATE commands. Support for merging delete files at read time is missing in some engines, but using delete files is an optional performance improvement. Make sure all readers support delete files before producing them from a MERGE (by default, Spark uses copy-on-write).
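If you do want MERGE and UPDATE to produce delete files, that is controlled per table with the write mode properties. A hedged sketch (the table name is illustrative):

```sql
-- Switch row-level commands to merge-on-read so they write delete files
-- instead of rewriting the affected data files. Only do this once every
-- reader of the table can apply delete files.
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read');
```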
The compression codec default changed from gzip to zstd. Choosing the compression codec and level per table gives the best results, but zstd is a good starting point. At Tabular, we’ve found that zstd is often (though not always) selected by our table analyzer and tends to have a much better compression ratio with the same read performance.
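Per-table tuning uses the same property mechanism. For example, for a Parquet-backed table, the codec and level can be overridden like this (the table name and the level value are illustrative):

```sql
-- zstd is now the default; the codec and level can still be tuned per table.
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'write.parquet.compression-codec' = 'zstd',
    'write.parquet.compression-level' = '3');
```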
The release also comes with a few notable additions to IO and catalogs. Direct ADLS support is now available using the new ADLSFileIO. FileIO is a simpler abstraction than a file system, so direct support avoids the overhead of unnecessary requests for directory handling.
Lastly, the REST catalog protocol was extended to support multi-table commits. The new endpoint allows sending a set of requirements and changes for a set of tables rather than just one table. This is one of the building blocks needed to support multi-table transactions, but once those are fully supported, multi-table commits can be used to coordinate operations on multiple tables. For example, this now supports the write-audit-publish pattern across multiple tables by staging updates in branches, then fast-forwarding the main branch to the new state for all tables at the same time.
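To sketch what that pattern looks like for a single table today, here is a rough Spark SQL example of branch-based write-audit-publish (names are made up; the new REST endpoint is what would let the final fast-forwards for several tables commit together):

```sql
-- Stage changes on an audit branch instead of main.
ALTER TABLE my_catalog.db.events CREATE BRANCH audit;

-- Write to the branch (for example with spark.wap.branch set to 'audit'),
-- then run validation queries against it.

-- Publish by fast-forwarding main to the audited branch state.
CALL my_catalog.system.fast_forward('db.events', 'main', 'audit');
```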
Spark updates
The release also includes some big updates to Spark support. The Spark community released 3.5 a couple of weeks ago and support for Spark 3.5 is included in Iceberg 1.4. With the addition of 3.5, the community has removed the artifacts for Spark 3.1, which was released more than 2 years ago and is no longer supported by the Spark community. Spark 3.2 is now considered deprecated.
Spark 3.5 brings several improvements that make life easier for data engineers. The first is a new interface that lets Iceberg request better task sizes when writing. Before, adaptive query execution (AQE) used the same task size for intermediate stages and writes — 64 MB by default — even though data written to Parquet tends to be much smaller once encoded and compressed. As a result, Spark would produce small files even with AQE enabled. Now, the target task size (or “advisory partition size” in Spark) can be set by the data source connector, so Iceberg will request larger tasks to avoid creating too many tiny files.
In general, we think this is the right approach, but you may have jobs where tasks get larger and the write takes more time. In most cases this isn’t worth worrying about, because it’s better to write well-sized files in the first place. If you need to, you can set the target size per table with the write.spark.advisory-partition-size-bytes table property.
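For example, to ask for roughly 256 MB write tasks on one table (table name and size are illustrative):

```sql
-- Request larger write tasks for this table via Spark's advisory partition size.
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'write.spark.advisory-partition-size-bytes' = '268435456');
```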
Another major change in 3.5 is that Iceberg’s row-level commands (like MERGE INTO) were moved into Spark. This is a win for both projects. Spark had no implementation of MERGE and UPDATE, and now other data sources can take advantage of both the copy-on-write and merge-on-read plans. For Iceberg this means the implementations stay up-to-date as Spark changes, which will make integrating new Spark versions faster.
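For anyone who hasn’t used the row-level commands, a MERGE against an Iceberg table looks like the sketch below (table and source names are made up); whether it executes as copy-on-write or merge-on-read is controlled by the table properties shown earlier.

```sql
-- Upsert a batch of changes into the target table.
MERGE INTO my_catalog.db.events t
USING updates u
  ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```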
In addition to support for Spark 3.5, the 1.4 release adds several performance improvements:
- Writes to unsorted tables will skip local sorts and will be faster. Iceberg previously required a local sort to ensure data was clustered by partition. Before Spark supported requesting a distribution for the incoming data, tasks could receive data for any partition, which could easily exhaust memory by keeping an open Parquet file for every partition. Now Iceberg will request a distribution that avoids lots of open files, so the expensive local sort is no longer necessary (see the sketch after this list for controlling the distribution yourself).
- Scans for very large tables can distribute planning in a Spark cluster. In really large tables, metadata becomes big data. Usually, Iceberg can use metadata indexing to skip reading most metadata and plan table scans quickly. But metadata clustering isn’t always aligned with scan patterns. For those cases, distributed planning can take advantage of Spark to parallelize metadata reads for faster queries.
- Task sizes will adapt to maximize parallelism for small queries. Small queries are sometimes bottlenecked by using a few large tasks instead of many small tasks. When a query can be satisfied by a few hundred MBs, a cluster may have excess capacity idling because there are only a couple tasks. In 1.4, Iceberg will attempt to detect when there is excess capacity and rebalance work to take advantage.
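As mentioned in the first item above, the requested write distribution is what removes the need for the local sort. If you want to control how incoming data is clustered yourself, the existing write.distribution-mode property is the knob (table name is illustrative):

```sql
-- 'hash' clusters incoming rows by partition before writing; 'range' also
-- sorts them, and 'none' skips the shuffle entirely.
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'write.distribution-mode' = 'hash');
```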
Last, the new release also supports function pushdown, so queries with filters that use Iceberg functions, like bucket(16, id) = 0, will be pushed down and can select Iceberg partitions directly.
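For example, for a table partitioned by bucket(16, id), a query like this sketch can now prune to a single bucket (catalog and table names are made up; in Spark the functions live under the catalog’s system namespace):

```sql
-- The bucket() predicate is pushed down and prunes partitions directly.
SELECT *
FROM my_catalog.db.events
WHERE my_catalog.system.bucket(16, id) = 0;
```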
What’s next?
There’s even more in 1.4 than I can cover here, so check out the Releases page for more information.
Next, the community is working hard on the 1.5 release! The common view API has already landed in the main branch, with a test implementation in the in-memory catalog. We’ll also be adding view support to the REST protocol and working on Spark integration.
Also, the recent work on table encryption is expected to land in 1.5. This will include metadata encryption using the AES GCM encryption streams and native Parquet encryption for data files.