Iceberg's Guiding Light: The Iceberg Open Table Format Specification


If you’ve worked with Iceberg tables, you may have come across the table property format-version and wondered what the difference is between versions 1 and 2. This post demystifies that property by describing its background, its significance, and the implications of its different values.

The Iceberg Table Spec

When working with Iceberg tables through a compute engine or an Iceberg client library, it’s easy to think of Iceberg as a piece of software for reading and writing data. At its core, however, Iceberg is an open community standard with a very detailed table specification that ensures compatibility across various languages and implementations. This spec is the bedrock that enables consistent behavior and data correctness and, just like the Iceberg source code, is maintained with extreme care by the Iceberg community.

Format Version

The data space continues to experience immense growth and innovation, which brings an ever-growing list of critical features for the ecosystem of tools that support it. The Iceberg community captures these requirements and expected behaviors in a carefully constructed, detailed format specification. Rather than simply implementing a canonical library with some set of behaviors, the community spells out the contracts an Iceberg implementation must honor, giving readers and writers a tight set of requirements and leaving no room for surprises.

The Iceberg community divides new features into two categories: those that require careful implementation in the Iceberg clients or runtimes, and those that require modifying or extending the Iceberg table format spec. As you might expect, changes to the spec are made with extraordinary care and consideration. Since Iceberg’s creation, a new version of the format spec has been released only once. In 2020, version 2 of the format spec was finalized and adopted by the Iceberg community. In the same way that SQL has provided an open standard for query behavior, helping various query engines “speak the same language”, the Iceberg open table specification provides an open standard for table behavior at massive scale. So far, this has helped many powerful open source compute engines share the same data warehouse, allowing that warehouse to serve as the center of data gravity for many organizations.

V1 vs. V2

The original Iceberg specification (V1) outlined much of the core design and behavior that exists in Iceberg today. It described how optimistic concurrency control is achieved through atomic swapping of table metadata files and defined the requirements for readers and writers. It also listed the minimum set of operations required of a compatible file system and detailed all metadata files, properties, and data types, as well as table constructs such as schema and partitioning.
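To make the atomic-swap idea concrete, here is a minimal sketch of optimistic concurrency. It is a toy model, not Iceberg's implementation: ToyCatalog, CommitConflict, and commit_with_retry are hypothetical names, and the "metadata file" is just a string standing in for a file location.

```python
import threading


class CommitConflict(Exception):
    """Raised when the metadata pointer moved after the writer read it."""


class ToyCatalog:
    """Hypothetical catalog that tracks one pointer to the current metadata file."""

    def __init__(self, initial_location):
        self._location = initial_location
        self._lock = threading.Lock()

    def current(self):
        return self._location

    def swap(self, expected, new):
        # Atomic compare-and-swap: succeed only if no other writer committed
        # since `expected` was read.
        with self._lock:
            if self._location != expected:
                raise CommitConflict(f"expected {expected}, found {self._location}")
            self._location = new


def commit_with_retry(catalog, make_new_metadata, max_attempts=3):
    """Build new metadata from the current base and retry if another writer wins."""
    for _ in range(max_attempts):
        base = catalog.current()
        candidate = make_new_metadata(base)
        try:
            catalog.swap(base, candidate)
            return candidate
        except CommitConflict:
            continue  # re-read the latest base and try again
    raise CommitConflict("gave up after repeated conflicts")
```

A writer never locks the table while it works. It prepares a complete new metadata file on the side and only the final pointer swap has to be atomic, which is why a stale writer fails cleanly instead of overwriting someone else's commit.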

As designs for row-level mutations emerged, it became obvious that this feature would require breaking changes to the Iceberg table spec. Enabling row-level deletes, in particular, introduces a new element: delete files. Delete files encode row-level deletes in two ways. Position deletes specify a data file path and a row position that readers should consider deleted. Equality deletes, on the other hand, specify values for one or more columns, and readers should consider any row matching those values deleted. This required broad implementation changes to the Iceberg clients as well as to all compute engines, and so it was defined in a new version of the Iceberg table spec, V2.
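The two kinds of delete files are easier to compare side by side. The sketch below is a simplified illustration, assuming rows are plain dictionaries and delete files are already loaded into memory. The function name apply_deletes and the file paths are made up for this example, and it ignores the sequence numbers the real spec uses to decide which deletes apply to which data files.

```python
def apply_deletes(data_file_path, rows, position_deletes, equality_deletes):
    """Return the rows of one data file that a reader should still see.

    rows: list of dicts in file order, so the list index is the row position.
    position_deletes: iterable of (data_file_path, row_position) pairs.
    equality_deletes: list of dicts mapping column name -> deleted value.
    """
    # Position deletes only matter if they name this data file.
    deleted_positions = {
        pos for path, pos in position_deletes if path == data_file_path
    }
    live = []
    for pos, row in enumerate(rows):
        if pos in deleted_positions:
            continue
        # An equality delete removes a row when every listed column matches.
        if any(
            d and all(row.get(col) == val for col, val in d.items())
            for d in equality_deletes
        ):
            continue
        live.append(row)
    return live


rows = [
    {"app": "web", "lvl": "INFO", "message": "started"},
    {"app": "web", "lvl": "DEBUG", "message": "cache miss"},
    {"app": "api", "lvl": "INFO", "message": "request"},
    {"app": "api", "lvl": "ERROR", "message": "timeout"},
]
position_deletes = [("data/file-a.parquet", 0), ("data/file-b.parquet", 2)]
equality_deletes = [{"lvl": "DEBUG"}]

live = apply_deletes("data/file-a.parquet", rows, position_deletes, equality_deletes)
print([row["message"] for row in live])
```

Here the first row is dropped by a position delete, the DEBUG row is dropped by an equality delete, and the position delete that names a different data file has no effect.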

Additionally, implementations needed a way to identify tables created according to the V2 spec in order to ensure the correct behavior for readers and writers. This made the value of the format-version property, which had been largely ignored until the release of the V2 spec, extremely important.
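As a rough sketch of what that identification looks like, the snippet below reads the format-version field from a table metadata JSON document and refuses versions it does not implement. The set of supported versions and both function names are assumptions for this example; a real metadata file contains many more fields than the one shown.

```python
import json

# Assumption: the versions this toy reader knows how to handle.
SUPPORTED_FORMAT_VERSIONS = {1, 2}


def check_format_version(metadata_json):
    """Parse table metadata and return its format version, or fail loudly."""
    metadata = json.loads(metadata_json)
    version = metadata["format-version"]
    if version not in SUPPORTED_FORMAT_VERSIONS:
        raise ValueError(f"unsupported format-version: {version}")
    return version


def must_apply_delete_files(version):
    """Row-level delete files were introduced in V2, so V1 tables have none."""
    return version >= 2


version = check_format_version('{"format-version": 2}')
print(version, must_apply_delete_files(version))
```

Failing on an unknown version is the safe default: a reader that silently treated a newer table as an older one could return rows that have already been deleted.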

V3

Many new Iceberg features on the horizon include changes to the specification. These features include data encryption, secondary indexes, default values for fields, and support for relative paths. You can see more details about these features by checking out the “Spec v3” projects in the Iceberg repo. As these features are released, you can bet that the behavior contracts will be captured in the V3 spec!

Setting format-version of a Table

This wouldn’t be a proper post without some code, so the following shows how you can set format-version as a table property when creating a table.

CREATE TABLE logs (app string, lvl string, message string, event_ts timestamp)
TBLPROPERTIES ('format-version' = '2')