July 2022 - What's New in Iceberg?

It’s been an awesome year for Apache Iceberg: new features, an explosion of support across different engines, and a growing number of contributors and community members. Here is a recap of what’s happened over the past month.

New Features & Support

Iceberg now supports Spark 3.3.0, which includes improved join query performance via Bloom filters, more complete ANSI compliance, and better overall error handling, among many other things. The recently released Iceberg version 0.14 includes merge-on-read support for UPDATE and MERGE commands, as well as improved scan planning performance for Spark 3+.
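For readers who want to try merge-on-read, here is a minimal sketch of how a table might be opted in. It assumes the Iceberg table properties `write.update.mode` and `write.merge.mode`, and a format version 2 table, since merge-on-read relies on delete files. The table name `demo.db.events` and the helper `alter_table_sql` are hypothetical, made up for illustration.

```python
from typing import Mapping

# Iceberg table properties that select the row-level write mode.
# Each mode is either "copy-on-write" or "merge-on-read".
MERGE_ON_READ_PROPERTIES = {
    "format-version": "2",  # assumption: merge-on-read needs v2 delete files
    "write.update.mode": "merge-on-read",
    "write.merge.mode": "merge-on-read",
}


def alter_table_sql(table: str, properties: Mapping[str, str]) -> str:
    """Build a Spark SQL ALTER TABLE statement that sets table properties.

    Hypothetical helper for illustration; not part of Iceberg or Spark.
    """
    assignments = ", ".join(f"'{key}' = '{value}'" for key, value in properties.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({assignments})"


# "demo.db.events" is a made-up table name.
statement = alter_table_sql("demo.db.events", MERGE_ON_READ_PROPERTIES)
print(statement)
```

With a SparkSession that is already configured for an Iceberg catalog, the generated statement could be passed to `spark.sql(statement)`; later UPDATE and MERGE commands on that table would then write delete files instead of rewriting whole data files.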

The latest release of Iceberg also supports Flink 1.15 and comes with a FLIP-27 reader. FLIP-27 is the newer source interface for Flink. It is unified for both streaming and batch modes and offers better performance, as well as stronger failure recovery with region recovery mode.

Other big new features include the new REST catalog, a new file format for index and stats data called “Puffin”, and a new Z-Order strategy option for rewrite operations!
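If Z-ordering is new to you, the general idea is to sort rows by a single value built by interleaving the bits of several columns, so that rows close together in all of those columns tend to land in the same files. Below is a minimal sketch of that idea for two small non-negative integers. It is a conceptual illustration only: the function name `interleave_bits` is hypothetical, and Iceberg's own implementation, which has to handle arbitrary column types, is not reproduced here.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


# Sort a small 4x4 grid of (x, y) points by their Z-order value.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda point: interleave_bits(*point))
print(ordered[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Notice that the first four points form a compact 2x2 square rather than a single row of the grid. That clustering in both dimensions at once is what lets file-level min/max statistics prune data for filters on either column.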

Python Client Progress

A tremendous amount of progress has been made on a new native Python client for Iceberg. Core components of the client, such as expressions, readers, and catalog implementations, are taking shape, and the community is working diligently towards an initial release that includes read support.

Two New Specifications

Two new specs have been added that complement the Iceberg table format spec. The Iceberg View-Spec brings the multi-engine support vision of Iceberg tables to views. Iceberg views standardize view metadata so that multiple compute engines can create, modify, and remove views from a common metastore.

In addition to the view spec, the Puffin file format has been accepted and adopted by the Iceberg community. Puffin is a file format for storing indexes and additional statistics about Iceberg tables and columns. These files are tracked by the table’s metadata and can be used by engines during query planning. The data is stored as binary blobs, and the first version of the Puffin spec includes a blob type for theta sketches produced by the Apache DataSketches library. One of the first use cases for Puffin stats is storing the number of distinct values (NDV) for individual columns in a table.
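To give a feel for how a small sketch can stand in for an exact distinct count, here is a simplified "k minimum values" (KMV) estimator, the classic technique that theta sketches generalize. This is not the Apache DataSketches implementation and does not read or write Puffin files. The function names, the choice of SHA-256 as the hash, and the default `k` are all assumptions made for illustration.

```python
import hashlib
import heapq


def hash_to_unit_interval(value: str) -> float:
    """Map a value to a roughly uniform float in the unit interval via SHA-256."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_ndv(values, k: int = 256) -> float:
    """Estimate the number of distinct values by keeping the k smallest hashes."""
    heap = []     # max-heap (hashes stored negated) of the k smallest distinct hashes
    kept = set()  # the same hashes, for fast duplicate checks
    for value in values:
        h = hash_to_unit_interval(str(value))
        if h in kept:
            continue  # this hash is already retained
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: the count is exact
    theta = -heap[0]             # the k-th smallest hash seen
    return (k - 1) / theta       # standard KMV estimate


print(estimate_ndv(["a", "b", "a", "c"]))  # 3.0 (exact, fewer than k distinct values)
print(estimate_ndv(i % 10_000 for i in range(50_000)))  # an estimate near 10000
```

The memory used is bounded by `k` no matter how many rows are scanned, which is why a summary like this is small enough to store alongside table metadata. A larger `k` trades more space for a tighter estimate.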

Expansive Support For Iceberg

Iceberg support has spread immensely across many compute engines and platforms this year. AWS Athena, Snowflake, and Cloudera have all made big announcements around full support for Iceberg. Last month Apache Doris graduated from incubator status and recently announced support for reading Iceberg tables. Apache Doris is an open source massively parallel processing (MPP) database, started on Baidu’s advertising team and officially open sourced back in 2017. StarRocks is another high-performance MPP database, originally based on Apache Doris. It aims to enable rapid data analytics through highly optimized query performance, and as of StarRocks version 2.1.0, Iceberg tables can be used with StarRocks. When using StarRocks 2.3 or later, using Iceberg is even easier with automatically synchronized schemas!

As more platforms support Iceberg tables, a ton of great content is getting published based on real experiences. Cloudera recently released a technology spotlight called “Supercharge Your Data Lakehouse with Apache Iceberg in Cloudera Data Platform” that includes a great overview of Iceberg’s strengths and how you can leverage them in the Cloudera Data Platform. Snowflake also published a great article titled “5 Compelling Reasons to Choose Apache Iceberg” that makes a case for Iceberg as a modern table format with a lot to offer.

Community Growth

One of the best parts of Iceberg is the vibrant community of engineers, developers, and users. The community has seen continued growth with new members joining from all over the world. Whether it’s the efforts around the native Python client, the growing number of compute engine integrations, or just the love of all things data, new members have brought fresh ideas and perspectives that have kept the momentum going at impressive levels.

With all of the excitement this year so far, the second half of 2022 looks like it will be even more exhilarating. Don’t wait to join the community! We’ve outlined various ways to join and participate on our Community Page. I hope to see you there!