It’s been an awesome year for Apache Iceberg: new features, an explosion of support across different engines, and a growing number of contributors and community members.
Here is a recap of what’s happened over the past month.
- New Features & Support
- Python Client Progress
- Two New Specifications
- Expansive Support For Iceberg
- Community Growth
New Features & Support
Iceberg now supports Spark 3.3.0, which includes improved join query performance via Bloom filters, more complete ANSI compliance, and better overall error handling, among
many other things. The recently released Iceberg version 0.14 includes merge-on-read support for UPDATE and MERGE commands, as well as improved
scan planning performance for Spark 3+.
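As a rough illustration of how merge-on-read might be switched on for a table, the Python sketch below builds the Spark SQL statement that sets the `write.update.mode` and `write.merge.mode` table properties. The table name `demo.db.events` and the helper `merge_on_read_ddl` are hypothetical, and the sketch assumes the table already uses format version 2, which row-level deletes rely on.

```python
def merge_on_read_ddl(table: str) -> str:
    """Build an ALTER TABLE statement that opts UPDATE and MERGE into merge-on-read.

    `table` is a hypothetical identifier; substitute your own catalog.db.table.
    """
    props = {
        "write.update.mode": "merge-on-read",
        "write.merge.mode": "merge-on-read",
    }
    assignments = ",\n  ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n  {assignments}\n)"


ddl = merge_on_read_ddl("demo.db.events")
print(ddl)
```

In a Spark session with an Iceberg catalog configured, the resulting string could be passed to `spark.sql(ddl)`.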
The latest release of Iceberg also supports Flink 1.15 and comes with a FLIP-27 reader. FLIP-27 is the newer source interface for Flink. It is unified for both
streaming and batch modes and comes with better performance, as well as stronger failure recovery with region recovery mode.
Other big new features include the new REST catalog, a new file format for index and stats data called “Puffin”, and a new Z-Order strategy option for rewrite operations!
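To give a feel for what a Z-order sort does, here is a minimal Python sketch of the underlying idea: interleaving the bits of two column values so that rows close together in both columns also land close together in the sort order. This is a toy illustration, not Iceberg's implementation; the function name `interleave_bits` and the 4x4 grid of points are made up for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z


# Sort a small grid of (x, y) pairs by their Z-order value.
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(*p))
print(z_sorted[:4])  # the first 2x2 block of the grid
```

Because neighboring points in both dimensions tend to be grouped into the same files, a query that filters on either column can skip more files than it could with a sort on one column alone.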
Python Client Progress
A tremendous amount of progress has been made on the development of a new native Python client for Iceberg. Core components of the client are taking shape, such as
expressions, readers, and catalog implementations, and the community is working diligently towards an initial release that includes read support.
Two New Specifications
Two new specs have been added that complement the Iceberg table format spec. The Iceberg View-Spec brings the
multi-engine support vision of Iceberg tables to views. Iceberg views standardize view metadata so that multiple compute engines can create, modify, and
remove views from a common metastore.
In addition to the view spec, the Puffin file format has been accepted and adopted by the Iceberg community.
Puffin is a file format for storing indexes and additional statistics about Iceberg tables and columns. These files are tracked by the table’s metadata and can
be used by engines during query planning. The data is stored as binary blobs and the first version of the Puffin spec includes a blob type for theta sketches
produced by the Apache DataSketches library. One of the first use cases for Puffin stats is storing the number of distinct values (NDV)
for individual columns in a table.
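To show why a compact sketch is enough to estimate NDV, here is a toy k-minimum-values (KMV) estimator in Python, a simpler relative of the theta sketches mentioned above. It is not the DataSketches algorithm and has nothing to do with the Puffin blob layout; the class name `KmvSketch`, the choice of SHA-256, and the sample data are all assumptions made for illustration.

```python
import hashlib
import heapq


class KmvSketch:
    """Toy distinct-count estimator that keeps only the k smallest hash values."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []     # negated hashes, so -_heap[0] is the largest kept hash
        self._kept = set()  # the same hashes, for fast duplicate checks

    @staticmethod
    def _hash(value) -> float:
        # Map any value to a pseudo-random float in [0, 1).
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, value) -> None:
        h = self._hash(value)
        if h in self._kept:
            return  # duplicates never change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct values: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


# 40,000 rows containing 20,000 distinct values.
sketch = KmvSketch(k=1024)
for i in range(40_000):
    sketch.update(f"user-{i % 20_000}")
print(round(sketch.estimate()))
```

The sketch holds at most `k` hashes no matter how many rows it sees, which is what makes this kind of statistic cheap to store alongside table metadata and to read back during query planning.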
Expansive Support For Iceberg
Iceberg support has spread immensely across many compute engines and platforms this year. AWS Athena, Snowflake, and Cloudera have all made big announcements around full support for Iceberg. Last month Apache Doris graduated from incubator status
and recently announced support for reading Iceberg tables. Apache Doris is an open source massively parallel processing (MPP) database, started on Baidu’s advertising
team and officially open sourced back in 2017. Another high-performance MPP database, originally based on Apache Doris, is StarRocks. StarRocks aims to enable
fast analytics across a wide range of scenarios through highly optimized query performance, and as of StarRocks version 2.1.0,
Iceberg tables can be used with StarRocks.
With StarRocks 2.3 or later, using Iceberg is even easier thanks to automatically synchronized schemas!
As more platforms support Iceberg tables, a ton of great content is getting published based on real experiences. Cloudera recently released a technology spotlight called Supercharge Your Data Lakehouse with Apache Iceberg in Cloudera Data Platform that includes a great overview of Iceberg’s strengths and how you can leverage them in the Cloudera Data Platform. Snowflake also published a great article titled 5 Compelling Reasons to Choose Apache Iceberg that makes a case for Iceberg as a modern table format with a lot to offer.
Community Growth
One of the best parts of Iceberg is the vibrant community of engineers, developers, and users. The community has seen continued growth with new members joining
from all over the world. Whether it’s the efforts around the native Python client, the growing number of compute engine integrations, or just the love of all
things data, new members have brought fresh ideas and perspectives that have kept the momentum going at impressive levels.
With all of the excitement this year so far, the second half of 2022 looks like it will be even more exhilarating. Don’t wait to join the community! We’ve outlined
various ways to join and participate on our Community Page. I hope to see you there!