❤️ Apache Iceberg? Spread the word by giving it a ⭐ on the apache/iceberg repo!
Project updates
Iceberg Java
- Release 1.4.0 is just around the corner! 🙌🏿
- Added Spark 3.5 support
- Row-level implementations of merge, update, and delete have moved to Spark, and all related extensions have been dropped from Iceberg.
- You can now pass an advisory partition size: set the target size per table and let Adaptive Query Execution coalesce partitions automatically. This makes the output tunable based on runtime metrics, minimizing the number of small files that require compaction.
- This release removes support for Spark 3.1 and deprecates support for 3.2. Together with the move of the row-level implementations to Spark, this makes it faster for Iceberg to pick up new Spark releases.
- Added Spark support for distributed planning
- Iceberg planning utilizes manifest partition info to quickly plan queries with partition filters, especially when metadata is properly clustered.
- Distributed planning can enhance performance for specific use cases due to higher cluster parallelism compared to driver cores.
- Tests on a table with 20 million files showed significant improvements in planning times for various queries, but the cost of delivering results can be a limiting factor.
- Push down Iceberg functions to Spark V2 filters. Iceberg can now push down system functions to reduce the amount of data read from files. For example, if we only want to retrieve data from a single bucket we can use:
    spark.sql(
        """
        SELECT * FROM my_catalog.db.table
        WHERE my_catalog.system.bucket(10, id) = 2;
        """
    )
This will also work with other Iceberg partition functions. In addition, we can take advantage of this when calling rewriteDataFiles (rewrite_data_files) and rewritePositionDeleteFiles (rewrite_position_delete_files), like so:
    spark.sql(
        """
        CALL my_catalog.system.rewrite_data_files(
            table => 'foo.bar',
            where => 'my_catalog.system.bucket(4, url) = 0')
        """
    )
- Added AES GCM Stream encryption and decryption. Support has been added for AES GCM Stream, which provides data encryption and integrity verification. This is a great step forward in the ongoing effort towards full metadata encryption.
- Added strict metadata cleanup. Strict metadata cleanup provides additional protection against table corruption by only triggering metadata cleanup operations when commits fail due to an exception that implements the CleanableFailure interface.
- Added vectorized reads on delete, update, and merge plans
- Removed restrictions in Arrow and Spark 3.4 logic that only enabled delete reads.
- Delete, update, and merge plans can now continue with vectorized execution rather than falling back to row-based reads.
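The `bucket(10, id) = 2` filter in the pushdown example above keeps only rows whose `id` hashes into bucket 2 of 10. As a rough illustration of what that function computes, here is a pure-Python sketch of the bucket transform as described in the Iceberg table spec: a 32-bit Murmur3 hash of the value, masked to be non-negative, modulo the bucket count. The helper names `murmur3_32` and `bucket` are made up for this sketch, it only handles integer and string inputs, and it is not the code that Iceberg or Spark actually runs.

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 (x86 variant), returned as a signed 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    full = len(data) - (len(data) % 4)

    # Body: mix in each complete 4-byte little-endian block.
    for i in range(0, full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[full:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16

    # Reinterpret as signed so the result matches a Java int.
    return h - (1 << 32) if h & 0x80000000 else h


def bucket(num_buckets: int, value) -> int:
    """Assign a value to one of num_buckets buckets (sketch, ints and strings only)."""
    if isinstance(value, int):
        # Integers are hashed as 8-byte little-endian longs.
        data = value.to_bytes(8, "little", signed=True)
    elif isinstance(value, str):
        data = value.encode("utf-8")
    else:
        raise TypeError(f"unsupported type for this sketch: {type(value).__name__}")
    # Drop the sign bit, then take the remainder.
    return (murmur3_32(data) & 0x7FFFFFFF) % num_buckets


# Which of the ids 1..20 would survive a `bucket(10, id) = 2` filter?
ids_in_bucket_2 = [i for i in range(1, 21) if bucket(10, i) == 2]
print(ids_in_bucket_2)
```

Because the bucket is a pure function of the value, a filter on it can be evaluated against partition metadata before any data file is opened, which is what makes the pushdown worthwhile.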
PyIceberg, Iceberg-Go, and Iceberg-Rust
- Released PyIceberg 0.5.0 🎉
- Support serverless environments (including AWS Lambda)
- Support for schema evolution
- HDFS support through PyArrow
- More about PyIceberg
- The Iceberg Rust client continues progressing with the addition of FileIO and the Catalog API
- The next frontier is Table interfaces
- Not sure what any of this means? Read about FileIO or a Catalog to get a better understanding of these APIs.
- The Iceberg Go client added support for partition specs, manifest files, and is now progressing to the Table interface as well.
- What’s a partition spec?
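For anyone asking the same question: a partition spec is essentially a list of fields, each naming a source column, a transform, and a partition name. Applying the spec to a row yields the tuple of partition values the row is stored under. Below is a simplified pure-Python model of that idea. The class and function names are invented for illustration and are not the API of the Go, Rust, or Python clients, and only three transforms are modeled.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Any, Callable, Dict, List, Tuple

EPOCH = date(1970, 1, 1)


def identity(value: Any) -> Any:
    """Use the source value unchanged as the partition value."""
    return value


def day(ts: datetime) -> int:
    """Days since 1970-01-01 (the timestamp is treated as UTC in this sketch)."""
    return (ts.date() - EPOCH).days


def truncate(width: int) -> Callable[[str], str]:
    """Build a transform that keeps the first `width` characters of a string."""
    def apply(value: str) -> str:
        return value[:width]
    return apply


@dataclass(frozen=True)
class PartitionField:
    source_column: str                 # column the value is read from
    transform: Callable[[Any], Any]    # how the value becomes a partition value
    name: str                          # name of the resulting partition field


def partition_values(spec: List[PartitionField], row: Dict[str, Any]) -> Tuple[Any, ...]:
    """Apply every field of the spec to a row, in order."""
    return tuple(f.transform(row[f.source_column]) for f in spec)


# A hypothetical spec and row, purely for demonstration.
spec = [
    PartitionField("event_ts", day, "event_day"),
    PartitionField("country", identity, "country"),
    PartitionField("url", truncate(8), "url_prefix"),
]
row = {
    "event_ts": datetime(2023, 10, 1, 12, 30),
    "country": "NL",
    "url": "https://iceberg.apache.org",
}
values = partition_values(spec, row)
print(dict(zip((f.name for f in spec), values)))
```

Since the partition values are derived from regular columns, queries can filter on `event_ts` directly and still benefit from partition pruning, without a separate partition column to maintain.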
Bergy Blogs
- How to Reduce Full Table Scans during Merges in Apache Iceberg and Save Money
- Iceberg REST Catalog with Hive Metastore
- The Disruptive Nature of Data Lakehouses
- Spark + Kyuubi + Iceberg = Lakehouse
- How to work with Iceberg Format in AWS Glue
- Ryan Blue: Deep Dive into CDC Series
Ecosystem Updates
- Announcing DuckDB 0.9.0 🦆: DuckDB launched an experimental Iceberg extension.
- The extension currently supports basic Iceberg table reads with little optimization
- Added 🐻‍❄️ Polars integration based on PyIceberg
- Read the docs on the scan_iceberg method
- Trino 🐇 adds read support for refs and tags in their 427 release
- That’s right, you can read branches and tags from Trino now
- Stay tuned for creating branches and tags via SQL
Vendor Updates
- A Case for Independent Storage. Tabular raises $26M
- MotherDuck announces the DuckDB Iceberg support and their strategy to support data lakes
- Query your Iceberg tables in data lake using Amazon Redshift (Preview)
- Tabular File Loader: Hassle-Free File Ingestion to Iceberg Tables
- Google reveals BigQuery innovations to transform working with data
- Puppy Graph: Doing graph + tabular analytics directly on modern data lakes
- Building a Data Lakehouse using Apache Iceberg and MinIO
Iceberg Resources
🏁 Get Started with Apache Iceberg
👩‍🏫 Learn more about Apache Iceberg on the official Apache site
📺 Watch and subscribe to the Iceberg YouTube Channel
📰 Read up on some community blog posts
🫴🏾 Contribute to Iceberg
👥 SELECT * FROM you JOIN iceberg_community
📬 Subscribe to the Apache Iceberg mailing list