Why Apache Iceberg — for data warehouse users

Major data warehouse platforms such as Google BigQuery, Snowflake, AWS, and Databricks have all announced support for Apache Iceberg tables. Commercial warehouse engines seldom add support for new data formats, and it's even more noteworthy that these vendors are all doing so at the same time.

If you’re looking at Iceberg from a data lake background, the attraction is obvious: transactions are safe so queries never lie, queries can time travel to historical versions, partitioning is automatic and can be updated, schema evolution is reliable — no more zombie data! — and a lot more.
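
To make those features concrete, here is a minimal sketch of what a few of them look like through Spark SQL. It assumes a Spark session already configured with the Iceberg runtime and SQL extensions; the catalog and table names (demo.db.events) are illustrative, not from any particular deployment.

```python
# Minimal sketch of Iceberg features via Spark SQL (illustrative names).
# Assumes a SparkSession configured with the Iceberg runtime, the Iceberg
# SQL extensions, and a catalog named "demo".
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (device_type string)")

# Partition evolution: change how new data is partitioned; older data
# keeps its original layout and stays queryable.
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD days(event_ts)")

# Time travel: query the table as of an earlier point in time.
spark.sql(
    "SELECT count(*) FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()
```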

However, an experienced data warehouse user might say, “I already have SQL behavior and ACID guarantees, so why should I care about Iceberg?”

Well, data lakes provide compute flexibility that has been lacking in data warehouses. For example, Trino prioritizes speed and is great for ad hoc federated SQL queries. Spark prioritizes reliability and can mix Python, Scala, and SQL together. And there are more examples, such as Apache Flink for streaming or Python DataFrames that run on GPUs.

On the other hand, data warehouses are built on the assumption that all data access is through a tightly-coupled query layer. Iceberg’s appeal is that you aren’t forced to choose between a larger toolbox (data lakes) and powerhouse SQL engines (data warehouses). You can have a data architecture that incorporates both, sitting on top of a shared Iceberg storage layer without compromising SQL behavior or ACID guarantees.
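
As a sketch of what that shared layer looks like from the Python side of the toolbox, the snippet below reads an Iceberg table directly with PyIceberg, without going through any warehouse query endpoint. The catalog name "prod" and the table db.events are illustrative and assume a PyIceberg catalog is already configured.

```python
# Sketch: a plain Python process reading an Iceberg table directly.
# Assumes a PyIceberg catalog named "prod" is configured (for example in
# ~/.pyiceberg.yaml); the table and column names are illustrative.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("prod")
table = catalog.load_table("db.events")

# Plan a filtered, projected scan and materialize it as a pandas DataFrame.
df = table.scan(
    row_filter="user_id >= 100",
    selected_fields=("user_id", "event_ts"),
).to_pandas()

print(df.head())
```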

What you gain when you decouple query from storage

There are a growing number of ways to work with data, and people want the right tool for the job: loading data into hundreds of Python containers in parallel to test model parameters, using a streaming framework to sessionize events, or querying the same tables from a BI tool while Spark jobs consume them.

These tasks are not easy to execute in a data warehouse.

If you spin up 1,000 Python jobs at once to read the same warehouse table, your chances of success are low. This is the “thundering herd” problem. You can fix it by scaling up query capacity, but making that work is neither simple nor cheap. Stream processing is also at odds with a query-centric view of data, so it ends up depending on custom integrations and protocols, or on messy polling and deduplication.
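
To see why decoupling sidesteps the thundering herd, consider how parallel Python readers could get their work from an Iceberg table: planning happens against table metadata, and each worker reads its own slice of data files straight from object storage. A rough sketch with PyIceberg, using illustrative catalog and table names:

```python
# Rough sketch: fanning a table scan out to many Python workers.
# Planning reads only Iceberg metadata; the data files themselves are read
# straight from object storage, so there is no single query endpoint to
# overwhelm. Catalog and table names are illustrative.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("prod")
table = catalog.load_table("db.events")

# Each FileScanTask points at one data file (plus any associated deletes).
tasks = list(table.scan().plan_files())

# Split the tasks into shards, one per worker; each worker would read only
# the files in its shard (task.file.file_path).
num_workers = 1000
shards = [tasks[i::num_workers] for i in range(num_workers)]
print(f"{len(tasks)} files planned into {len(shards)} shards")
```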

By building support for Iceberg, data warehouses can skip the query layer and share data directly. Iceberg assumes there is no single query layer: many different processes all share the same underlying data and coordinate through SQL behavior applied to a table format, along with a very lightweight catalog.
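
Here is a rough sketch of that coordination from the Spark side, assuming an Iceberg REST catalog is reachable at an illustrative address. Any engine pointed at the same catalog sees the same tables, and commits are atomic snapshot swaps, so readers see either the old version of a table or the new one.

```python
# Sketch: pointing a Spark session at a shared Iceberg REST catalog.
# The catalog name "shared", the endpoint, and the table names are
# illustrative; any engine configured against the same catalog sees the
# same tables and the same committed snapshots.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.shared", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.shared.type", "rest")
    .config("spark.sql.catalog.shared.uri", "http://iceberg-catalog:8181")
    .getOrCreate()
)

# A commit is an atomic swap of the table's current snapshot in the catalog,
# so concurrent readers see either the previous snapshot or this new one.
spark.sql("INSERT INTO shared.db.events SELECT * FROM shared.db.staged_events")
```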

How does Iceberg change modern data architecture?

With Iceberg, modern data platforms get a lot more options. An independent storage layer lets you use the processing pattern or query layer that is best for your task and your team within a unified data architecture, without maintaining pipelines to copy data in or out for each tool.

Iceberg also protects you from lock-in. It is an open standard governed by the Apache Software Foundation, and it grew out of collaboration across a broad community of companies, including Netflix and Apple. That community is trusted by vendors such as Google, AWS, and Snowflake, and that trust is incredibly important: it means the companies you depend on can confidently invest time and R&D into Iceberg and build support for it in their products.