Introduction from the original creators of Iceberg

By Ryan Blue and Daniel Weeks, Iceberg PMC Members

Apache Iceberg is now the de facto open format for analytic tables. That has been a surprisingly swift rise, moving from primarily large tech companies like Netflix and Apple to near-universal support from major data warehouses for use by their customers in about 18 months. That’s a stunning development that speaks to the massive value of a truly open and ubiquitous table format.

Iceberg makes data teams productive

When Dan and I created Iceberg, our aim was simply to help people be productive. Our internal customers at Netflix were constantly running into problems that our Apache Hive-based data platform left unsolved:

Correctness – Hive tables lacked ACID transactions so updating a table would regularly corrupt query results, or the tables themselves, if there were concurrent writes or failures.
Performance and scale – Listing directories to find data files was too slow and lacked the metadata needed to filter files within partition directories.
Costly distractions – Regular schema evolution would corrupt table data and leak downstream, writers had to worry about data file sizes, and readers needed to understand a table’s physical layout to construct efficient queries.

These problems affect anyone using Hive-like tables, but at Netflix the effects of these flaws were amplified by using Amazon S3 as our source of truth since Hive was built for HDFS rather than object stores. 10x higher latencies made directory listing impractical and eventually consistent listing made correctness problems more common. The combination eroded trust in our data products.

For a while, we applied band-aid solutions: we hijacked listing to catch consistency issues, we forked the Hive Metastore (HMS) to make partition swapping safe, we used multipart uploads to avoid copies, and we turned off guarantees in our S3 file system to avoid extra requests. But we were only able to make certain changes safe, and the performance improvements were only incremental. We needed something more.

In all three areas — correctness, performance, and distractions — the Hive table format was the limiting factor. Its method of tracking files could not support transactions and prevented us from tracking the metadata needed for a step-function increase in performance and scale. Manual partitioning was a persistent source of human error. Type support was inconsistent across file formats and schema evolution was dangerous.

We created Iceberg to address these challenges.

We initially focused on ACID transactions and schema evolution — correctness was the top priority — but we knew this was an important once–in-a-decade opportunity. Replacing the foundation of our data platform was not something we wanted to do again, so we took the time to improve performance and solve many other challenges, too.

Iceberg has grown to avoid or solve many more problems:

Hidden partitioning eliminates unnecessary data quality errors.
Time travel makes results reproducible and helps build confidence in results.
MERGE and other row-level commands let people easily express update logic while also unlocking performance optimizations in query engines.
Branching enables new engineering patterns such as write-audit-publish for better testing and data validation.
Tagging keeps important table versions around without expensive copies.
In-place table evolution avoids disrupting data consumers as data size or shape changes.
Table-level configuration unlocks dynamic, automatic optimization.
Metadata indexing makes even massive datasets available without requiring a distributed engine — in Python or DuckDB, for example.

Looking back, the focus on data engineering and data consumer productivity has been a major factor in Iceberg’s success. Iceberg takes on hard problems so people can stop worrying about them.

In fact, people shouldn’t think about Iceberg either, just as they don’t think about the low-level details of storage in Postgres or Snowflake. Ideally, Iceberg should be invisible. What’s important is the SQL table abstraction: users should be allowed to focus on their data and use declarative tools, while the database handles the details.

That’s the core vision of the project: Iceberg is an open standard for tables with SQL behavior.

Shared storage unlocks modular data architecture

When we created Iceberg, we wanted to add the reliability and guarantees of a SQL data warehouse to our data lake. What we didn’t realize was that we would end up doing the opposite as well. Now, data warehouses are incorporating Iceberg to add an important capability from data lakes: shared storage.

This was a happy accident. We set out to solve concrete challenges: to ensure ETL pipelines didn’t cause correctness problems in concurrent ad-hoc queries, and to enable asynchronous compaction while committing new streaming data every few minutes. To do that, we needed a way for multiple query layers to work with the same table at the same time, safely. Coming from the data lake world, supporting multiple engines with one table was simply a necessity. But separating the storage layer from the query layer — without compromising performance or reliability — is fueling an unprecedented transformation in the data industry.

This transformation starts with the flexibility to use any engine or framework with one central table store But it doesn’t stop there. Sharing a storage layer is fundamentally different and it is going to reshape data lakes and data warehouses.

We are already seeing this happen. For instance, sharing tables has heightened the need for universal governance. As a result, access controls must move to the storage layer to secure data; locking down multiple access patterns individually and keeping them in sync is costly and unreliable.

The aggregate result of this and similar changes will be data infrastructure that is not just flexible but truly modular: components will connect easily and fit together like Legos.

As it is with Iceberg, the purpose of modular architecture is productivity. Configuring an engine shouldn’t require time-consuming manual integration or research to see if it plays nicely with the rest of your infrastructure. Security should be universal and on by default, not off until you have the time to wire it up. Your toolbox should span Python scripts to sophisticated data warehouses. Trying out new tools should take days, not weeks.

Iceberg tables are the foundational component of modular data architecture, but open protocols like OAuth2 and Iceberg’s new REST catalog spec are also essential.

Iceberg disrupts the data industry model

Architecturally, a shared storage layer accessible from multiple tools and engines clearly makes sense. Storage that is tightly coupled to one query engine is less useful because the engine can’t be good at everything. Until now, that has led to a difficult choice: copy data to a different engine, or use it where it’s stored.

Copying has significant costs: it duplicates data, requires ongoing synchronization, makes governance more difficult, and is harder to trust. Sharing data in place avoids these costs entirely. It’s much easier to use any tool for a task or to try out multiple options to see which one works best. In short, a zero-copy modular architecture relieves the pressure to choose where to run queries based on where data is already stored.

As a result, shared storage disrupts the monolithic business model that has dominated the database industry since its inception. When storage and compute are tightly coupled, compute workloads are won based on where the data lives, instead of what solution is most maintainable, cost-effective, or performant.

Shared storage puts you back in control. You permit the data warehouse to access your data, not the other way around. Instead of being stuck in silos, data is more accessible, more useful, more performant, and more secure.

The Iceberg Cookbook: a practical guide to Iceberg

Clearly, we’re excited about the modular data architecture that Iceberg unlocks. But the purpose of this cookbook is to return to our roots: helping people be productive. This book is intended to be a practical guide to help you understand where Iceberg fits, how to take advantage of its features, and how to troubleshoot issues when things go wrong. We hope you find it useful.

Lastly, remember that Iceberg isn’t just a project, it’s a community. If you’re new, welcome! If you’ve been here a while, see if you can help someone. Communities thrive when people step up, so please answer questions, share how you solve problems, and contribute pull requests.

And also to report errors or provide feedback on these recipes, please email [email protected]!

Dan & Ryan

Apache Iceberg Cookbook

Introduction

Getting Started

Basics

Data Engineering

Pyiceberg

Data Operations

Migrating to Iceberg