INTRODUCTION
If you have been working with a data lake, you’re probably very familiar with its drawbacks. You’re in luck: Iceberg was created to address the shortcomings of data lakes. Its purpose is to fix day-to-day problems and then get out of the way, so you can focus on high-value work without worrying about the low-level details of the files, formats, and storage structure underneath your tables.
The Apache Hive table format has issues
Hive’s table format is the de facto standard for tracking where data resides in a file system in order to make queries more efficient. To do this, it organizes files into a directory structure and uses that structure to determine which files to read when responding to a query.
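For example, a Hive table partitioned by date might be laid out as follows (a hypothetical logs table; actual paths depend on the warehouse location). A query that filters on event_date reads only the matching directories:

```
/warehouse/logs/                  <- table root
  event_date=2023-06-01/          <- one directory per partition value
    part-00000.parquet
    part-00001.parquet
  event_date=2023-06-02/
    part-00000.parquet
```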
The Hive approach is too simplistic and creates the following problems:
- There are no transactional guarantees.
- There isn’t enough metadata to perform well at scale.
- Schema evolution is unpredictable and leads to correctness errors.
- Partitioning is manual and error-prone.
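To make the last two problems concrete, here is a sketch in Hive-style SQL (the table and column names are hypothetical). Because the partition column is stored separately from the timestamp it is derived from, writers must compute it by hand and readers must know to filter on it:

```sql
-- Writers must derive the partition value (event_date) themselves;
-- a typo here silently writes data into the wrong partition.
INSERT INTO logs PARTITION (event_date = '2023-06-01')
SELECT level, message, event_time
FROM raw_logs
WHERE event_time >= '2023-06-01' AND event_time < '2023-06-02';

-- Readers who filter only on event_time get correct results, but Hive
-- cannot prune partitions, so the query scans the entire table.
SELECT count(*)
FROM logs
WHERE event_time >= '2023-06-01' AND event_time < '2023-06-02';
```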
Iceberg tables are safe, reliable, and performant
Iceberg was created to solve these problems and upgrade a data lake to deliver the guarantees and usability of a data warehouse.
- Iceberg table updates have ACID guarantees.
- Schema evolution follows SQL rules, and Iceberg data types behave dependably across file formats.
- Iceberg handles partitioning automatically. Data is written correctly according to the table configuration, eliminating extra work for writers and increasing data quality. Readers don’t need to understand the file layout to query a table (see the SQL sketch after this list).
- Iceberg indexes both data and metadata to speed up queries and scale to tens or hundreds of petabytes in a single table.
- Time travel and rollback – Iceberg’s design keeps track of old table versions so you can query old snapshots and roll back to historical states.
- Branching and tagging – Iceberg allows you to tag important table versions. You can also create branches to make testing and auditing easier.
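Here is a sketch of what hidden partitioning, time travel, rollback, tagging, and branching look like in Spark SQL with the Iceberg extensions enabled (the demo catalog, table names, and snapshot ID are hypothetical):

```sql
-- Hidden partitioning: partition by a transform of event_time;
-- writers and readers never manage a separate partition column.
CREATE TABLE demo.db.logs (level STRING, message STRING, event_time TIMESTAMP)
USING iceberg
PARTITIONED BY (days(event_time));

-- Time travel: query the table as it was at an earlier point in time.
SELECT count(*) FROM demo.db.logs TIMESTAMP AS OF '2023-06-01 00:00:00';

-- Rollback: restore the table to a known-good snapshot.
CALL demo.system.rollback_to_snapshot('db.logs', 8744736658442914487);

-- Tag an important version and branch the table for testing or auditing.
ALTER TABLE demo.db.logs CREATE TAG end_of_quarter;
ALTER TABLE demo.db.logs CREATE BRANCH audit;
```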
Iceberg brings SQL behavior to the data lake
Iceberg’s key feature is correctness: it guarantees reliable transactions and safe schema changes. This enables declarative data engineering on a data lake. For instance, expressive row-level SQL commands such as MERGE are supported, and optimization services such as file compaction can safely run as background processes.
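As an illustration, here is what a row-level upsert and an explicitly invoked compaction pass might look like in Spark SQL (the demo catalog and table names are hypothetical; rewrite_data_files is Iceberg’s built-in compaction procedure):

```sql
-- Row-level MERGE: update matching rows, insert everything else.
MERGE INTO demo.db.accounts AS t
USING updates AS u
ON t.account_id = u.account_id
WHEN MATCHED THEN UPDATE SET t.balance = u.balance
WHEN NOT MATCHED THEN INSERT *;

-- Compaction normally runs as a background service, but it can also be
-- invoked directly; it rewrites small files into fewer, larger ones.
CALL demo.system.rewrite_data_files(table => 'db.accounts');
```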
Fixing transactions for the data lake environment has made it easy for data warehouses to expand their supported storage options by adopting Iceberg, as evidenced by recent announcements from Snowflake, Amazon Redshift, and Google BigQuery. Iceberg unlocks the promise of shared storage, where a single storage layer serves both data lake and data warehouse use cases.
Benefits of adopting Iceberg
Iceberg delivers on the original promise of the data lake by providing the abstraction and guarantees of a database, without being captive to a particular compute layer. This has enormous implications for a business’s data operations. The benefits of Iceberg include:
Data engineers’ jobs get a lot easier
Iceberg eliminates or simplifies a number of data engineering tasks traditionally required to build and maintain a data lake.
Data consumers can “move fast and not break things”
Data consumers get to “bring their own compute” and choose query engines based on their workload requirements. They can leave the details to Iceberg by using declarative SQL.
The business gets simpler and more affordable architecture, without lock-in
With Iceberg as a shared storage layer, adding new compute options doesn’t require copying data. A zero-copy data stack reduces the expense of storing data, simplifies governance by centralizing policy management, and eliminates the risk of failing to synchronize data or policies across multiple stores.
Iceberg transforms your data lake
In short, Iceberg fast-forwards the data lake from the “duct tape and rubber bands” inherited from Hadoop’s organic development to a safe and performant foundational storage layer. With Iceberg at the core, your data lake becomes as reliable as a data warehouse, while enabling teams to use the tools and frameworks that work best for them.