Data engineering with Apache Iceberg

DATA ENGINEERING

Data engineers starting at Netflix attend (or used to, at least) a few hours of orientation to become familiar with Netflix’s data platform and engineering best practices. Part of that orientation is a section on Iceberg, and the central point of that session is that you should not care about Iceberg; Iceberg should be invisible. That’s still a fundamental aim of the project: to free data engineers and analysts from needless distractions so they can focus on high-value work.

Avoiding distractions

Many of Iceberg’s core features aim directly to reduce distractions and get out of the way. As a data engineer, some of this things you can stop worrying about include:

Cleaning up avoidable mistakes – Hidden partitioning, schema evolution and SQL behavior eliminate many mistakes that cause silent correctness errors in Hive-like formats and leak bad data downstream. These features also save you time building and updating pipelines.
Training people to query your tables – Consumers don’t need to know about the physical layout of a table in order to efficiently query tables. This avoids the need to educate those consumers, and it eliminates the risk of them writing inefficient queries.
Copying, syncing and migrating tables – Metadata indexing enables tables to handle a broader range of query patterns, eliminating the need to create bespoke table copies with special layouts. Broad industry support for Iceberg avoids the need to copy (and continuously sync) data for use with multiple compute engines. In-place table maintenance helps you avoid painful table migrations that require rewriting every query that touches a table.
File format details and data types – Iceberg data type and schema evolution behavior is consistent and dependable across file formats.
Manual, file-level performance optimization – the ability to make changes safely enables compacting or clustering data asynchronously, rather than wasting engineering cycles manually tuning file sizes and attempting to balance write latency with query performance.

Better data engineering through declarative patterns

In addition to solving these tedious and time-consuming problems, Iceberg also unlocks better engineering practices by incorporating a declarative approach whenever possible. In Iceberg, you configure tables, not jobs or query engines.

A great example of a declarative approach is using MERGE to update tables. The MERGE command is expressive; it lets you easily specify how to update each row given the incoming data that matches it. The command is more understandable and easier to write than using a combination of JOIN and column-level expressions. MERGE is declarative: it tells the query engine what to do and lets the engine determine how to do it. The engine can then apply optimizations to improve performance beyond what engineers would produce by hand. That separation of concerns — the declarative approach — is what makes SQL powerful.

Hidden partitioning is another example of a declarative approach. It is an optional table configuration that engines handle automatically, instead of relying on an engineer to derive partition values in every write job. Similarly, Iceberg tables can declare properties for write order, target file size, and other tuning settings that engines use to prepare data at write time for better read performance. This declarative approach makes engines responsible for managing storage details and lets engineers focus on the task at hand, like data modeling, quality and governance.

This chapter includes recipes that demonstrate declarative tools, as well as patterns based on other features that improve the data engineering experience, including:

Creating tables with hidden partitioning and understanding the differences vs. Hive.
Setting table write order to ensure consistent ordering and clustering across all writers (including compaction and other background services).
Using the MERGE command to create an idempotent pipeline that applies incoming changes to existing rows in a table.
Investigating data quality issues with time travel and restoring to a healthy state with rollback.
Using branches to increase data quality with better auditing and testing patterns.
Using tagging to efficiently keep historical versions and easily manage their lifecycle.
Incrementally processing data from an Iceberg table.

Apache Iceberg Cookbook

Introduction

Getting Started

Basics