Data operations with Apache Iceberg

Apache Iceberg tables require regular maintenance. This often surprises people who are new to Iceberg-based data architecture: why does Iceberg need maintenance when Hive tables don’t?

There are three good reasons:

  1. Iceberg unlocks background updates – Iceberg solves the problem of safely coordinating multiple writers, which makes it possible to break work into simpler, more reliable pieces. Previously, writers had to balance making data available quickly (frequent writes) against the performance problems of small files, while ideally also clustering data for downstream consumption. With Iceberg, a streaming writer can make data available quickly and a background maintenance task can cluster and compact it for long-term performance.
  2. Hive tables are unsafe – changing a Hive table in place can expose readers to partial or inconsistent data, so Iceberg supports atomic updates with an optimistic approach instead: writers create parallel snapshots of a table and use an atomic swap to switch between them. Old snapshots must be kept around until readers are no longer using them. The downside of this model is that snapshots need to be cleaned up later, or else old data files accumulate indefinitely.
  3. Actually, Hive does require maintenance – job failures can cause orphaned data files in Hive tables, too, and small files are a notorious performance problem. Because Hive offers no safe way to fix these problems in place, they are often accepted as unavoidable.

In short, table maintenance is unavoidable in modern formats and, in many cases, breaking work down into separate writes and data maintenance is a better operational pattern.

This chapter has recipes for the most common operations needed to keep tables performant and cost-effective with minimal effort (all three are sketched in Spark SQL after this list):

  • Data compaction asynchronously rewrites data files to fix the small files problem, but can also cluster data to improve query performance and remove rows that have been soft-deleted.
  • Snapshot expiration removes old snapshots and deletes data files that are no longer needed.
  • Orphan file cleanup identifies and deletes data files that were written but never committed because of job failures.
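
To give a concrete sense of these operations, here is a minimal Spark SQL sketch of all three. The catalog name (catalog), table name (db.events), and cutoff timestamp are placeholders; tune the arguments to your own compaction and retention policies.

    -- Compact small data files (binpack is the default strategy)
    CALL catalog.system.rewrite_data_files(table => 'db.events');

    -- Expire snapshots older than a cutoff, keeping at least the last 10
    CALL catalog.system.expire_snapshots(
      table => 'db.events',
      older_than => TIMESTAMP '2023-06-30 00:00:00',
      retain_last => 10
    );

    -- Delete files under the table location that no snapshot references
    CALL catalog.system.remove_orphan_files(table => 'db.events');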

The Iceberg library provides these operations as stored procedures in Spark, as sketched above, but there are a variety of ways to run them. For example, Trino also supports snapshot expiration, orphan file cleanup, and compaction through table procedures that are easy to configure and run.
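
As a rough illustration, Trino runs the same maintenance through ALTER TABLE ... EXECUTE, with iceberg.db.events below standing in for your catalog, schema, and table:

    -- Compact small files
    ALTER TABLE iceberg.db.events EXECUTE optimize;

    -- Remove snapshots older than the retention threshold
    ALTER TABLE iceberg.db.events EXECUTE expire_snapshots(retention_threshold => '7d');

    -- Remove unreferenced files older than the retention threshold
    ALTER TABLE iceberg.db.events EXECUTE remove_orphan_files(retention_threshold => '7d');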

Most of the time, these operations are the responsibility of data platform administrators. They are often built into Iceberg-based platforms so you don’t need to worry about them. But what if you are the administrator, or a data engineer at a small company, or just curious and want to learn more? Then this chapter is for you. It covers the Spark SQL procedures because they are the easiest to consume.