Clean up orphan files

DATA OPERATIONS

Cleaning up orphan files — data files that are not referenced by table metadata — is an important part of table maintenance that reduces storage expense. This recipe shows you how to use Apache Spark to identify and delete orphan files.

What are orphan files and what creates them?

Orphan files are files in the table’s data directory that are not part of the table state. As the name suggests, orphan files aren’t tracked by Iceberg, aren’t referenced by any snapshots in a table’s snapshot log, and are not used by queries.

Orphan files come from failures in the distributed systems that write to Iceberg tables. For example, if a Spark driver runs out of memory and crashes after some tasks have successfully created data files, those files will be left in storage, but will never be committed to the table.

The challenge with orphan files

Because orphan files aren't referenced in table metadata, they can't be removed by normal snapshot expiration, so they accumulate over time. As they accumulate, storage costs continue to add up, so it's a good idea to find and delete them regularly. The recommended best practice is to run orphan file cleanup weekly or monthly.

Deleting orphan files can be tricky. It requires comparing the full set of files referenced by a table against the current set of files in the underlying object store. That comparison makes it a resource-intensive operation, especially if you have a large volume of files in the data and metadata directories.
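
To get a sense of the referenced side of that comparison, you can query Iceberg's all_files metadata table, which lists the files referenced by the table's valid snapshots. A minimal sketch, using the example table from the procedures below:

-- List the data files referenced by the table's snapshots; anything under
-- the table location that is not in this set (and is not metadata) is an
-- orphan candidate.
SELECT file_path
FROM examples.nyc_taxi_yellow.all_files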

In addition, files may appear orphaned when they are part of an ongoing commit operation. Iceberg uses optimistic concurrency, so writers create all of the files that are part of an operation before the commit. Until the commit succeeds, those files are unreferenced. To avoid deleting files that are part of an in-flight commit, maintenance procedures use an older_than argument: only files older than this threshold are considered orphans. By default, this time window is 3 days, which should be more than enough time for in-flight commits to succeed.

Deleting orphan files using Spark

To clean up orphan files using Spark, use the remove_orphan_files procedure from the system namespace (if your session's current catalog isn't the Iceberg catalog, prefix the call with the catalog name, as in CALL my_catalog.system.remove_orphan_files).

CALL system.remove_orphan_files(
    table => 'examples.nyc_taxi_yellow')

This cleans up any unreferenced files under the table's location. By default, files are considered orphans only if they are more than 3 days old. You can customize the threshold by passing a timestamp as older_than.

CALL system.remove_orphan_files(
    table => 'examples.nyc_taxi_yellow',
    older_than => TIMESTAMP '2020-01-19 03:14:07')
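
Because deleting a large backlog of orphans can dominate the runtime, newer Iceberg releases also accept a max_concurrent_deletes argument that sets the size of the thread pool used for deletes. A sketch, assuming a release that supports the argument:

-- Delete orphans using up to 8 threads; requires an Iceberg release that
-- supports max_concurrent_deletes.
CALL system.remove_orphan_files(
    table => 'examples.nyc_taxi_yellow',
    older_than => TIMESTAMP '2020-01-19 03:14:07',
    max_concurrent_deletes => 8)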

Dry run validation

Deleting orphan files can be dangerous because a misconfigured cleanup can delete valid data. There are two main risks:

  1. If two tables share the same location, cleaning up orphan files in one table can delete files that are valid in the other table. It’s important that tables always use unique locations. If locations are shared, don’t run orphan file cleanup.
  2. If file listing produces paths that differ from the data file locations stored in table metadata, valid files can be misidentified as orphans. Iceberg normalizes paths to avoid this, but some equivalent paths may not be recognized by default, such as HDFS NameNode HA paths that point to the same files (see the sketch after this list).
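
For the second risk, newer Iceberg releases let you declare which path forms are equivalent: equal_schemes maps equivalent file system schemes, and equal_authorities maps equivalent authorities such as HDFS NameNode HA aliases. A sketch, assuming a release with these arguments; the authority names are hypothetical:

-- Treat s3a:// paths as equivalent to s3://, and treat the HA alias
-- 'ha-nn' as the same authority as 'namenode:8020' (hypothetical names).
CALL system.remove_orphan_files(
    table => 'examples.nyc_taxi_yellow',
    equal_schemes => map('s3a', 's3'),
    equal_authorities => map('ha-nn', 'namenode:8020'),
    dry_run => true)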

A best practice when configuring orphan file cleanup is to first inspect the list of files that would be deleted. You can do this by using the dry_run option in the Spark stored procedure.

CALL system.remove_orphan_files(
    table => 'examples.nyc_taxi_yellow',
    dry_run => true)

Spark will return the list of files considered orphans, one orphan_file_location per row, that you can validate. If the list is suspiciously long, investigate before removing the dry_run option!