File compaction

The primary motivation for creating Apache Iceberg was to make transactions safe and reliable. Without safe concurrent writes, pipelines have just one opportunity to write data to a table. Additional changes are risky: queries might produce results from bad data, and writers could permanently corrupt a table. In short, write jobs are responsible for too much and must make tradeoffs, often leading to lingering performance issues like the “small files” problem.

With the reliable updates Iceberg provides, you can break data preparation down into separate tasks. Writers are responsible for transformation and for making the data available quickly. Performance optimizations like compaction are applied later as background tasks. And as it turns out, those deferred tasks are both reusable and easier to apply when they are separated from custom transformation logic.

This recipe shows how to run file compaction, the most useful maintenance and optimization task.

File compaction is not just a solution for the small files problem. Compaction rewrites data files, which is an opportunity to also recluster, repartition, and remove deleted rows.

Running simple compaction

To compact a table, use the rewrite_data_files stored procedure.

CALL system.rewrite_data_files(table => 'examples.nyc_taxi_yellow')

Without additional customization, rewrite_data_files will look for files in any partition that are either too small or too large and will rewrite them using a bin packing algorithm to combine files. It attempts to produce files that are the table’s target size, controlled by the table property write.target-file-size-bytes, which defaults to 512 MB.
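
The target size is a table property, so it can be tuned before compaction runs. As a minimal sketch, the Spark SQL DDL below lowers the target to 128 MB; the value is only illustrative, and the right size depends on your workload.

ALTER TABLE examples.nyc_taxi_yellow
    SET TBLPROPERTIES ('write.target-file-size-bytes' = '134217728')  -- 128 MB in bytes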

For large tables, you can pass a filter to restrict the files that are considered for compaction in order to make rewrite planning faster. Filters are passed as a SQL string using the where argument. This example compacts the last 24 hours of data in the nyc_taxi_yellow table:

CALL system.rewrite_data_files(
    table => 'examples.nyc_taxi_yellow',
    where => 'tpep_pickup_time >= current_timestamp() - INTERVAL 24 HOURS')

This is also a good way to avoid wasting effort rewriting data that won’t be read. For pipelines with late-arriving data, it makes little sense to compact once the majority of downstream consumers have already run. At that point, small files won’t be a significant problem, so compaction may not be worth the cost.
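
One way to put this into practice, sketched below, is to compact only a window that has had time to settle. The assumption that late data stops arriving within 24 hours is illustrative, not a rule.

CALL system.rewrite_data_files(
    table => 'examples.nyc_taxi_yellow',
    -- only touch data older than 24 hours but newer than 48 hours
    where => 'tpep_pickup_time < current_timestamp() - INTERVAL 24 HOURS AND tpep_pickup_time >= current_timestamp() - INTERVAL 48 HOURS')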

Reclustering with a sort order

Compaction and reclustering are very similar; the difference is the rewrite strategy. Rather than bin packing files together, the sort strategy sorts the data as it is rewritten. To recluster data while compacting, set the strategy to sort.

CALL system.rewrite_data_files(
    table => 'examples.nyc_taxi_yellow',
    strategy => 'sort')

The sort strategy requires a sort order; it uses the table’s write order by default. If the table doesn’t have a write order, you can pass a sort order as a SQL string using sort_order.

CALL system.rewrite_data_files(
    table => 'examples.nyc_taxi_yellow',
    strategy => 'sort',
    sort_order => 'tpep_pickup_time')
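
If you would rather attach the sort to the table itself, so that future sort compactions (and writers that honor the write order) pick it up by default, one option is to set a write order with Iceberg’s Spark DDL extensions. The single-column order here is just an example.

ALTER TABLE examples.nyc_taxi_yellow WRITE ORDERED BY tpep_pickup_time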

When using rewrite_data_files to recluster, you may also want to change how data files are selected. Instead of rewriting only files that are too small or too large, you can set the rewrite-all option to rewrite every data file that matches the where filter, regardless of size. Options are passed through the options argument, which is a string-to-string map.

CALL system.rewrite_data_files(
    table => 'examples.nyc_taxi_yellow',
    strategy => 'sort',
    where => 'tpep_pickup_time >= current_timestamp() - INTERVAL 24 HOURS',
    options => map('rewrite-all', 'true'))

There are many ways to change the behavior of rewrite_data_files using options, like customizing the size thresholds used to select files to rewrite. For a full list, see the Iceberg reference documentation.

Note: In addition to supporting a regular SQL sort order, rewrite_data_files can also apply a Z-order expression like this: sort_order => 'zorder(tpep_pickup_time, tpep_dropoff_time)'.
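
Putting that note together with the sort strategy, a full Z-order compaction call might look like the following sketch.

CALL system.rewrite_data_files(
    table => 'examples.nyc_taxi_yellow',
    strategy => 'sort',
    sort_order => 'zorder(tpep_pickup_time, tpep_dropoff_time)')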

Compacting deletes into data files

Rewrites always apply deletes, so rows that have been deleted are not written into the newly compacted files. This helps keep tables clean and performant, but in some cases you may want to rewrite primarily to make sure deletes have been applied. For example, complying with the GDPR “right to be forgotten” requires you to ensure that rows have been completely removed after an initial soft-delete.

To use rewrite_data_files to compact deletes into data files, set the delete-file-threshold option. This threshold is the number of delete files that must affect a data file to trigger rewriting it, regardless of its size. When set to 1, any data file affected by one or more delete files will be rewritten:

CALL system.rewrite_data_files(
    table => 'examples.nyc_taxi_yellow',
    options => map('delete-file-threshold', '1'))

By default, data files won’t be rewritten solely based on the number of delete files; the default threshold is Integer.MAX_VALUE. It’s a good idea to set a threshold that prevents too many delete files from accumulating (for example, 10).
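
As a sketch of that advice, the same call with delete-file-threshold set to 10 rewrites only data files that have accumulated ten or more delete files; the threshold of 10 simply follows the example above.

CALL system.rewrite_data_files(
    table => 'examples.nyc_taxi_yellow',
    options => map('delete-file-threshold', '10'))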

More information than you require

Although it is simple to use, there is a lot of power packed into rewrite_data_files. The implementation automatically breaks the work down into separate jobs by partition to avoid attempting to process too much data at once. It also runs those jobs in parallel and accumulates the changes in order to commit them all at once.

This design exposes some useful options when running rewrites at scale:

  • max-file-group-size-bytes – size threshold to split partitions into multiple jobs
  • max-concurrent-file-group-rewrites – number of concurrent rewrite jobs
  • partial-progress.enabled – whether to use multiple commits as jobs complete
  • partial-progress.max-commits – number of commits to use if partial progress is enabled
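
For example, a large rewrite might raise concurrency and enable partial progress so that work is committed incrementally instead of in one large transaction. The values below are assumptions meant to illustrate the shape of the call, not tuning advice.

CALL system.rewrite_data_files(
    table => 'examples.nyc_taxi_yellow',
    options => map(
        'max-concurrent-file-group-rewrites', '10',  -- run up to 10 file group rewrites at once
        'partial-progress.enabled', 'true',          -- commit as groups finish
        'partial-progress.max-commits', '10'))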

You can also use job options to set specific thresholds for the target size, when to consider files too small or too large, etc.:

  • target-file-size-bytes – the target file size, defaults to the table’s target file size
  • min-file-size-bytes – files smaller than this will be rewritten, defaults to 75% of the target
  • max-file-size-bytes – files larger than this will be rewritten, defaults to 180% of the target
  • delete-file-threshold – number of delete files to trigger a rewrite, defaults to Integer.MAX_VALUE
  • rewrite-all – whether to rewrite all data files matching the filter, defaults to false
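
These thresholds are passed through the same options map. As an illustration, the sketch below overrides all three size settings for a single run; the byte values (a 128 MB target with 96 MB and 256 MB bounds) are arbitrary examples.

CALL system.rewrite_data_files(
    table => 'examples.nyc_taxi_yellow',
    options => map(
        'target-file-size-bytes', '134217728',  -- 128 MB
        'min-file-size-bytes', '100663296',     -- 96 MB
        'max-file-size-bytes', '268435456'))    -- 256 MB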