Using Hidden Partitioning

DATA ENGINEERING

This recipe shows how to use Apache Iceberg’s hidden partitioning to improve query performance while avoiding data quality and usability problems.

What is partitioning?

Partitioning is a data engineering technique for improving query performance by dividing rows into groups that enable efficient filtering. For example, you might store a table of log data partitioned into files by log level (DEBUG, INFO, WARN, and ERROR). That is, each file contains logs with just one level and a file’s metadata keeps track of what level it contains. That makes it easy to find ERROR logs without needing to sift through a much larger volume of INFO or DEBUG messages.

What is hidden partitioning?

Iceberg’s hidden partitioning improves on Hive’s approach by making partitioning declarative. Readers and writers use the table like any other SQL table and Iceberg handles the rest.

Hive partitions data using special columns where every unique value creates a separate partition. This works well for categorical columns like log level, but can’t directly handle columns with high cardinality like timestamps. In practice, columns are transformed to produce values that are suitable for partitioning. For instance, a log table might be partitioned by a date column, event_date, derived from the event_ts column.
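To make this concrete, here is a hedged sketch of the Hive-style layout (the hive_logs table and its columns are invented for this example):

-- Hive-style table: event_date is a physical partition column that
-- writers must derive from event_ts themselves
CREATE TABLE hive_logs (
  level string,
  event_ts timestamp,
  message string
)
PARTITIONED BY (event_date date)

-- Readers must remember to filter on event_date as well as event_ts,
-- or the query scans every partition
SELECT *
FROM hive_logs
WHERE event_ts >= TIMESTAMP '2023-12-25 00:00:00'
  AND event_ts < TIMESTAMP '2023-12-26 00:00:00'
  AND event_date = DATE '2023-12-25'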

This approach leads to several problems:

  1. Partition values are not validated so it’s easy for writers to make mistakes – for example, using the wrong time zone to produce event_date, or assuming that all rows in a data file are from the same day.
  2. Readers need to know about the partitioning scheme and take extra steps to benefit from it. A reader must remember to filter on both event_ts and event_date, and is likely to make the same mistakes that trip up writers. Often, the extra filter is simply overlooked.
  3. Since queries reference the extra partition columns, it is impractical to change partitioning schemes, because it requires rewriting every query to account for the new table layout.

Iceberg solves these problems by hiding partitioning from readers and writers. Iceberg is responsible for producing partition values and for using them at query time.

When you create a table, you can configure how to partition it using expressions like day(event_ts). These expressions tell Iceberg how to derive partition values at write time. At read time, Iceberg uses these relationships to automatically convert data filters into partition filters.

Creating partitioned tables

To create a partitioned table in Apache Spark, add the PARTITIONED BY clause with one or more partition expressions. Here’s the command for the example log table, which partitions by log level and by day of the event timestamp:

CREATE TABLE logs (
  level string,
  event_ts timestamp,  -- Spark's timestamp maps to Iceberg's timestamptz
  message string
)
USING iceberg
PARTITIONED BY (
  level,
  days(event_ts)
)
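
To confirm the partition spec, you can describe the table; recent Spark versions list the partition fields in a partitioning section (output abridged here, and the exact layout varies by Spark version):

DESCRIBE TABLE logs
-- ...
-- # Partitioning
-- Part 0  level
-- Part 1  days(event_ts)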

Partition transforms

Iceberg has a well-defined set of transforms that can be used in partition expressions. Defining this set in the Iceberg spec was important because it ensures that tables are portable across languages and frameworks.

The available partition transforms include:

  • year, month, day, and hour – Partitions date/time values by hour, day, and so on
  • bucket(N, any_col) – Pseudo-randomly divides rows across N buckets
  • truncate(L, string_col) – Truncates a string value to length L
  • truncate(W, numeric_col) – Truncates numbers to width W; for example truncate(100, 314) produces 300

Note: Spark uses slightly different names than Iceberg for the year, month, day, and hour transforms. This differentiates them from built-in functions with the same names. In Spark, the transforms use the plural form: years, months, days, and hours.
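
For example, here is a sketch that combines several transforms in one partition spec (the events table and its columns are invented for this example):

CREATE TABLE events (
  id bigint,
  name string,
  created_at timestamp
)
USING iceberg
PARTITIONED BY (
  bucket(16, id),     -- pseudo-randomly spread rows across 16 buckets
  truncate(4, name),  -- group rows by the first 4 characters of name
  days(created_at)    -- daily partitions; note Spark's plural form
)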

Writing with hidden partitioning

To write into the logs table, you only need to provide values for the data columns. This not only avoids human error but also simplifies MERGE and INSERT statements, which can become large and complex.

INSERT INTO logs VALUES
    ('ERROR', TIMESTAMP '2023-12-25T10:31:40', 'Failed to ...'),
    ('INFO', TIMESTAMP '2023-12-25T11:14:51', 'Updating ...'),
    ('INFO', TIMESTAMP '2023-12-25T11:14:52', 'Loading ...')
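
The same applies to row-level operations. Here is a hedged MERGE sketch, assuming a hypothetical log_updates staging table with the same data columns:

MERGE INTO logs t
USING log_updates s  -- hypothetical source of corrected log rows
ON t.level = s.level AND t.event_ts = s.event_ts
WHEN MATCHED THEN UPDATE SET t.message = s.message
WHEN NOT MATCHED THEN INSERT (level, event_ts, message)
    VALUES (s.level, s.event_ts, s.message)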

Engines such as Spark will automatically redistribute rows across tasks to keep the number of data files small. To learn more about distribution, see the recipe on write order and clustering.

Reading with hidden partitioning

Queries that filter on partition source columns automatically use partition filters to avoid reading unnecessary data and metadata files:

SELECT
    count(*) AS error_count,
    CAST(event_ts AS date) AS log_date
FROM logs
WHERE level = 'ERROR' AND
    event_ts >= DATE '2023-12-25' AND
    event_ts < DATE '2023-12-26'
GROUP BY CAST(event_ts AS date)
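
To check that pruning kicks in, you can inspect the query plan; the scan node should show the pushed-down filters (a sketch, since plan output varies across Spark versions):

EXPLAIN
SELECT count(*)
FROM logs
WHERE level = 'ERROR' AND
    event_ts >= DATE '2023-12-25' AND
    event_ts < DATE '2023-12-26'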

Not requiring extra filters on a duplicative event_date column makes Iceberg tables easier for readers to work with, but there is a further benefit: queries are not tied to a specific set of partition columns. As a result, table partitioning can change, and a table can manage data in multiple partition layouts.

Updating table partitioning

To update a table’s partition structure, run an ALTER TABLE command with ADD or DROP PARTITION FIELD. For example, these two commands remove partitioning by level and increase the granularity of the time partition to hourly:

-- Remove level from table partitioning
ALTER TABLE logs DROP PARTITION FIELD level
-- Add hourly partitioning on event time
ALTER TABLE logs ADD PARTITION FIELD hours(event_ts)

Updating the partition configuration for a table is solely a metadata operation; it doesn’t change existing data. Historical data files stay in the old partition structure unless you choose to rewrite them. New data is written into the new partition structure automatically.

Also, note that the change above added a partition field, hours(event_ts), without removing days(event_ts). This is recommended when moving to finer-granularity time partitions because it preserves the older partition data in metadata tables. Because hourly partitions align with daily partition boundaries, keeping the daily partition values adds very little overhead.
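
After the change, both layouts are visible side by side in the table's partitions metadata table (a sketch; the column set varies slightly across Iceberg versions, and you may need to qualify the table name with your catalog and database):

-- Each row shows a partition tuple, the spec it was written with,
-- and how much data it holds
SELECT partition, spec_id, record_count, file_count
FROM logs.partitions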