Querying an Iceberg Table

BASICS

This recipe demonstrates simple queries with Iceberg tables.

Running queries in Apache Spark

Spark supports two interfaces to query a table: SQL and DataFrames. Tables are loaded by name in both interfaces, using the current session to fill in missing information. For instance, if the table name (or identifier) is a single part, like nyc_taxi_yellow, Spark looks up the table in the current session’s catalog and namespace.

SHOW CURRENT NAMESPACE
-- sandbox  examples
SELECT count(*) FROM nyc_taxi_yellow
-- 2853020

The query above counts the rows in sandbox.examples.nyc_taxi_yellow. You can change the current catalog and namespace with the USE command.
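For instance, picking up the sandbox catalog and examples namespace from the output above, a single USE command sets both; a sketch:

```sql
-- Set the current catalog and namespace in one command
USE sandbox.examples;

-- Unqualified names now resolve within sandbox.examples
SELECT count(*) FROM nyc_taxi_yellow;
```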

If a table identifier has more than one part (separated by .), Spark assumes it is a full table name and won’t add the current namespace to it. If the first part of the identifier is a catalog name, Spark uses that catalog to load the table; otherwise, it uses the session’s current catalog.
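To make this concrete, here is how the same table resolves with three- and two-part identifiers, reusing the sandbox catalog and examples namespace from the earlier example:

```sql
-- Three parts: catalog.namespace.table
SELECT count(*) FROM sandbox.examples.nyc_taxi_yellow;

-- Two parts: the first part is not a catalog name, so Spark
-- treats it as namespace.table in the session's current catalog
SELECT count(*) FROM examples.nyc_taxi_yellow;
```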

To run the same query in PySpark, call spark.table to load the table as a DataFrame, then call the count action to fetch its number of rows.

spark.table('nyc_taxi_yellow').count()
# 2853020

You can also use spark.sql to create a DataFrame from a SQL query. If the SQL is a SELECT query (not a command like ALTER or MERGE), the result is a DataFrame that you can refine further or run with an action method like show or collect.
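As a sketch of that pattern, assuming the same running SparkSession as above (the vendor_id column is assumed here purely for illustration):

```python
# Build a DataFrame from a SQL query; nothing executes yet
df = spark.sql("SELECT vendor_id, count(*) AS trips "
               "FROM nyc_taxi_yellow GROUP BY vendor_id")

# Refine further with the DataFrame API, then trigger
# execution with an action
df.orderBy('trips', ascending=False).show()
```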

Although the rest of this recipe uses the SQL interface, everything can be run using DataFrames.

Filtering Iceberg tables

When working with Hive tables and other Hive-like formats, it’s important to add filters on the table’s partition columns; otherwise, queries scan the entire table.

Iceberg doesn’t require extra filters on special partition columns. Instead, it keeps track of the relationship between a table’s columns and its layout. For example, if a table is partitioned by days(event_ts), a filter on event_ts is automatically used to make the query more efficient. Filters are used first to select partitions and then to further prune data files based on column value ranges, which are also stored in Iceberg metadata.
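As a sketch, assuming a hypothetical table examples.events partitioned by days(event_ts), a plain predicate on the source column is enough for Iceberg to prune both partitions and data files:

```sql
-- No special partition column needed; Iceberg maps this predicate
-- onto the days(event_ts) partition layout and also prunes data
-- files using the column value ranges stored in its metadata
SELECT count(*)
FROM examples.events
WHERE event_ts >= TIMESTAMP '2024-03-01 00:00:00'
  AND event_ts <  TIMESTAMP '2024-03-02 00:00:00';
```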

Certain queries may still scan every file in the table, but only when the filters match every file, not because the user neglected to add a partition filter.