BASICS
This recipe demonstrates simple queries with Iceberg tables.
Running queries in Apache Spark
Spark supports two interfaces to query a table: SQL and DataFrames. Tables are loaded by name in both interfaces, using the current session to fill in missing information. For instance, if the table name (or identifier) is a single part, like nyc_taxi_yellow, then Spark looks up the table in the current session’s catalog and namespace.
SHOW CURRENT NAMESPACE
-- sandbox examples

SELECT count(*) FROM nyc_taxi_yellow
-- 2853020
The query above counts the rows in sandbox.examples.nyc_taxi_yellow. You can change the current catalog and namespace with the USE command.
If a table identifier has more than one part (separated by .), Spark assumes it is a full table name and won’t add the current namespace to it. If the first part of the identifier is a catalog name, Spark uses that catalog to load the table; otherwise, it uses the session’s current catalog.
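The resolution rules above can be sketched with a pair of equivalent queries; the fully qualified form works from any session, while the short form depends on the current catalog and namespace set by USE:

```sql
-- Fully qualified: catalog.namespace.table resolves the same way in any session
SELECT count(*) FROM sandbox.examples.nyc_taxi_yellow;

-- Set the session's current catalog and namespace, then use the short name
USE sandbox.examples;
SELECT count(*) FROM nyc_taxi_yellow;
```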
To run the same query in PySpark, call spark.table to load the table as a DataFrame, then the count action to fetch the number of rows in it.
spark.table('nyc_taxi_yellow').count()
# 2853020
You can also use spark.sql to create a DataFrame from a SQL query. If the SQL is a SELECT query (not a command like ALTER or MERGE), the result is a DataFrame that you can refine further or run using an action method like show or collect.
Although the rest of this recipe uses the SQL interface, everything can be run using DataFrames.
Filtering Iceberg tables
When working with Hive tables and other Hive-like formats, it’s important to add filters on the table’s partition columns; otherwise, queries scan the entire table.
Iceberg doesn’t require additional filters on special partition columns. Instead, it keeps track of the relationship between a table’s columns and its layout. For example, if a table is partitioned by days(event_ts), Iceberg uses filters on event_ts automatically to make the query more efficient. Filters are used to select partitions and then to further filter data files based on column value ranges, which are also stored in Iceberg metadata.
Certain queries may still lead to a scan of every file in the table, but only when filters match every file, not because the user neglected to add a partition filter.
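To make the pruning concrete, here is a hypothetical query against a table partitioned by days(event_ts); the table name and date bounds are illustrative:

```sql
-- Assuming the table is partitioned by days(event_ts). Iceberg uses this
-- timestamp filter first to select matching daily partitions, then to skip
-- data files whose event_ts min/max ranges fall outside the bounds.
SELECT count(*)
FROM examples.events
WHERE event_ts >= TIMESTAMP '2021-06-01 00:00:00'
  AND event_ts <  TIMESTAMP '2021-07-01 00:00:00';
```

No separate filter on a partition column is needed; the filter on event_ts itself drives both partition and file pruning.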