This recipe demonstrates simple queries with Iceberg tables.
Running queries in Apache Spark
Spark supports two interfaces for querying a table: SQL and DataFrames. In both interfaces, tables are loaded by name, using the current session to fill in missing information. For instance, if the table name (or identifier) is a single part, like nyc_taxi_yellow, Spark looks up the table in the current session's catalog and namespace.
```sql
SHOW CURRENT NAMESPACE
-- sandbox examples

SELECT count(*) FROM nyc_taxi_yellow
```
The query above counts the rows in sandbox.examples.nyc_taxi_yellow. You can change the current catalog and namespace with the USE command.
If a table identifier has more than one part (separated by .), Spark assumes it is a full table name and won't add the current namespace to it. If the first part of the identifier is a catalog name, Spark uses that catalog to load the table; otherwise, it uses the session's current catalog.
To run the same query in PySpark, call spark.table to load the table as a DataFrame, then call the count action to fetch the number of rows in it.
You can also use spark.sql to create a DataFrame from a SQL query. If the SQL is a SELECT query (not a command like MERGE), the result is a DataFrame that you can refine further or run using an action method like count.
Although the rest of this recipe uses the SQL interface, everything can be run using DataFrames.
Filtering Iceberg tables
When working with Hive tables and other Hive-like formats, it's important to add filters on the table's partition columns; otherwise, queries scan the entire table.
Iceberg doesn't require additional filters on special partition columns. Instead, it keeps track of the relationship between a table's columns and its layout. For example, if a table is partitioned by days(event_ts), Iceberg automatically uses filters on event_ts to make the query more efficient. Filters select partitions and then further filter data files based on column value ranges, which are also stored in Iceberg metadata.
Certain queries may still lead to a scan of every file in the table, but only when filters match every file, not because the user neglected to add a partition filter.