GETTING STARTED
The Apache Iceberg REST catalog protocol is a standard API for interacting with any Iceberg catalog. The REST catalog client is the recommended way to connect to a catalog: it supports newer catalog features, such as fine-grained deconfliction and multi-table commits, and offers the broadest support across languages and commercial query engines.
Iceberg includes a built-in REST catalog client. To use the built-in client, set the catalog type property to rest. For more information on catalog configuration, read the Catalog background article.
Here’s a full example that connects Spark to a REST catalog (hosted by Tabular) with the warehouse name sandbox:
# conf/spark-defaults.conf
# Create catalog sandbox that uses Iceberg's Spark catalog implementation
spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog
# Configure sandbox to use the REST catalog client
spark.sql.catalog.sandbox.type=rest
spark.sql.catalog.sandbox.uri=https://api.tabular.io/ws
# Select the warehouse and set a credential for OAuth2 authentication
spark.sql.catalog.sandbox.warehouse=sandbox
spark.sql.catalog.sandbox.credential=env:REST_CREDENTIAL
# Add optional defaults for database and catalog
spark.sql.catalog.sandbox.default-namespace=examples
spark.sql.defaultCatalog=sandbox
# Add Iceberg SQL extensions
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
Note that in this example, the credential value is kept private and out of the config file by loading it from an environment variable, REST_CREDENTIAL.
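The same settings can also be applied programmatically. Below is a minimal Python sketch, assuming pyspark is installed and REST_CREDENTIAL is exported in the environment; the catalog name, warehouse, and endpoint mirror the example above, and the session-building lines are commented out because they require a live Spark install and catalog.

```python
import os

# The same catalog settings as conf/spark-defaults.conf, expressed as a dict.
# The credential is read from the environment, never hard-coded.
conf = {
    "spark.sql.catalog.sandbox": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.sandbox.type": "rest",
    "spark.sql.catalog.sandbox.uri": "https://api.tabular.io/ws",
    "spark.sql.catalog.sandbox.warehouse": "sandbox",
    "spark.sql.catalog.sandbox.credential": os.environ.get("REST_CREDENTIAL", ""),
    "spark.sql.catalog.sandbox.default-namespace": "examples",
    "spark.sql.defaultCatalog": "sandbox",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
}

# Building the session needs a live Spark install, so it is commented out here:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("iceberg-rest")
# for key, value in conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```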
REST catalog OAuth2 configuration
An important feature of the REST catalog protocol is that it supports authentication schemes to pass the caller’s identity to the catalog. That enables the catalog to make authorization decisions, such as failing if the caller does not have permission to read a table. The REST protocol also supports ways to authorize the client to read a table’s files in AWS S3 for table sharing.
Authorization is the responsibility of the REST catalog service, but callers need to be able to pass their identity via the built-in client. The client implementation supports OAuth2 and AWS’s SigV4 schemes. OAuth2 is configured by the following catalog config properties:
Property | Description
---|---
credential | A key and secret pair separated by : (the key is optional)
token | A bearer token passed in the Authorization header
scope | Additional OAuth2 scopes; catalog is always included
The first two properties are the primary way to pass identity. If a token is set, HTTP requests use the value as a bearer token in the HTTP Authorization header. If credential is used, then the key and secret are used to fetch a token using the OAuth2 client credentials flow, and the resulting token is used as the bearer token for subsequent requests.
The REST client respects token expiration and attempts to refresh tokens before they expire. For more information on OAuth2 endpoints and requests, refer to the REST catalog documentation (OpenAPI spec).
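To illustrate the client credentials flow described above, here is a hedged Python sketch of the form body a client would POST to the catalog's OAuth2 token endpoint (/v1/oauth2/tokens in the Iceberg REST OpenAPI spec). The field names follow the standard OAuth2 client credentials grant; treat this as a sketch of the exchange, not the built-in client's actual implementation.

```python
def client_credentials_form(credential: str, scope: str = "catalog") -> dict:
    """Build the form body for an OAuth2 client credentials token request.

    `credential` has the same shape as the catalog property: "key:secret",
    where the key part is optional (a bare secret is also accepted).
    """
    key, sep, secret = credential.partition(":")
    if not sep:
        # No colon: the whole value is the secret and there is no client id.
        key, secret = "", credential
    return {
        "grant_type": "client_credentials",
        "client_id": key,
        "client_secret": secret,
        "scope": scope,  # "catalog" is always requested
    }

# A client would POST this form to <catalog uri>/v1/oauth2/tokens and use the
# returned access_token as the bearer token on subsequent requests.
```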
Checking connectivity
An easy way to test your configuration is to use spark-sql or pyspark. You can learn more about configuring Spark in the next recipe. For quick validation, use --conf arguments to pass the options:
./bin/spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.sandbox.type=rest \
  --conf spark.sql.catalog.sandbox.uri=https://api.tabular.io/ws \
--conf spark.sql.catalog.sandbox.credential=env:REST_CREDENTIAL \
--conf spark.sql.catalog.sandbox.warehouse=sandbox \
--conf spark.sql.catalog.sandbox.default-namespace=examples \
--conf spark.sql.defaultCatalog=sandbox
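The --conf flags above are just the spark-defaults.conf properties passed on the command line. A small Python sketch of that mapping, using a couple of the properties from the example (the helper name is hypothetical):

```python
def to_conf_flags(props: dict) -> list:
    """Turn Spark properties into the equivalent spark-sql --conf arguments."""
    flags = []
    for key, value in sorted(props.items()):
        flags.append("--conf")
        flags.append(f"{key}={value}")
    return flags

# Each property becomes a "--conf key=value" pair on the spark-sql command line.
flags = to_conf_flags({
    "spark.sql.catalog.sandbox.type": "rest",
    "spark.sql.defaultCatalog": "sandbox",
})
```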
Once the Spark SQL or PySpark REPL is running, you can run simple SQL commands to check connectivity:
SHOW TABLES;
-- backblaze_drive_stats
-- nyc_taxi_locations
-- nyc_taxi_yellow
SELECT * FROM nyc_taxi_yellow LIMIT 10;
-- ...