Connecting to a REST Catalog


The Apache Iceberg REST catalog protocol is a standard API for interacting with any Iceberg catalog. The REST catalog client is the recommended way to connect to a catalog because it supports many newer catalog features, such as fine-grained deconfliction and multi-table commits, and has the broadest support across languages and commercial query engines.

Iceberg includes a built-in REST catalog client. To use the built-in client, set the catalog type property to rest. For more information on catalog configuration, read the Catalog background article.

Here’s a full example that connects Spark to a REST catalog (hosted by Tabular) with warehouse name sandbox:

# conf/spark-defaults.conf
# Create catalog sandbox that uses Iceberg's Spark catalog implementation
# Configure sandbox to use the REST catalog client
# Select the warehouse and set a credential for OAuth2 authentication
# Add optional defaults for database and catalog
# Add Iceberg SQL extensions

Note that in this example, the credential value is kept private and out of the config file by loading it from an environment variable, REST_CREDENTIAL.
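As a rough sketch, the settings described by the comments above can be kept in a Python dictionary and rendered as spark-defaults.conf lines. The property names and values mirror the --conf flags used in the connectivity check later in this article. The catalog URI is not given here, so `<REST_CATALOG_URI>` is only a placeholder, and `render_spark_defaults` is a hypothetical helper, not part of Spark or Iceberg.

```python
CATALOG = "sandbox"

settings = {
    # Create catalog sandbox that uses Iceberg's Spark catalog implementation
    f"spark.sql.catalog.{CATALOG}": "org.apache.iceberg.spark.SparkCatalog",
    # Configure sandbox to use the REST catalog client
    f"spark.sql.catalog.{CATALOG}.type": "rest",
    # Placeholder (assumption): replace with your catalog's REST endpoint
    f"spark.sql.catalog.{CATALOG}.uri": "<REST_CATALOG_URI>",
    # Select the warehouse and set a credential for OAuth2 authentication
    f"spark.sql.catalog.{CATALOG}.warehouse": "sandbox",
    f"spark.sql.catalog.{CATALOG}.credential": "env:REST_CREDENTIAL",
    # Add optional defaults for database and catalog
    f"spark.sql.catalog.{CATALOG}.default-namespace": "examples",
    "spark.sql.defaultCatalog": CATALOG,
    # Add Iceberg SQL extensions
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
}


def render_spark_defaults(conf):
    """Render a dict of Spark properties as key=value lines, one per property."""
    return "\n".join(f"{key}={value}" for key, value in conf.items())


print(render_spark_defaults(settings))
```

Keeping the settings in one dictionary makes it easy to reuse the same values for a config file or for command-line flags.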

REST catalog OAuth2 configuration

An important feature of the REST catalog protocol is its support for authentication schemes that pass the caller’s identity to the catalog. This enables the catalog to make authorization decisions, such as rejecting a request when the caller does not have permission to read a table. The REST protocol also supports ways to authorize the client to read a table’s files in AWS S3 for table sharing.

Authorization is the responsibility of the REST catalog service, but callers need to be able to pass their identity via the built-in client. The client implementation supports OAuth2 and AWS’s SigV4 schemes. OAuth2 is configured by the following catalog config properties:

Property     Description
credential   A key and secret pair separated by : (key is optional)
token        A bearer token passed in the Authorization header
scope        Additional OAuth2 scopes; catalog is always included

The first two properties are the primary way to pass identity. If a token is set, HTTP requests use the value as a bearer token in the HTTP Authorization header. If credential is used, then the key and secret are used to fetch a token using the OAuth2 client credentials flow. The resulting token is used as the bearer token for subsequent requests.
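The two paths described above can be sketched in Python. This is an illustration under stated assumptions, not the built-in client's code: the form field names (`grant_type`, `client_id`, `client_secret`, `scope`) are the standard OAuth2 client credentials parameters, while the helper names and the exact way the client assembles the request are hypothetical. No network call is made; the sketch only builds the request body and the header.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode


def parse_credential(credential: str) -> Tuple[Optional[str], str]:
    """Split "key:secret" into (key, secret); a bare "secret" has no key."""
    key, sep, secret = credential.partition(":")
    if not sep:
        return None, credential
    return (key or None), secret


def token_request_body(credential: str, extra_scopes: str = "") -> str:
    """Build a form-encoded body for the OAuth2 client credentials flow."""
    key, secret = parse_credential(credential)
    # Per the property table, "catalog" is always included in the scope.
    scopes = ["catalog"] + [s for s in extra_scopes.split() if s != "catalog"]
    form = {"grant_type": "client_credentials"}
    if key is not None:
        form["client_id"] = key
    form["client_secret"] = secret
    form["scope"] = " ".join(scopes)
    return urlencode(form)


def bearer_header(token: str) -> dict:
    """Header used once a token is available (set directly or fetched)."""
    return {"Authorization": f"Bearer {token}"}


print(token_request_body("abc:xyz"))
print(bearer_header("example-token"))
```

When `token` is configured, only the `bearer_header` step applies. When `credential` is configured, the request body is sent first and the token from the response is then used in the same header.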

The REST client respects token expiration and attempts to refresh tokens before they expire. For more information on OAuth2 endpoints and requests, refer to the REST catalog documentation (OpenAPI spec).
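To make the idea of refreshing before expiry concrete, here is a minimal sketch of the timing logic. The function name and the default of refreshing after 90% of the token lifetime are assumptions chosen for illustration; this article does not state the threshold the built-in client uses.

```python
def seconds_until_refresh(expires_in: float, elapsed: float,
                          refresh_fraction: float = 0.9) -> float:
    """Return how long to wait before refreshing a token.

    expires_in: token lifetime in seconds, as reported by the token response.
    elapsed: seconds since the token was issued.
    refresh_fraction: assumed share of the lifetime after which to refresh.
    A result of 0.0 means the token should be refreshed now.
    """
    refresh_at = expires_in * refresh_fraction
    return max(0.0, refresh_at - elapsed)


# A token with a 100 second lifetime, refreshed at the halfway point.
print(seconds_until_refresh(100, 20, refresh_fraction=0.5))
```

Refreshing at a fraction of the lifetime, rather than at the expiration time itself, leaves a margin so that in-flight requests do not carry an expired token.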

Checking connectivity

An easy way to test your configuration is to use spark-sql or pyspark. You can learn more about configuring Spark in the next recipe. For quick validation, use --conf arguments to pass the options:

./bin/spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.sandbox.type=rest \
--conf spark.sql.catalog.sandbox.uri= \
--conf spark.sql.catalog.sandbox.credential=env:REST_CREDENTIAL \
--conf spark.sql.catalog.sandbox.warehouse=sandbox \
--conf spark.sql.catalog.sandbox.default-namespace=examples \
--conf spark.sql.defaultCatalog=sandbox
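Because the credential is read from the environment, a quick pre-flight check can catch an unset variable before Spark starts. The sketch below scans a dictionary of properties for values that use the `env:` prefix shown in the command above; `missing_env_refs` is a hypothetical helper, not part of Spark or Iceberg.

```python
import os


def missing_env_refs(conf, environ=os.environ):
    """Return names of environment variables referenced as env:NAME but unset."""
    missing = []
    for value in conf.values():
        if value.startswith("env:"):
            name = value[len("env:"):]
            if not environ.get(name):
                missing.append(name)
    return missing


conf = {"spark.sql.catalog.sandbox.credential": "env:REST_CREDENTIAL"}
unset = missing_env_refs(conf)
if unset:
    print("Set these variables before starting Spark:", ", ".join(unset))
```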

Once the Spark SQL or PySpark REPL is running, you can run simple SQL commands to check connectivity:

SHOW TABLES;
-- backblaze_drive_stats
-- nyc_taxi_locations
-- nyc_taxi_yellow

SELECT * FROM nyc_taxi_yellow LIMIT 10;
-- ...