Connecting to a REST Catalog

The Apache Iceberg REST catalog protocol is a standard API for interacting with any Iceberg catalog. The REST catalog client is the recommended way to connect to a catalog: it supports newer catalog features, such as fine-grained deconfliction and multi-table commits, and offers the broadest support across languages and commercial query engines.

Iceberg includes a built-in REST catalog client. To use the built-in client, set the catalog type property to rest. For more information on catalog configuration, read the Catalog background article.

Here’s a full example that connects Spark to a REST catalog (hosted by Tabular) with warehouse name sandbox:

# conf/spark-defaults.conf
# Create catalog sandbox that uses Iceberg's Spark catalog implementation
spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog
# Configure sandbox to use the REST catalog client
spark.sql.catalog.sandbox.type=rest
spark.sql.catalog.sandbox.uri=https://api.tabular.io/ws
# Select the warehouse and set a credential for OAuth2 authentication
spark.sql.catalog.sandbox.warehouse=sandbox
spark.sql.catalog.sandbox.credential=env:REST_CREDENTIAL
# Add optional defaults for database and catalog
spark.sql.catalog.sandbox.default-namespace=examples
spark.sql.defaultCatalog=sandbox
# Add Iceberg SQL extensions
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Note that in this example the credential value is kept private and out of the config file; the env: prefix loads it from an environment variable, REST_CREDENTIAL.
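
If you prefer to configure the connection programmatically, the same properties can be passed when building a Spark session. The following is a minimal PySpark sketch that mirrors the spark-defaults.conf example above; it reads the credential directly from the REST_CREDENTIAL environment variable rather than relying on the env: prefix, and it assumes the Iceberg Spark runtime jar is already on the Spark classpath.

import os
from pyspark.sql import SparkSession

# Read the key:secret credential from the environment (the conf file above
# uses the env: prefix for the same purpose).
credential = os.environ["REST_CREDENTIAL"]

spark = (
    SparkSession.builder
    .appName("rest-catalog-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.sandbox", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.sandbox.type", "rest")
    .config("spark.sql.catalog.sandbox.uri", "https://api.tabular.io/ws")
    .config("spark.sql.catalog.sandbox.credential", credential)
    .config("spark.sql.catalog.sandbox.warehouse", "sandbox")
    .config("spark.sql.catalog.sandbox.default-namespace", "examples")
    .config("spark.sql.defaultCatalog", "sandbox")
    .getOrCreate()
)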

REST catalog OAuth2 configuration

An important feature of the REST catalog protocol is that it supports authentication schemes to pass the caller’s identity to the catalog. That enables the catalog to make authorization decisions, such as failing if the caller does not have permission to read a table. The REST protocol also supports ways to authorize the client to read a table’s files in AWS S3 for table sharing.

Authorization is the responsibility of the REST catalog service, but callers need to be able to pass their identity via the built-in client. The client implementation supports OAuth2 and AWS’s SigV4 schemes. OAuth2 is configured by the following catalog config properties:

Property      Description
credential    A key and secret pair separated by : (key is optional)
token         A bearer token passed in the Authorization header
scope         Additional OAuth2 scopes; catalog is always included

The first two properties are the primary way to pass identity. If a token is set, HTTP requests use the value as a bearer token in the HTTP Authorization header. If credential is used, then the key and secret are used to fetch a token using the OAuth2 client credentials flow. The resulting token is used as the bearer token for subsequent requests.
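
For example, the sandbox catalog above could be configured with a pre-issued bearer token instead of a credential (the REST_TOKEN environment variable and the extra_scope value below are placeholders, not values from the earlier example):

# Alternative to credential: pass a pre-issued bearer token
spark.sql.catalog.sandbox.token=env:REST_TOKEN
# Optionally request additional OAuth2 scopes; catalog is always included
spark.sql.catalog.sandbox.scope=extra_scope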

The REST client respects token expiration and attempts to refresh tokens before they expire. For more information on OAuth2 endpoints and requests, refer to the REST catalog documentation (OpenAPI spec).
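
To make the exchange concrete, the following sketch shows roughly what the built-in client does when credential is set, using the tokens endpoint defined in the REST OpenAPI spec. The base URI is the Tabular endpoint from the example above, and the snippet assumes the credential is a key:secret pair; it is an illustration, not the client implementation itself.

import os
import requests

# Split the key:secret pair stored in the environment variable.
key, secret = os.environ["REST_CREDENTIAL"].split(":", 1)

# OAuth2 client credentials flow: POST to the catalog's tokens endpoint.
response = requests.post(
    "https://api.tabular.io/ws/v1/oauth2/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": key,
        "client_secret": secret,
        "scope": "catalog",
    },
)
response.raise_for_status()
token = response.json()["access_token"]

# Subsequent catalog requests send the token as:
#   Authorization: Bearer <token>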

Checking connectivity

An easy way to test your configuration is to use spark-sql or pyspark. You can learn more about configuring Spark in the next recipe. For quick validation, use --conf arguments to pass the options:

./bin/spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.sandbox.type=rest \
--conf spark.sql.catalog.sandbox.uri=https://api.tabular.io/ws \
--conf spark.sql.catalog.sandbox.credential=env:REST_CREDENTIAL \
--conf spark.sql.catalog.sandbox.warehouse=sandbox \
--conf spark.sql.catalog.sandbox.default-namespace=examples \
--conf spark.sql.defaultCatalog=sandbox

Once the Spark SQL or PySpark REPL is running, you can run simple SQL commands to check connectivity:

SHOW TABLES;
-- backblaze_drive_stats
-- nyc_taxi_locations
-- nyc_taxi_yellow

SELECT * FROM nyc_taxi_yellow LIMIT 10;  
-- ...
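
If you started pyspark instead of spark-sql, the same checks can be run through the SparkSession; a minimal sketch, assuming the table names match the listing above:

spark.sql("SHOW TABLES").show()
spark.table("nyc_taxi_yellow").limit(10).show()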