Iceberg’s REST Catalog: A Spark Demo


Earlier this year, we released a blog post containing a Docker Compose configuration that lets you easily get Iceberg and Spark up and running.
The environment even provides a Jupyter notebook server with a number of fully runnable example notebooks that walk through some of Iceberg’s biggest features. With the release
of the Iceberg REST catalog, we’ve updated that example (which lives in the tabular-io/docker-spark-iceberg
repo) to use the REST catalog. The easiest way to try the new setup is to clone the repo and run docker compose up. This blog post, however,
walks through what actually changed and how the environment was reconfigured to use a REST catalog.

REST Catalog

Iceberg clients ship with support for a number of different catalog implementations. Here’s a list of just some of them:

  • JDBC catalog
  • Hive catalog
  • Nessie catalog
  • Hadoop catalog
  • Glue catalog
  • DynamoDB catalog

Although the wide array of Iceberg catalog implementations has allowed Iceberg support to grow over the years, the new REST catalog provides a much greater level of connectivity
to different clients and applications. Furthermore, the REST catalog decreases commit conflicts by using change-based commits, and it allows the catalog service to handle differing
versions of Iceberg clients.

These are a few of the reasons the Iceberg community developed the REST catalog implementation. The
REST catalog specification lets a server own all of the
catalog implementation details while exposing a predictable REST API for Iceberg clients to connect to. Standardizing on a REST API makes the catalog accessible
to more applications and makes it much easier to build Iceberg clients in other languages.
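To give a sense of that surface area, here are a few of the routes defined by the REST catalog OpenAPI specification in the Iceberg repo. This is a paraphrased sketch, not the authoritative list; consult the spec for the full set of routes and request/response schemas:

```
GET  /v1/config                                   # catalog configuration
GET  /v1/namespaces                               # list namespaces
POST /v1/namespaces                               # create a namespace
GET  /v1/namespaces/{namespace}/tables            # list tables in a namespace
POST /v1/namespaces/{namespace}/tables            # create a table
GET  /v1/namespaces/{namespace}/tables/{table}    # load table metadata
POST /v1/namespaces/{namespace}/tables/{table}    # commit table updates
```

Because every client speaks this same protocol, a server can swap out its backing catalog implementation without any client-side changes.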

Update to the Demo Environment

The latest tabulario/spark-iceberg image uses a REST catalog and a new spark-defaults.conf. You can run docker compose pull spark-iceberg to
make sure you have the latest image.

The original Docker Compose environment used a Postgres-backed JDBC catalog. The new setup uses the Iceberg REST server provided by our tabulario/iceberg-rest:0.2.0 image.
In the repo, you’ll find the spark-defaults.conf updated to reflect this change.


- spark.sql.catalog.demo.catalog-impl    org.apache.iceberg.jdbc.JdbcCatalog
- spark.sql.catalog.demo.uri             jdbc:postgresql://postgres:5432/demo_catalog
- spark.sql.catalog.demo.jdbc.user       admin
- spark.sql.catalog.demo.jdbc.password   password
+ spark.sql.catalog.demo.catalog-impl    org.apache.iceberg.rest.RESTCatalog
+ spark.sql.catalog.demo.uri             http://rest:8181
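For context, the rest of the demo catalog’s configuration in spark-defaults.conf looks roughly like this. Property names here reflect the repo at the time of writing, and the MinIO endpoint and warehouse path are demo-specific values:

```
spark.sql.catalog.demo                 org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.catalog-impl    org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.demo.uri             http://rest:8181
spark.sql.catalog.demo.io-impl         org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.demo.s3.endpoint     http://minio:9000
spark.sql.catalog.demo.warehouse       s3a://warehouse/wh/
```

Note that only the catalog-impl and uri lines changed in the migration; the rest of the Spark-side setup is untouched, which is exactly the point of the REST catalog.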

The iceberg-rest Docker Image

As previously described, the great thing about the REST catalog is that the catalog implementation details are handled by the server. You can
see that in action here: the iceberg-rest image simply exposes a REST interface for Spark to connect
to. Spark doesn’t need to be concerned with connecting to the actual datastore, which can be anything from a Hive metastore to a MySQL database.
By default, this image uses a SQLite database, which is great for tests and small demos.

However, using something more powerful, such as a DynamoDB catalog, is just as easy!

  rest:
    image: tabulario/iceberg-rest:0.2.0
    ports:
      - 8181:8181
    environment:
      - CATALOG_WAREHOUSE=s3a://warehouse/wh/
      - CATALOG_CATALOG__IMPL=org.apache.iceberg.aws.dynamodb.DynamoDbCatalog
      - CATALOG_DYNAMODB_TABLE__NAME=demo_warehouse
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
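The CATALOG_* environment variables are translated by the image into Iceberg catalog properties: the prefix is stripped, the name is lowercased, a double underscore becomes a hyphen, and a single underscore becomes a dot. This mapping is based on how the image is documented to behave; double-check the tabulario/iceberg-rest repo if you depend on it. For example:

```
CATALOG_IO__IMPL              →  io-impl
CATALOG_DYNAMODB_TABLE__NAME  →  dynamodb.table-name
CATALOG_S3_ENDPOINT           →  s3.endpoint
```

This convention means any Iceberg catalog property can be passed to the server without changes to the image itself.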

What’s Next?

The REST catalog is a big achievement for Iceberg and will contribute toward the goal of making Iceberg the open table format supported by all query
engines and frameworks. The catalog is a core component of an Iceberg-backed data warehouse, and making it accessible through a REST API enables integration
of Iceberg into the wide ecosystem of data tools that a modern organization inevitably adopts.