Earlier this year, we released a blog post containing a docker compose configuration that allows you to easily get Iceberg and Spark up and running.
It even provides a Jupyter notebook server with a number of fully-runnable example notebooks and walks through some of Iceberg’s biggest features. With the release
of the Iceberg REST catalog, we’ve decided to update that example (which exists in this tabular-io/docker-spark-iceberg
repo) to use the REST catalog. The easiest way to begin using this new setup containing the REST catalog is to clone the repo and run
docker compose up. This blog post, however,
aims to walk through what’s actually been changed and how the environment has been reconfigured to use a REST catalog.
Iceberg clients ship with support for a number of different catalog implementations. Here’s a list of just some of them:
- JDBC catalog
- Hive catalog
- Nessie Catalog
- Hadoop catalog
- Glue catalog
- DynamoDB catalog
Although the wide array of Iceberg catalog implementations has allowed Iceberg support to grow over the years, the new REST catalog provides a much greater level of connectivity to different clients and applications. Furthermore, the REST catalog decreases commit conflicts by using change-based commits, and it allows the catalog server to handle different versions of Iceberg clients.
These are a few of the reasons the Iceberg community developed the REST catalog implementation. The REST catalog specification allows a server to own all of the catalog implementation details while exposing a predictable REST API for Iceberg clients to connect to. Standardizing on a REST API also makes the catalog accessible to more applications and makes it much easier to build Iceberg clients in other languages.
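To make the "predictable REST API" point concrete, here is a minimal sketch of how a client addresses catalog resources under the Iceberg REST spec's route structure. The base URL and names are illustrative; the `/v1/namespaces/{namespace}/tables/{table}` route shape and the use of the percent-encoded unit separator (0x1F) for multi-level namespaces come from the REST catalog OpenAPI specification.

```python
# Sketch: building Iceberg REST catalog resource URLs.
# Multi-level namespaces are joined with the unit separator (0x1F)
# and percent-encoded, per the REST catalog spec.
from urllib.parse import quote

def table_route(base: str, namespace: list[str], table: str) -> str:
    """Return the URL for a table resource (GET/POST/DELETE targets)."""
    ns = quote("\x1f".join(namespace), safe="")
    return f"{base}/v1/namespaces/{ns}/tables/{quote(table, safe='')}"

print(table_route("http://rest:8181", ["nyc"], "taxis"))
# http://rest:8181/v1/namespaces/nyc/tables/taxis
```

Because every backend exposes this same surface, a client that can build these routes can talk to any conforming catalog server, regardless of what sits behind it.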
Update to the Demo Environment

The original docker compose environment used a Postgres-backed JDBC catalog. The new setup uses the Iceberg REST server provided by our iceberg-rest image. The tabulario/spark-iceberg image now uses the REST catalog and a new spark-defaults.conf; you can run docker compose pull spark-iceberg to make sure you have the latest image.

In the repo, you'll find the updated spark configuration that reflects this change.
```diff
- spark.sql.catalog.demo.catalog-impl   org.apache.iceberg.jdbc.JdbcCatalog
- spark.sql.catalog.demo.uri            jdbc:postgresql://postgres:5432/demo_catalog
- spark.sql.catalog.demo.jdbc.user      admin
- spark.sql.catalog.demo.jdbc.password  password
+ spark.sql.catalog.demo.catalog-impl   org.apache.iceberg.rest.RESTCatalog
+ spark.sql.catalog.demo.uri            http://rest:8181
```
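After applying that change, the catalog section of spark-defaults.conf would look roughly like the sketch below. The io-impl and s3.endpoint lines are assumptions based on the MinIO-backed demo environment rather than part of the diff itself:

```
spark.sql.catalog.demo                 org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.catalog-impl    org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.demo.uri             http://rest:8181
spark.sql.catalog.demo.io-impl         org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.demo.s3.endpoint     http://minio:9000
```

Note how little Spark needs to know: just the catalog implementation class and the URI of the REST server. All of the backend details live on the server side.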
iceberg-rest Docker Image
As previously described, the great thing about the REST catalog is that the catalog implementation details are largely handled by the server. That can be seen in action here, where the iceberg-rest image simply exposes a REST interface for Spark to connect to. Spark doesn't need to be concerned with connecting to the actual datastore, which can be anything from a Hive metastore to a MySQL database. By default, this image uses a SQLite database, which is great for tests and small demos.
However, using something more powerful, such as a
DynamoDB catalog, is just as easy!
```yaml
rest:
  image: tabulario/iceberg-rest:0.2.0
  ports:
    - 8181:8181
  environment:
    - CATALOG_WAREHOUSE=s3a://warehouse/wh/
    - CATALOG_CATALOG__IMPL=org.apache.iceberg.aws.dynamodb.DynamoDbCatalog
    - CATALOG_DYNAMODB_TABLE__NAME=demo_warehouse
    - CATALOG_S3_ENDPOINT=http://minio:9000
```
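The environment variables above appear to follow a simple naming convention: a `CATALOG_` prefix marks a variable as a catalog property, a double underscore maps to a hyphen, and a single underscore maps to a dot. The sketch below illustrates that assumed translation; it is an inference from the examples in this post, not code from the iceberg-rest image itself.

```python
# Sketch of the assumed env-var-to-property convention used by the
# iceberg-rest image: strip the CATALOG_ prefix, then "__" -> "-",
# "_" -> ".", lowercased.
def env_to_property(name: str) -> str:
    key = name.removeprefix("CATALOG_")
    # Protect double underscores, translate single ones, then restore.
    return key.replace("__", "\x00").replace("_", ".").replace("\x00", "-").lower()

for var in ["CATALOG_WAREHOUSE", "CATALOG_S3_ENDPOINT", "CATALOG_DYNAMODB_TABLE__NAME"]:
    print(var, "->", env_to_property(var))
# CATALOG_WAREHOUSE -> warehouse
# CATALOG_S3_ENDPOINT -> s3.endpoint
# CATALOG_DYNAMODB_TABLE__NAME -> dynamodb.table-name
```

This is why swapping backends is just an environment change: each variable becomes an ordinary Iceberg catalog property on the server, with no client-side reconfiguration required.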
The REST catalog is a big achievement for Iceberg and will contribute toward the goal of making Iceberg the open table format supported by all query engines and frameworks. The catalog is a core component of an Iceberg-backed data warehouse, and making it accessible through a REST API enables integration of Iceberg into the wide ecosystem of data tools that a modern organization inevitably adopts.