Earlier this year, we released a blog post containing a docker compose configuration that lets you easily get Iceberg and Spark up and running. It even provides a Jupyter notebook server with a number of fully runnable example notebooks that walk through some of Iceberg's biggest features. With the release of the Iceberg REST catalog, we've decided to update that example (which lives in the tabular-io/docker-spark-iceberg repo) to use the REST catalog. The easiest way to try the new setup is to clone the repo and run docker compose up. This blog post, however, walks through what's actually changed and how the environment has been reconfigured to use a REST catalog.
REST Catalog
Iceberg clients ship with support for a number of different catalog implementations. Here’s a list of just some of them:
- JDBC catalog
- Hive catalog
- Nessie catalog
- Hadoop catalog
- Glue catalog
- DynamoDB catalog
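Each of these is wired up purely through client configuration. As a rough sketch of what that looks like in practice, here's a minimal PySpark session that registers a Hadoop catalog (the catalog name demo and the warehouse path are illustrative, and the Iceberg Spark runtime jar is assumed to be on the classpath, as it is in the demo image); swapping in any other implementation is just a matter of changing these properties:

# Minimal PySpark sketch: selecting an Iceberg catalog implementation via config.
# The catalog name "demo" and the warehouse path are illustrative; the Iceberg
# Spark runtime jar is assumed to be on the classpath (as in the demo image).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("catalog-demo")
    # Register an Iceberg catalog named "demo"
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    # Pick the implementation; "hive", a JDBC catalog, etc. slot in the same way
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")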
Although this wide array of catalog implementations has allowed Iceberg support to grow over the years, the new REST catalog provides a much greater level of connectivity to different clients and applications. Furthermore, the REST catalog decreases commit conflicts by using change-based commits and allows the catalog to handle different versions of Iceberg clients.
These are a few of the reasons why the Iceberg community developed the REST catalog implementation. The REST catalog specification allows a server to own all of the catalog implementation details while exposing a predictable REST API for Iceberg clients to connect to. In addition to making the catalog accessible to more applications, standardizing on a REST API makes it much easier to build Iceberg clients in other languages.
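You can get a feel for how predictable that API is by poking at the demo's REST server directly. Here's a quick sketch using Python's requests library (it assumes the compose stack is running and the rest service is published on localhost:8181):

# Hit the demo REST catalog server directly (assumes the compose stack is up
# and the "rest" service is reachable on localhost:8181).
import requests

base = "http://localhost:8181/v1"

# Every REST catalog client starts by fetching catalog-level configuration
print(requests.get(f"{base}/config").json())

# Listing namespaces uses the same endpoint any Iceberg client would call
print(requests.get(f"{base}/namespaces").json())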
Update to the Demo Environment
The latest tabulario/spark-iceberg image uses a REST catalog and a new spark-defaults.conf. You can run docker compose pull spark-iceberg to make sure you have the latest image.
The original docker compose environment used a Postgres-backed JDBC catalog. The new setup instead uses the Iceberg REST server provided by our tabulario/iceberg-rest:0.2.0 image. In the repo, you'll find the updated Spark configuration that reflects this change.
spark-defaults.conf
- spark.sql.catalog.demo.catalog-impl org.apache.iceberg.jdbc.JdbcCatalog
- spark.sql.catalog.demo.uri jdbc:postgresql://postgres:5432/demo_catalog
- spark.sql.catalog.demo.jdbc.user admin
- spark.sql.catalog.demo.jdbc.password password
+ spark.sql.catalog.demo.catalog-impl org.apache.iceberg.rest.RESTCatalog
+ spark.sql.catalog.demo.uri http://rest:8181
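With those two properties in place, nothing about day-to-day usage changes: reads and writes go through the demo catalog exactly as they did with the JDBC catalog. For example, from a notebook in the demo environment (the nyc.taxis namespace and table names are illustrative):

# Spark SQL works against the REST-backed "demo" catalog just as it did
# against the JDBC catalog; namespace and table names here are illustrative.
from pyspark.sql import SparkSession

# In the demo notebooks a session already exists; getOrCreate() picks up the
# catalog settings from the spark-defaults.conf shown above.
spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.nyc")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.nyc.taxis (
        vendor_id BIGINT,
        trip_distance DOUBLE
    ) USING iceberg
""")
spark.sql("INSERT INTO demo.nyc.taxis VALUES (1, 4.2)")
spark.sql("SELECT * FROM demo.nyc.taxis").show()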
The iceberg-rest Docker Image
As previously described, the great thing about the REST catalog is that the catalog implementation details are handled by the server. You can see that in action here: the iceberg-rest image simply exposes a REST interface for Spark to connect to. Spark doesn't need to be concerned with connecting to the actual datastore, which can be anything from a Hive metastore to a MySQL database. By default, this image uses a SQLite database, which is great for tests and small demos. However, using something more powerful, such as a DynamoDB catalog, is just as easy!
rest:
  image: tabulario/iceberg-rest:0.2.0
  ports:
    - 8181:8181
  environment:
    - CATALOG_WAREHOUSE=s3a://warehouse/wh/
    # catalog-impl selects the backing catalog; io-impl stays on S3FileIO so
    # table data still lands in the MinIO-backed warehouse
    - CATALOG_CATALOG__IMPL=org.apache.iceberg.aws.dynamodb.DynamoDbCatalog
    - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
    - CATALOG_DYNAMODB_TABLE__NAME=demo_warehouse
    - CATALOG_S3_ENDPOINT=http://minio:9000
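One detail worth calling out: the image translates CATALOG_-prefixed environment variables into Iceberg catalog properties, with double underscores becoming dashes and single underscores becoming dots (so CATALOG_DYNAMODB_TABLE__NAME turns into dynamodb.table-name). The following Python sketch illustrates that convention as inferred from the property names above, not the image's exact implementation:

# Illustrative sketch of the CATALOG_* naming convention, inferred from the
# compose file above (not the image's actual code): strip the prefix, map
# "__" to "-", "_" to ".", and lowercase the key.
def catalog_properties(env: dict) -> dict:
    props = {}
    for key, value in env.items():
        if key.startswith("CATALOG_"):
            name = key[len("CATALOG_"):]
            name = name.replace("__", "-").replace("_", ".").lower()
            props[name] = value
    return props

print(catalog_properties({
    "CATALOG_DYNAMODB_TABLE__NAME": "demo_warehouse",
    "CATALOG_CATALOG__IMPL": "org.apache.iceberg.aws.dynamodb.DynamoDbCatalog",
}))
# {'dynamodb.table-name': 'demo_warehouse',
#  'catalog-impl': 'org.apache.iceberg.aws.dynamodb.DynamoDbCatalog'}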
What’s Next?
The REST catalog is a big achievement for Iceberg and will contribute towards the goal of making Iceberg the open table format that’s supported by all query engines and frameworks. The catalog is a core component of an Iceberg-backed data warehouse, and making it accessible through a REST API enables integration of Iceberg into the wide ecosystem of data tools that modern organizations inevitably adopt.