Iceberg's REST Catalog: A Spark Demo

Earlier this year, we released a blog post containing a docker compose configuration that makes it easy to get Iceberg and Spark up and running. It also provides a Jupyter notebook server with a number of fully runnable example notebooks that walk through some of Iceberg’s biggest features. With the release of the Iceberg REST catalog, we’ve updated that example (which lives in the tabular-io/docker-spark-iceberg repo) to use the REST catalog. The easiest way to try the new setup is to clone the repo and run docker compose up. This blog post, however, walks through what has actually changed and how the environment has been reconfigured to use a REST catalog.

REST Catalog

Iceberg clients ship with support for a number of different catalog implementations. Here’s a list of just some of them:

  • JDBC catalog
  • Hive catalog
  • Nessie catalog
  • Hadoop catalog
  • Glue catalog
  • DynamoDB catalog

Although the wide array of Iceberg catalog implementations has allowed Iceberg support to grow over the years, the new REST catalog provides a much greater level of connectivity to different clients and applications. Furthermore, the REST catalog reduces commit conflicts by using change-based commits, and it allows a single catalog to serve different versions of Iceberg clients.

These are a few reasons why the Iceberg community developed the REST catalog implementation. The REST catalog specification allows a server to own all the catalog implementation details, while exposing a predictable REST API for Iceberg clients to connect to. In addition to making the catalog accessible to more languages and applications, standardizing on a REST API makes it much easier to build Iceberg clients in other languages.
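To make the "predictable REST API" concrete, here is a minimal Python sketch of how a client might address the catalog over plain HTTP. The `/v1/config` and `/v1/namespaces/{namespace}/tables` routes come from the REST catalog specification; the helper names `rest_catalog_url` and `list_tables` are hypothetical, and the sketch assumes a server that uses no path prefix and requires no authentication.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def rest_catalog_url(base_uri, *segments):
    """Join path segments onto the catalog's /v1 root, percent-encoding each one."""
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"{base_uri.rstrip('/')}/v1/{path}"


def list_tables(base_uri, namespace):
    """Fetch the table identifiers in a single-level namespace.

    Requires a reachable REST catalog server; assumes no prefix and no auth.
    """
    url = rest_catalog_url(base_uri, "namespaces", namespace, "tables")
    with urlopen(url) as response:
        return json.load(response)["identifiers"]


# Building URLs needs no server at all.
print(rest_catalog_url("http://rest:8181", "config"))
print(rest_catalog_url("http://rest:8181", "namespaces", "nyc", "tables"))
```

Because the whole exchange is just HTTP and JSON, the same two routes can be called from any language with an HTTP client, which is the point the paragraph above makes.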

Update to the Demo Environment

The latest tabulario/spark-iceberg image uses a REST catalog and a new spark-defaults.conf. You can run docker compose pull spark-iceberg to make sure you have the latest image.

The original docker compose environment used a Postgres-backed JDBC catalog. The new setup uses the Iceberg REST server provided by our tabulario/iceberg-rest:0.2.0 image. In the repo, you’ll find the updated Spark configuration that reflects this change.

spark-defaults.conf

- spark.sql.catalog.demo.catalog-impl    org.apache.iceberg.jdbc.JdbcCatalog
- spark.sql.catalog.demo.uri             jdbc:postgresql://postgres:5432/demo_catalog
- spark.sql.catalog.demo.jdbc.user       admin
- spark.sql.catalog.demo.jdbc.password   password
+ spark.sql.catalog.demo.catalog-impl    org.apache.iceberg.rest.RESTCatalog
+ spark.sql.catalog.demo.uri             http://rest:8181
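For readers who configure Spark from code rather than from spark-defaults.conf, the same settings could be expressed roughly as follows. This is a sketch, not the repo's code: `rest_catalog_conf` and `build_session` are hypothetical helpers, the `spark.sql.catalog.demo` entry pointing at `org.apache.iceberg.spark.SparkCatalog` is the usual Iceberg catalog registration and is assumed here because unchanged lines do not appear in the diff, and `build_session` assumes pyspark and the Iceberg Spark runtime jar are available.

```python
def rest_catalog_conf(name="demo", uri="http://rest:8181"):
    """Return the Spark properties that point catalog `name` at a REST catalog."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        # Assumed: registers the catalog with Iceberg's Spark integration.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # The two properties added in the diff above.
        f"{prefix}.catalog-impl": "org.apache.iceberg.rest.RESTCatalog",
        f"{prefix}.uri": uri,
    }


def build_session(conf):
    """Create a SparkSession with the given properties (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the dict helper works without pyspark

    builder = SparkSession.builder.appName("rest-catalog-demo")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


for key, value in rest_catalog_conf().items():
    print(key, value)
```

Note that the JDBC user and password properties have no counterpart here: Spark only needs the URI of the REST server.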

The iceberg-rest Docker Image

As previously described, the great thing about the REST catalog is that many of the catalog implementation details are handled by the server. That can be seen in action here, where the iceberg-rest image simply exposes a REST interface for Spark to connect to. Spark doesn’t need to be concerned with connecting to the actual datastore, which can be anything from a Hive metastore to a MySQL database. By default, this image uses a SQLite database, which is great for tests and small demos.

However, using something more powerful, such as a DynamoDB catalog, is just as easy!

rest:
  image: tabulario/iceberg-rest:0.2.0
  ports:
    - 8181:8181
  environment:
    - CATALOG_WAREHOUSE=s3a://warehouse/wh/
    - CATALOG_CATALOG__IMPL=org.apache.iceberg.aws.dynamodb.DynamoDbCatalog
    - CATALOG_DYNAMODB_TABLE__NAME=demo_warehouse
    - CATALOG_S3_ENDPOINT=http://minio:9000
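The environment variable names above follow a naming convention that the image translates into Iceberg catalog properties. The sketch below illustrates that translation under the assumption that the image strips the `CATALOG_` prefix, turns a double underscore into a dash, turns a single underscore into a dot, and lowercases the result; `env_to_catalog_properties` is a hypothetical helper written for illustration, not the image's actual code.

```python
def env_to_catalog_properties(env, prefix="CATALOG_"):
    """Translate CATALOG_* environment variables into catalog property names.

    Assumed convention: strip the prefix, "__" becomes "-", "_" becomes ".",
    and the name is lowercased. Variables without the prefix are ignored.
    """
    properties = {}
    for key, value in env.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):].replace("__", "-").replace("_", ".").lower()
        properties[name] = value
    return properties


env = {
    "CATALOG_WAREHOUSE": "s3a://warehouse/wh/",
    "CATALOG_DYNAMODB_TABLE__NAME": "demo_warehouse",
    "CATALOG_S3_ENDPOINT": "http://minio:9000",
    "PATH": "/usr/bin",  # not a catalog setting, so it is skipped
}
print(env_to_catalog_properties(env))
# → {'warehouse': 's3a://warehouse/wh/', 'dynamodb.table-name': 'demo_warehouse', 's3.endpoint': 'http://minio:9000'}
```

Under this convention, any catalog property can be set from docker compose without rebuilding the image.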

What’s Next?

The REST catalog is a big achievement for Iceberg and will contribute towards the goal of making Iceberg the open table format that’s supported by all query engines and frameworks. The catalog is a core component of an Iceberg-backed data warehouse, and making it accessible through a REST API enables Iceberg to integrate with the wide ecosystem of data tools that a modern organization inevitably adopts.
