Using Iceberg's S3FileIO Implementation to Store Your Data in MinIO

In a previous post, we covered how to use Docker to get up and running quickly with Iceberg and its feature-rich Spark integration. In that post, we selected the hadoop file-io implementation, mainly because it supports reading from and writing to local files (check out this post to learn more about the FileIO interface). In this blog post, we’ll take one step towards a more typical, modern, cloud-based architecture and switch to Iceberg’s S3 file-io implementation, backed by a MinIO instance, which supports the S3 API.

If you’re not familiar with MinIO, it’s a flexible and performant S3-compatible object store that’s designed to run natively on Kubernetes. To learn more about it, you can head over to their site at min.io!

Adding the MinIO Container

The easiest way to get a MinIO instance is using the official minio/minio image. Here’s what your docker compose file should look like after following the steps in the Docker, Spark, and Iceberg: The Fastest Way to Try Iceberg! post.

version: "3"

services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    depends_on:
      - postgres
    container_name: spark-iceberg
    environment:
      - SPARK_HOME=/opt/spark
      - PYSPARK_PYTHON=/usr/bin/python3.9
      - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/spark/bin
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
    ports:
      - 8888:8888
      - 8080:8080
      - 18080:18080
  postgres:
    image: postgres:13.4-bullseye
    container_name: postgres
    environment:
      - POSTGRES_USER=admin
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=demo_catalog
    volumes:
      - ./postgres/data:/var/lib/postgresql/data

Add the following to include a container with the official minio/minio image.

...
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
...

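Using environment variables, this sets the MinIO root username and password to “admin” and “password”, respectively. Port 9000 exposes the S3 API, and port 9001, set with --console-address, exposes the MinIO web console.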

Creating a Bucket on Startup

Next, we’ll want to use the MinIO CLI to bootstrap our MinIO instance with a bucket. MinIO also offers an official image for the CLI, minio/mc. By defining a simple entrypoint, we’ll also configure the CLI to connect to the MinIO instance and create a bucket which we’ll call ‘warehouse’.

...
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    environment:
      - AWS_ACCESS_KEY_ID=demo
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      exit 0;
      "      
...

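If the bucket already exists, the mc mb step fails, but because the commands are separated by semicolons and the script ends with exit 0, the CLI container still exits cleanly. Note that the mc rm line clears out any objects left in the bucket from a previous run.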
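
If you’d rather check readiness from the host, the wait-and-retry idea behind the until loop in the entrypoint can be sketched in Python. This is only a minimal sketch and not part of the original setup: it assumes MinIO’s liveness endpoint at /minio/health/live is reachable on the published port 9000, and the names minio_is_live and wait_until are made up for illustration.

```python
import http.client
import time
import urllib.error
import urllib.request


def minio_is_live(base_url="http://localhost:9000", timeout=2.0):
    """Return True if MinIO's liveness endpoint answers with HTTP 200.

    Assumption: MinIO is published on localhost:9000, as in the compose file.
    """
    url = f"{base_url}/minio/health/live"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet.
        return False


def wait_until(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out.

    Mirrors the shell loop: try, wait a second, try again.
    """
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False
```

Calling wait_until(minio_is_live) after starting the containers should return True once the server is accepting requests. The sleep parameter is injectable only so the loop can be exercised without real waiting.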

Configuring S3FileIO

The file-io for a catalog can be set and configured through Spark properties. We’ll need to change three properties on the demo catalog to use the S3FileIO implementation and connect it to our MinIO container.

spark.sql.catalog.demo.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.demo.warehouse=s3://warehouse
spark.sql.catalog.demo.s3.endpoint=http://minio:9000
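
In spark-defaults.conf, properties like these are typically written as whitespace-separated key/value lines. As a rough sketch, a small parser can check that a conf file carries all three settings. The helper names parse_spark_defaults and missing_or_wrong are hypothetical, and the parser only handles the whitespace-separated form, not every syntax Spark may accept.

```python
# The three properties from the post, with the values used for the demo catalog.
REQUIRED = {
    "spark.sql.catalog.demo.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.demo.warehouse": "s3://warehouse",
    "spark.sql.catalog.demo.s3.endpoint": "http://minio:9000",
}


def parse_spark_defaults(text):
    """Parse 'key <whitespace> value' lines; later entries override earlier ones."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) == 2:
            props[parts[0]] = parts[1].strip()
    return props


def missing_or_wrong(props, required=REQUIRED):
    """Return the required entries that are absent or have a different value."""
    return {k: v for k, v in required.items() if props.get(k) != v}
```

Inside the spark-iceberg container, running missing_or_wrong(parse_spark_defaults(open("/opt/spark/conf/spark-defaults.conf").read())) should give back an empty dict once all three properties are in place.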

We can append these property changes to spark-defaults.conf in the tabulario/spark-iceberg image by overriding the entrypoint for our spark-iceberg container. Additionally, we’ll need to set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION environment variables so that S3FileIO can authenticate against our MinIO instance; the key and secret match the MinIO root user and password set earlier. The region must be set, but its value doesn’t matter since we’re running locally, so we’ll just use us-east-1.

  spark-iceberg:
    image: tabulario/spark-iceberg
    depends_on:
      - postgres
    container_name: spark-iceberg
    environment:
      - SPARK_HOME=/opt/spark
      - PYSPARK_PYTHON=/usr/bin/python3.9
      - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/spark/bin:/opt/spark/sbin
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
    ports:
      - 8888:8888
      - 8080:8080
      - 18080:18080
    entrypoint: /bin/sh
    command: >
      -c "
      echo \"
      spark.sql.catalog.demo.io-impl         org.apache.iceberg.aws.s3.S3FileIO \n
      spark.sql.catalog.demo.warehouse       s3://warehouse \n
      spark.sql.catalog.demo.s3.endpoint     http://minio:9000 \n
      \" >> /opt/spark/conf/spark-defaults.conf && ./entrypoint.sh notebook
      "      

At this point, the full docker compose file should look like this:

version: "3"

services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    depends_on:
      - postgres
    container_name: spark-iceberg
    environment:
      - SPARK_HOME=/opt/spark
      - PYSPARK_PYTHON=/usr/bin/python3.9
      - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/spark/bin:/opt/spark/sbin
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
    ports:
      - 8888:8888
      - 8080:8080
      - 18080:18080
    entrypoint: /bin/sh
    command: >
      -c "
      echo \"
      spark.sql.catalog.demo.io-impl         org.apache.iceberg.aws.s3.S3FileIO \n
      spark.sql.catalog.demo.warehouse       s3://warehouse \n
      spark.sql.catalog.demo.s3.endpoint     http://minio:9000 \n
      \" >> /opt/spark/conf/spark-defaults.conf && ./entrypoint.sh notebook
      "      
  postgres:
    image: postgres:13.4-bullseye
    container_name: postgres
    environment:
      - POSTGRES_USER=admin
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=demo_catalog
    volumes:
      - ./postgres/data:/var/lib/postgresql/data
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    environment:
      - AWS_ACCESS_KEY_ID=demo
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      exit 0;
      "      

Start it Up!

Finally, we can fire up the containers!

docker-compose up

You can find the MinIO UI at http://localhost:9001, where you should see the ‘warehouse’ bucket. Now you can launch a Spark shell or the notebook server, run any of the example notebooks, and watch the data and metadata appear in the MinIO bucket!
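
Since the mc container sets a public policy on the bucket, you can also peek at its contents without the UI. The sketch below is not part of the original setup and rests on a few assumptions: that the public policy allows anonymous S3 ListObjectsV2 requests, that MinIO is reachable at http://localhost:9000, and that a single page of results is enough. The function names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET


def keys_from_listing(xml_data):
    """Extract object keys from an S3 ListBucketResult XML document (str or bytes)."""
    root = ET.fromstring(xml_data)
    # Match on the local tag name so the S3 XML namespace does not matter.
    return [el.text for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "Key"]


def list_bucket(base_url="http://localhost:9000", bucket="warehouse"):
    """Anonymously list one page of objects in the bucket.

    Assumption: the bucket policy permits unauthenticated listing.
    """
    url = f"{base_url}/{bucket}?list-type=2"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return keys_from_listing(resp.read())
```

After running one of the notebooks, list_bucket() should return the keys of the data and metadata files that Iceberg wrote. A single request returns only the first page of keys, so a large bucket would need pagination on top of this.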
