Demonstrating PyIceberg

blog-image

This is a walkthrough on how to use the PyIceberg CLI. For this walkthrough, we’re going to use the docker-spark-iceberg repository that contains a full stack of Spark with Iceberg support, MinIO as a storage backend and a REST catalog as a catalog. This is a companion piece to our PyIceberg video . It requires docker and docker-compose installed.

First, we’re going to clone the repository and start the stack:

git clone https://github.com/tabular-io/docker-spark-iceberg.git
cd docker-spark-iceberg
docker-compose up

It takes a couple of seconds to start all the services, and next we can open up a web browser to go to the Jupyter notebook UI that comes with the Spark images. This UI is available at http://localhost:8888/ .

We can use the notebook Iceberg - Getting Started.ipynb to create a table that we then can query using PyIceberg. First, we need to create the database nyc:

%%sql
CREATE DATABASE nyc;

And next we’re going to read in a single parquet file, and write it to the table nyc.taxis:

df = spark.read.parquet("/home/iceberg/data/yellow_tripdata_2021-04.parquet")
df.write.saveAsTable("nyc.taxis")

We don’t always want to use a full blown JVM to interact with the Iceberg tables. This is where PyIceberg comes in. Next, we’ll jump into the Spark container, and get PyIceberg installed.

docker exec -t -i `docker ps -q -f "expose=8888"` bash

Now we’re inside of the spark container, and we can install PyIceberg:

pip3 install "pyiceberg[pyarrow]"

In this instance we’re install PyIceberg with PyArrow support, but there are also other backends such as s3fs. For all the latest information, please check the docs .

And next, we can query the rest catalog:

pyiceberg --uri http://rest:8181 list
nyc

Now, we don’t want to provide the --uri all the time, instead we can also create a configuration file:

vi ~/.pyiceberg.yaml

We can set the following yaml:

catalog:
    default:
        uri: http://rest:8181
        s3.endpoint: http://minio:9000

You can escape vi using by pressing :, and then do the command wq, which means write and quite. Now we’re able to query the catalog without having to provide a uri:

pyiceberg list
nyc

To list all the commands, we can just run pyiceberg:

pyiceberg

That gives an overview of all the commands:

Usage: pyiceberg [OPTIONS] COMMAND [ARGS]...

Options:
  --catalog TEXT
  --verbose BOOLEAN
  --output [text|json]
  --uri TEXT
  --credential TEXT
  --help                Show this message and exit.

Commands:
  describe    Describes a namespace xor table
  drop        Operations to drop a namespace or table
  files       Lists all the files of the table
  list        Lists tables or namespaces
  location    Returns the location of the table
  properties  Properties on tables/namespaces
  rename      Renames a table
  schema      Gets the schema of the table
  spec        Returns the partition spec of the table
  uuid        Returns the UUID of the table

Next wel can describe the table that we created in the beginning:

pyiceberg describe nyc.taxis

Which returns:

Table format version  1
Metadata location     s3a://warehouse/wh/nyc/taxis/metadata/00000-d8c5f2af-b445-4814-a7ae-133b07be7a85.metadata.json
Table UUID            265415d3-6d3d-47dd-86f6-ee4cfb6e4ce9
Last Updated          1668444293891
Partition spec        []
Sort order            []
Current schema        Schema, id=0
                      ├── 1: VendorID: optional long
                      ├── 2: tpep_pickup_datetime: optional timestamptz
                      ├── 3: tpep_dropoff_datetime: optional timestamptz
                      ├── 4: passenger_count: optional double
                      ├── 5: trip_distance: optional double
                      ├── 6: RatecodeID: optional double
                      ├── 7: store_and_fwd_flag: optional string
                      ├── 8: PULocationID: optional long
                      ├── 9: DOLocationID: optional long
                      ├── 10: payment_type: optional long
                      ├── 11: fare_amount: optional double
                      ├── 12: extra: optional double
                      ├── 13: mta_tax: optional double
                      ├── 14: tip_amount: optional double
                      ├── 15: tolls_amount: optional double
                      ├── 16: improvement_surcharge: optional double
                      ├── 17: total_amount: optional double
                      ├── 18: congestion_surcharge: optional double
                      └── 19: airport_fee: optional double
Current snapshot      Operation.APPEND: id=8152339288720205593, schema_id=0
Snapshots             Snapshots
                      └── Snapshot 8152339288720205593, schema 0: s3a://warehouse/wh/nyc/taxis/metadata/snap-8152339288720205593-1-bcce7ed6-5396-4285-9892-a1e89ae286f8.avro
Properties            owner                 root
                      created-at            2022-11-14T16:44:46.939799552Z
                      write.format.default  parquet

In the case when you’re scripting things, it is often nice to have the output in json, which is also possible:

pyiceberg --output json describe nyc.taxis

Which returns all the information in json:

{
	"identifier": ["default", "nyc", "taxis"],
	"metadata_location": "s3a://warehouse/wh/nyc/taxis/metadata/00000-d8c5f2af-b445-4814-a7ae-133b07be7a85.metadata.json",
	"metadata": {
		"location": "s3a://warehouse/wh/nyc/taxis",
		"table-uuid": "265415d3-6d3d-47dd-86f6-ee4cfb6e4ce9",
		"last-updated-ms": 1668444293891,
		"last-column-id": 19,
		"schemas": [{
			"type": "struct",
			"fields": [{
				"id": 1,
				"name": "VendorID",
				"type": "long",
				"required": false
			}, {
				"id": 2,
				"name": "tpep_pickup_datetime",
				"type": "timestamptz",
				"required": false
			}, ...],
			"schema-id": 0,
			"identifier-field-ids": []
		}],
		"current-schema-id": 0,
		"partition-specs": [{
			"spec-id": 0,
			"fields": []
		}],
		"default-spec-id": 0,
		"last-partition-id": 999,
		"properties": {
			"owner": "root",
			"created-at": "2022-11-14T16:44:46.939799552Z",
			"write.format.default": "parquet"
		},
		"current-snapshot-id": 8152339288720205593,
		"snapshots": [{
			"snapshot-id": 8152339288720205593,
			"timestamp-ms": 1668444293891,
			"manifest-list": "s3a://warehouse/wh/nyc/taxis/metadata/snap-8152339288720205593-1-bcce7ed6-5396-4285-9892-a1e89ae286f8.avro",
			"summary": {
				"operation": "append",
				"spark.app.id": "local-1668442408867",
				"added-data-files": "1",
				"added-records": "2171187",
				"added-files-size": "34536512",
				"changed-partition-count": "1",
				"total-records": "2171187",
				"total-files-size": "34536512",
				"total-data-files": "1",
				"total-delete-files": "0",
				"total-position-deletes": "0",
				"total-equality-deletes": "0"
			},
			"schema-id": 0
		}],
		"snapshot-log": [{
			"snapshot-id": "8152339288720205593",
			"timestamp-ms": 1668444293891
		}],
		"metadata-log": [],
		"sort-orders": [{
			"order-id": 0,
			"fields": []
		}],
		"default-sort-order-id": 0,
		"refs": {
			"main": {
				"snapshot-id": 8152339288720205593,
				"type": "branch"
			}
		},
		"format-version": 1,
		"schema": {
			"type": "struct",
			"fields": [{
				"id": 1,
				"name": "VendorID",
				"type": "long",
				"required": false
			}, {
				"id": 2,
				"name": "tpep_pickup_datetime",
				"type": "timestamptz",
				"required": false
			} ... ],
			"schema-id": 0,
			"identifier-field-ids": []
		},
		"partition-spec": []
	}
}

Also, it is very easy to accidentally drop the table:

pyiceberg drop table nyc.taxis

And if we try to query the table:

pyiceberg --output json describe nyc.taxis

We can see that it is not available anymore:

{
	"type": "NoSuchTableError",
	"message": "Table or namespace does not exist: nyc.taxis"
}

Hope you’ve learned something about PyIceberg today. If you run into anything, please don’t hesitate to reach out on the #python channel on Slack , or create an issue on Github .