This is a walkthrough on how to use the PyIceberg CLI. For this walkthrough, we’re going to use the docker-spark-iceberg repository that contains a full stack of Spark with Iceberg support, MinIO as a storage backend and a REST catalog as a catalog. This is a companion piece to our PyIceberg video. It requires docker and docker-compose installed.
First, we’re going to clone the repository and start the stack:
git clone https://github.com/tabular-io/docker-spark-iceberg.git
cd docker-spark-iceberg
docker-compose up
It takes a couple of seconds to start all the services, and next we can open up a web browser to go to the Jupyter notebook UI that comes with the Spark images. This UI is available at http://localhost:8888/.
We can use the notebook Iceberg - Getting Started.ipynb
to create a table that we then can query using PyIceberg. First, we need to create the database nyc
:
%%sql
CREATE DATABASE nyc;
And next we’re going to read in a single parquet file, and write it to the table nyc.taxis
:
df = spark.read.parquet("/home/iceberg/data/yellow_tripdata_2021-04.parquet")
df.write.saveAsTable("nyc.taxis")
We don’t always want to use a full blown JVM to interact with the Iceberg tables. This is where PyIceberg comes in. Next, we’ll jump into the Spark container, and get PyIceberg installed.
docker exec -t -i `docker ps -q -f "expose=8888"` bash
Now we’re inside of the spark container, and we can install PyIceberg:
pip3 install "pyiceberg[pyarrow]"
In this instance we’re install PyIceberg with PyArrow support, but there are also other backends such as s3fs
. For all the latest information, please check the docs.
And next, we can query the rest catalog:
pyiceberg --uri http://rest:8181 list
nyc
Now, we don’t want to provide the --uri
all the time, instead we can also create a configuration file:
vi ~/.pyiceberg.yaml
We can set the following yaml:
catalog:
default:
uri: http://rest:8181
s3.endpoint: http://minio:9000
You can escape vi
using by pressing :
, and then do the command wq
, which means write and quite. Now we’re able to query the catalog without having to provide a uri
:
pyiceberg list
nyc
To list all the commands, we can just run pyiceberg
:
pyiceberg
That gives an overview of all the commands:
Usage: pyiceberg [OPTIONS] COMMAND [ARGS]...
Options:
--catalog TEXT
--verbose BOOLEAN
--output [text|json]
--uri TEXT
--credential TEXT
--help Show this message and exit.
Commands:
describe Describes a namespace xor table
drop Operations to drop a namespace or table
files Lists all the files of the table
list Lists tables or namespaces
location Returns the location of the table
properties Properties on tables/namespaces
rename Renames a table
schema Gets the schema of the table
spec Returns the partition spec of the table
uuid Returns the UUID of the table
Next wel can describe the table that we created in the beginning:
pyiceberg describe nyc.taxis
Which returns:
Table format version 1
Metadata location s3a://warehouse/wh/nyc/taxis/metadata/00000-d8c5f2af-b445-4814-a7ae-133b07be7a85.metadata.json
Table UUID 265415d3-6d3d-47dd-86f6-ee4cfb6e4ce9
Last Updated 1668444293891
Partition spec []
Sort order []
Current schema Schema, id=0
├── 1: VendorID: optional long
├── 2: tpep_pickup_datetime: optional timestamptz
├── 3: tpep_dropoff_datetime: optional timestamptz
├── 4: passenger_count: optional double
├── 5: trip_distance: optional double
├── 6: RatecodeID: optional double
├── 7: store_and_fwd_flag: optional string
├── 8: PULocationID: optional long
├── 9: DOLocationID: optional long
├── 10: payment_type: optional long
├── 11: fare_amount: optional double
├── 12: extra: optional double
├── 13: mta_tax: optional double
├── 14: tip_amount: optional double
├── 15: tolls_amount: optional double
├── 16: improvement_surcharge: optional double
├── 17: total_amount: optional double
├── 18: congestion_surcharge: optional double
└── 19: airport_fee: optional double
Current snapshot Operation.APPEND: id=8152339288720205593, schema_id=0
Snapshots Snapshots
└── Snapshot 8152339288720205593, schema 0: s3a://warehouse/wh/nyc/taxis/metadata/snap-8152339288720205593-1-bcce7ed6-5396-4285-9892-a1e89ae286f8.avro
Properties owner root
created-at 2022-11-14T16:44:46.939799552Z
write.format.default parquet
In the case when you’re scripting things, it is often nice to have the output in json, which is also possible:
pyiceberg --output json describe nyc.taxis
Which returns all the information in json:
{
"identifier": ["default", "nyc", "taxis"],
"metadata_location": "s3a://warehouse/wh/nyc/taxis/metadata/00000-d8c5f2af-b445-4814-a7ae-133b07be7a85.metadata.json",
"metadata": {
"location": "s3a://warehouse/wh/nyc/taxis",
"table-uuid": "265415d3-6d3d-47dd-86f6-ee4cfb6e4ce9",
"last-updated-ms": 1668444293891,
"last-column-id": 19,
"schemas": [{
"type": "struct",
"fields": [{
"id": 1,
"name": "VendorID",
"type": "long",
"required": false
}, {
"id": 2,
"name": "tpep_pickup_datetime",
"type": "timestamptz",
"required": false
}, ...],
"schema-id": 0,
"identifier-field-ids": []
}],
"current-schema-id": 0,
"partition-specs": [{
"spec-id": 0,
"fields": []
}],
"default-spec-id": 0,
"last-partition-id": 999,
"properties": {
"owner": "root",
"created-at": "2022-11-14T16:44:46.939799552Z",
"write.format.default": "parquet"
},
"current-snapshot-id": 8152339288720205593,
"snapshots": [{
"snapshot-id": 8152339288720205593,
"timestamp-ms": 1668444293891,
"manifest-list": "s3a://warehouse/wh/nyc/taxis/metadata/snap-8152339288720205593-1-bcce7ed6-5396-4285-9892-a1e89ae286f8.avro",
"summary": {
"operation": "append",
"spark.app.id": "local-1668442408867",
"added-data-files": "1",
"added-records": "2171187",
"added-files-size": "34536512",
"changed-partition-count": "1",
"total-records": "2171187",
"total-files-size": "34536512",
"total-data-files": "1",
"total-delete-files": "0",
"total-position-deletes": "0",
"total-equality-deletes": "0"
},
"schema-id": 0
}],
"snapshot-log": [{
"snapshot-id": "8152339288720205593",
"timestamp-ms": 1668444293891
}],
"metadata-log": [],
"sort-orders": [{
"order-id": 0,
"fields": []
}],
"default-sort-order-id": 0,
"refs": {
"main": {
"snapshot-id": 8152339288720205593,
"type": "branch"
}
},
"format-version": 1,
"schema": {
"type": "struct",
"fields": [{
"id": 1,
"name": "VendorID",
"type": "long",
"required": false
}, {
"id": 2,
"name": "tpep_pickup_datetime",
"type": "timestamptz",
"required": false
} ... ],
"schema-id": 0,
"identifier-field-ids": []
},
"partition-spec": []
}
}
Also, it is very easy to accidentally drop the table:
pyiceberg drop table nyc.taxis
And if we try to query the table:
pyiceberg --output json describe nyc.taxis
We can see that it is not available anymore:
{
"type": "NoSuchTableError",
"message": "Table or namespace does not exist: nyc.taxis"
}
Hope you’ve learned something about PyIceberg
today. If you run into anything, please don’t hesitate to reach out on the #python
channel on Slack, or create an issue on Github.