PYICEBERG
The PyIceberg CLI allows you to easily inspect table metadata through Apache Iceberg catalogs. This recipe shows commonly-used commands.
First, make sure that you have PyIceberg installed and configured.
You can list the available commands using the --help
option:
pyiceberg --help
Usage: pyiceberg [OPTIONS] COMMAND [ARGS]...
Options:
--catalog TEXT
--verbose BOOLEAN
--output [text|json]
--uri TEXT
--credential TEXT
--help Show this message and exit.
Commands:
describe Describe a namespace or a table.
drop Operations to drop a namespace or table.
files List all the files of the table.
list List tables or namespaces.
location Return the location of the table.
properties Properties on tables/namespaces.
rename Rename a table.
schema Get the schema of the table.
spec Return the partition spec of the table.
uuid Return the UUID of the table.
version Print pyiceberg version.
Exploring a catalog
The CLI is great for exploring the namespaces and tables in an Iceberg catalog. You can easily find what namespaces and tables exist, and then dive deeper into the details. For example, you can list a table’s files, show current and historical schemas, and inspect partition configurations.
The list
command shows the contents of Iceberg namespaces. Without an argument, it shows the top level:
pyiceberg list
# default
# examples
Add a name to list the contents of that namespace:
pyiceberg list examples
# examples.backblaze_drive_stats
# examples.nyc_taxi_locations
# examples.nyc_taxi_yellow
The describe
command shows a summary of an object in the catalog. For a namespace, it shows configuration properties:
pyiceberg describe examples
location s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/
Namespace properties
You can also get and set properties using the properties
command:
pyiceberg properties get examples
location s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/
pyiceberg properties get examples location
s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/
For this namespace, the only property is its location. The location determines where newly-created tables are stored.
To set a property, use the set
subcommand:
pyiceberg properties set namespace examples created "$(date)"
This creates a new property named “created” and assigns the local timestamp to it.
Inspecting tables
When run on a table, describe
shows a summary of table metadata.
pyiceberg describe examples.nyc_taxi_yellow
Table format version 1
Metadata location s3://tabular-public-us-west-2-dev/bb30733e-8769-4dab-aa1b-e76245bb2bd4/5f29c1cf-32cd-463d-8566-3171ebfa0b2a/metadata/00023-c92a1992-9316-442a-8546-c959115b0f5e.gz.metadata.json
Table UUID 5f29c1cf-32cd-463d-8566-3171ebfa0b2a
Last Updated 1687192820912
Partition spec [
1000: pickup_time_month: month(2)
]
Sort order [
month(2) ASC NULLS FIRST
9 ASC NULLS FIRST
2 ASC NULLS FIRST
3 ASC NULLS FIRST
]
Current schema Schema, id=0
├── 1: vendor_id: optional int (TPEP provider code; 1=Creative Mobile Technologies, LLC; 2=VeriFone Inc.)
├── 2: pickup_time: optional timestamptz (Date and time when the meter was engaged)
├── 3: pickup_location_id: optional int (Location ID where the meter was engaged)
├── 4: dropoff_time: optional timestamptz (Date and time when the meter was disengaged)
├── 5: dropoff_location_id: optional int (Location ID where the meter was disengaged)
├── 6: passenger_count: optional int (Number of passengers in the vehicle (driver entered))
├── 7: trip_distance: optional double (Elapsed trip distance in miles reported by the meter)
├── 8: ratecode_id: optional int (Final rate code in effect when the trip ended; 1=Standard Rate; 2=JFK; 3=Newark; 4=Nassau or Westchester; 5=Negotiated fare; 6=Group ride)
├── 9: payment_type: optional int (How the passenger paid; 1=Credit card; 2=Cash; 3=No charge; 4=Dispute; 5=Unknown; 6=Voided trip)
├── 10: total_amount: optional double (Total amount charged to passengers; cash tips not included)
├── 11: fare_amount: optional double (Time-and-distance fare in USD calculated by the meter)
├── 12: tip_amount: optional double (Tip amount; automatically populated for credit card tips; cash tips not included)
├── 13: tolls_amount: optional double (Total amount of all tolls paid in trip)
├── 14: mta_tax: optional double (MTA tax automatically triggered based on the metered rate in use; $0.50)
├── 15: improvement_surcharge: optional double (Improvement surcharge assessed trips at the flag drop; $0.30)
├── 16: congestion_surcharge: optional double (Congestion surcharge)
├── 17: airport_fee: optional double (Airport fee)
├── 18: extra_surcharges: optional double (Misc. extras and surcharges; $0.50 and $1.00 rush hour and overnight charges)
└── 19: store_and_forward_flag: optional string (Whether the trip record was held in vehicle memory; Y(es)/N(o))
Current snapshot Operation.REPLACE: id=3144914115538881591, parent_id=5771978746571165285, schema_id=0
Snapshots Snapshots
└── Snapshot 3144914115538881591, schema 0:
s3://tabular-public-us-west-2-dev/bb30733e-8769-4dab-aa1b-e76245bb2bd4/5f29c1cf-32cd-463d-8566-3171ebfa0b2a/metadata/snap-3144914115538881591-1-080ee876-d5b9-43d7-9917-d534a9f3ce…
Properties optimizer.enabled true
manifest-rewrite.submitted 2023-02-22T17:51:25.599481397Z
optimizer.lastStartTs 1687192800358
write.delete.parquet.compression-codec zstd
write.metadata.compression-codec gzip
write.parquet.compression-codec zstd
comment NYC Yellow Taxi trip records dataset from the NYC Taxi & Limousine Commission; https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
write.distribution-mode range
write.object-storage.enabled true
write.summary.partition-limit 100
This exposes a lot of information about the table, such as:
- Format version: Whether the table uses the initial v1 Iceberg specification or the newer v2 version that also supports merge-on-read deletes
- Table location: Where the table is stored on the object store
- Table UUID: The current UUID that identifies the current version of the table
- Last updated: When the table was updated for the last time since epoch, in milliseconds
- Partition spec: The current partition strategy of the table
- Sort order: When writing to the table, how the data is being sorted
- Current schema: The current schema of the table
- Current snapshot: The latest snapshot of the table, including the operation that was done, and the parent snapshot id
- Properties: The properties of the table
You can also use more specific commands, such as schema
, version
, and files
to show just one part of table metadata.
Scripting with the PyIceberg CLI
If you’re scripting anything, another helpful feature is to output JSON instead of text to make the output easier to parse:
pyiceberg --output json properties get table examples.nyc_taxi_yellow
{
"optimizer.enabled": "true",
"manifest-rewrite.submitted": "2023-02-22T17:51:25.599481397Z",
"optimizer.lastStartTs": "1687192800358",
"write.delete.parquet.compression-codec": "zstd",
"write.metadata.compression-codec": "gzip",
"write.parquet.compression-codec": "zstd",
"comment": "NYC Yellow Taxi trip records dataset",
"write.distribution-mode": "range",
"write.object-storage.enabled": "true",
"write.summary.partition-limit": "100"
}
When displaying table metadata, this output format conforms to the JSON serialization requirements from the Iceberg table spec and the REST catalog protocol definition.