Getting started with PyIceberg CLI

PYICEBERG

The PyIceberg CLI allows you to easily inspect table metadata through Apache Iceberg catalogs. This recipe shows commonly-used commands.

First, make sure that you have PyIceberg installed and configured.

You can list the available commands using the --help option:

pyiceberg --help

Usage: pyiceberg [OPTIONS] COMMAND [ARGS]...
Options:
  --catalog TEXT
  --verbose BOOLEAN
  --output [text|json]
  --uri TEXT
  --credential TEXT
  --help                Show this message and exit.
Commands:
  describe    Describe a namespace or a table.
  drop        Operations to drop a namespace or table.
  files       List all the files of the table.
  list        List tables or namespaces.
  location    Return the location of the table.
  properties  Properties on tables/namespaces.
  rename      Rename a table.
  schema      Get the schema of the table.
  spec        Return the partition spec of the table.
  uuid        Return the UUID of the table.
  version     Print pyiceberg version.

Exploring a catalog

The CLI is great for exploring the namespaces and tables in an Iceberg catalog. You can easily find what namespaces and tables exist, and then dive deeper into the details. For example, you can list a table’s files, show current and historical schemas, and inspect partition configurations.

The list command shows the contents of Iceberg namespaces. Without an argument, it shows the top level:

pyiceberg list
# default
# examples

Add a name to list the contents of that namespace:

pyiceberg list examples
# examples.backblaze_drive_stats
# examples.nyc_taxi_locations
# examples.nyc_taxi_yellow

The describe command shows a summary of an object in the catalog. For a namespace, it shows configuration properties:

pyiceberg describe examples
location  s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/

Namespace properties

You can also get and set properties using the properties command:

pyiceberg properties get examples
location  s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/

pyiceberg properties get examples location
s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/

For this namespace, the only property is its location. The location determines where newly-created tables are stored.

To set a property, use the set subcommand:

pyiceberg properties set namespace examples created "$(date)"

This creates a new property named “created” and assigns the local timestamp to it.

Inspecting tables

When run on a table, describe shows a summary of table metadata.

pyiceberg describe examples.nyc_taxi_yellow

Table format version  1                                                                                                                                                                                      
Metadata location     s3://tabular-public-us-west-2-dev/bb30733e-8769-4dab-aa1b-e76245bb2bd4/5f29c1cf-32cd-463d-8566-3171ebfa0b2a/metadata/00023-c92a1992-9316-442a-8546-c959115b0f5e.gz.metadata.json       
Table UUID            5f29c1cf-32cd-463d-8566-3171ebfa0b2a                                                                                                                                                   
Last Updated          1687192820912                                                                                                                                                                          
Partition spec        [                                                                                                                                                                                      
                        1000: pickup_time_month: month(2)                                                                                                                                                    
                      ]                                                                                                                                                                                      
Sort order            [                                                                                                                                                                                      
                        month(2) ASC NULLS FIRST                                                                                                                                                             
                        9 ASC NULLS FIRST                                                                                                                                                                    
                        2 ASC NULLS FIRST                                                                                                                                                                    
                        3 ASC NULLS FIRST                                                                                                                                                                    
                      ]                                                                                                                                                                                      
Current schema        Schema, id=0                                                                                                                                                                           
                      ├── 1: vendor_id: optional int (TPEP provider code; 1=Creative Mobile Technologies, LLC; 2=VeriFone Inc.)                                                                              
                      ├── 2: pickup_time: optional timestamptz (Date and time when the meter was engaged)                                                                                                    
                      ├── 3: pickup_location_id: optional int (Location ID where the meter was engaged)                                                                                                      
                      ├── 4: dropoff_time: optional timestamptz (Date and time when the meter was disengaged)                                                                                                
                      ├── 5: dropoff_location_id: optional int (Location ID where the meter was disengaged)                                                                                                  
                      ├── 6: passenger_count: optional int (Number of passengers in the vehicle (driver entered))                                                                                            
                      ├── 7: trip_distance: optional double (Elapsed trip distance in miles reported by the meter)                                                                                           
                      ├── 8: ratecode_id: optional int (Final rate code in effect when the trip ended; 1=Standard Rate; 2=JFK; 3=Newark; 4=Nassau or Westchester; 5=Negotiated fare; 6=Group ride)           
                      ├── 9: payment_type: optional int (How the passenger paid; 1=Credit card; 2=Cash; 3=No charge; 4=Dispute; 5=Unknown; 6=Voided trip)                                                    
                      ├── 10: total_amount: optional double (Total amount charged to passengers; cash tips not included)                                                                                     
                      ├── 11: fare_amount: optional double (Time-and-distance fare in USD calculated by the meter)                                                                                           
                      ├── 12: tip_amount: optional double (Tip amount; automatically populated for credit card tips; cash tips not included)                                                                 
                      ├── 13: tolls_amount: optional double (Total amount of all tolls paid in trip)                                                                                                         
                      ├── 14: mta_tax: optional double (MTA tax automatically triggered based on the metered rate in use; $0.50)                                                                             
                      ├── 15: improvement_surcharge: optional double (Improvement surcharge assessed trips at the flag drop; $0.30)                                                                          
                      ├── 16: congestion_surcharge: optional double (Congestion surcharge)                                                                                                                   
                      ├── 17: airport_fee: optional double (Airport fee)                                                                                                                                     
                      ├── 18: extra_surcharges: optional double (Misc. extras and surcharges; $0.50 and $1.00 rush hour and overnight charges)                                                               
                      └── 19: store_and_forward_flag: optional string (Whether the trip record was held in vehicle memory; Y(es)/N(o))                                                                       
Current snapshot      Operation.REPLACE: id=3144914115538881591, parent_id=5771978746571165285, schema_id=0                                                                                                  
Snapshots             Snapshots                                                                                                                                                                              
                      └── Snapshot 3144914115538881591, schema 0:                                                                                                                                            
                          s3://tabular-public-us-west-2-dev/bb30733e-8769-4dab-aa1b-e76245bb2bd4/5f29c1cf-32cd-463d-8566-3171ebfa0b2a/metadata/snap-3144914115538881591-1-080ee876-d5b9-43d7-9917-d534a9f3ce…
Properties            optimizer.enabled                       true                                                                                                                                           
                      manifest-rewrite.submitted              2023-02-22T17:51:25.599481397Z                                                                                                                 
                      optimizer.lastStartTs                   1687192800358                                                                                                                                  
                      write.delete.parquet.compression-codec  zstd                                                                                                                                           
                      write.metadata.compression-codec        gzip                                                                                                                                           
                      write.parquet.compression-codec         zstd                                                                                                                                           
                      comment                                 NYC Yellow Taxi trip records dataset from the NYC Taxi & Limousine Commission; https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page   
                      write.distribution-mode                 range                                                                                                                                          
                      write.object-storage.enabled            true                                                                                                                                           
                      write.summary.partition-limit           100    

This exposes a lot of information about the table, such as:

  • Format version: Whether the table uses the initial v1 Iceberg specification or the newer v2 version that also supports merge-on-read deletes
  • Table location: Where the table is stored on the object store
  • Table UUID: The current UUID that identifies the current version of the table
  • Last updated: When the table was updated for the last time since epoch, in milliseconds
  • Partition spec: The current partition strategy of the table
  • Sort order: When writing to the table, how the data is being sorted
  • Current schema: The current schema of the table
  • Current snapshot: The latest snapshot of the table, including the operation that was done, and the parent snapshot id
  • Properties: The properties of the table

You can also use more specific commands, such as schemaversion, and files to show just one part of table metadata.

Scripting with the PyIceberg CLI

If you’re scripting anything, another helpful feature is to output JSON instead of text to make the output easier to parse:

pyiceberg --output json properties get table examples.nyc_taxi_yellow
{
    "optimizer.enabled": "true",
    "manifest-rewrite.submitted": "2023-02-22T17:51:25.599481397Z",
    "optimizer.lastStartTs": "1687192800358",
    "write.delete.parquet.compression-codec": "zstd",
    "write.metadata.compression-codec": "gzip",
    "write.parquet.compression-codec": "zstd",
    "comment": "NYC Yellow Taxi trip records dataset",
    "write.distribution-mode": "range",
    "write.object-storage.enabled": "true",
    "write.summary.partition-limit": "100"
}

When displaying table metadata, this output format conforms to the JSON serialization requirements from the Iceberg table spec and the REST catalog protocol definition.