Configuring Python


PyIceberg is a native Python implementation of Apache Iceberg that makes Iceberg tables accessible to the wide range of scientific and analytical tools in the Python ecosystem.

PyIceberg supports multiple catalog implementations out of the box with a number of options for configuring the catalog.

The easiest way to get started with PyIceberg is to install it with pip:

pip install -U "pyiceberg[pyarrow]"

Some catalogs require extras. For instance, to add support for Apache Hive metastore catalogs, add the hive extra, as in "pyiceberg[pyarrow,hive]". You can find more details on installation options in the Installing PyIceberg recipe.

Catalog configurations

The recommended way to configure PyIceberg is with a YAML config file. You can also use environment variables or pass properties programmatically to the catalog API. The following sections show how to configure a catalog with each of these methods.

YAML configuration

For a standardized configuration that allows you to define multiple catalogs and access them without needing to define them in code, create a .pyiceberg.yaml file in your home directory. You can change the directory by setting the PYICEBERG_HOME environment variable.

# $HOME/.pyiceberg.yaml

catalog:
  sandbox:
    uri: https://api.www.tabular.io/ws
    warehouse: sandbox
    credential: <YOUR-CREDENTIAL>

  my_hive_catalog:
    uri: thrift://hive-metastore:9083
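
The lookup rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this recipe, not PyIceberg's own code: the helper name config_path is hypothetical, and the library may check additional locations.

```python
import os
from pathlib import Path


def config_path(env=None):
    """Return where this sketch expects .pyiceberg.yaml to live.

    Hypothetical helper: uses PYICEBERG_HOME when it is set,
    otherwise falls back to the user's home directory.
    """
    env = os.environ if env is None else env
    base = env.get("PYICEBERG_HOME")
    directory = Path(base) if base else Path.home()
    return directory / ".pyiceberg.yaml"


print(config_path())
```

Printing the result is a quick way to check which file you should be editing when a catalog definition does not seem to take effect.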

PyIceberg selects the type of catalog using the URI scheme. REST catalogs use https (or http), Hive uses thrift, and JDBC uses jdbc.
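
To make the scheme-based selection concrete, here is a minimal sketch using only the standard library. The function name guess_catalog_type and the returned labels are assumptions made for illustration; the mapping simply mirrors the schemes listed above and is not the library's implementation.

```python
from urllib.parse import urlparse

# Mapping taken from the schemes named in this recipe (illustrative only).
SCHEME_TO_TYPE = {
    "http": "rest",
    "https": "rest",
    "thrift": "hive",
    "jdbc": "jdbc",
}


def guess_catalog_type(uri):
    """Return a catalog type label for a URI, or None if the scheme is unknown."""
    scheme = urlparse(uri).scheme.lower()
    return SCHEME_TO_TYPE.get(scheme)


print(guess_catalog_type("https://api.www.tabular.io/ws"))
print(guess_catalog_type("thrift://hive-metastore:9083"))
```

A URI with an unrecognized scheme returns None in this sketch, which is the case where you would need to state the catalog type explicitly.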

Each catalog defined under the catalog key can be loaded by name with the load_catalog function. With the definitions above, the catalogs can be loaded programmatically as follows:

from pyiceberg.catalog import load_catalog

rest = load_catalog('sandbox')
hive = load_catalog('my_hive_catalog')

If you name a catalog default, that catalog will be loaded when no name is provided.

Environment variable configuration

PyIceberg can also define catalogs through environment variables. This is useful in scheduling applications and containerized environments, where loading a file may prove difficult. The variables must be named PYICEBERG_CATALOG__<CATALOG_NAME>__<PROPERTY> and are interpreted similarly to the YAML configuration.

# REST Catalog Example
export PYICEBERG_CATALOG__SANDBOX__URI=https://api.www.tabular.io/ws
export PYICEBERG_CATALOG__SANDBOX__CREDENTIAL=<YOUR-CREDENTIAL>
export PYICEBERG_CATALOG__SANDBOX__WAREHOUSE=sandbox

# Hive Metastore Example
export PYICEBERG_CATALOG__MY_HIVE_CATALOG__URI=thrift://hive-metastore:9083
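
The naming convention can be illustrated with a small parser that groups matching variables by catalog name. This is a sketch of the convention only: the function catalogs_from_env is hypothetical, and lowercasing names and properties is an assumption, so the library's own normalization may differ.

```python
import os

PREFIX = "PYICEBERG_CATALOG__"


def catalogs_from_env(env=None):
    """Group PYICEBERG_CATALOG__<NAME>__<PROPERTY> variables by catalog name."""
    env = os.environ if env is None else env
    catalogs = {}
    for key, value in env.items():
        if not key.startswith(PREFIX):
            continue
        # Split the remainder at the first double underscore: NAME, then PROPERTY.
        name, sep, prop = key[len(PREFIX):].partition("__")
        if not sep or not name or not prop:
            continue  # malformed variable, ignored in this sketch
        # Lowercasing is an assumption of this sketch.
        catalogs.setdefault(name.lower(), {})[prop.lower()] = value
    return catalogs


print(catalogs_from_env())
```

Running it in a shell where the variables above are exported should show one entry per catalog, which makes typos in variable names easy to spot.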

Programmatic configuration

PyIceberg also makes it possible to programmatically configure catalogs by passing properties to the load_catalog function.

from pyiceberg.catalog import load_catalog

rest = load_catalog("sandbox",
    uri="https://api.www.tabular.io/ws",
    warehouse="sandbox")
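
When the properties come from several places, it can help to build them as a dict first and unpack it into load_catalog. The sketch below assumes a hypothetical SANDBOX_CREDENTIAL environment variable for the secret; the helper names are illustrative, and PyIceberg is imported lazily so the property-building part can be checked without it.

```python
import os


def sandbox_properties(env=None):
    """Build the keyword properties for the sandbox catalog."""
    env = os.environ if env is None else env
    props = {
        "uri": "https://api.www.tabular.io/ws",
        "warehouse": "sandbox",
    }
    # SANDBOX_CREDENTIAL is a hypothetical variable name used for illustration.
    credential = env.get("SANDBOX_CREDENTIAL")
    if credential:
        props["credential"] = credential
    return props


def load_sandbox():
    """Load the catalog by unpacking the properties as keyword arguments."""
    from pyiceberg.catalog import load_catalog  # requires PyIceberg to be installed

    return load_catalog("sandbox", **sandbox_properties())
```

Keeping the secret out of the source file this way avoids committing credentials while still configuring the catalog entirely in code.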

Testing catalog configuration

To validate your YAML or environment catalog settings, you can use the PyIceberg CLI.

$ pyiceberg list
default
examples

$ pyiceberg list examples
examples.backblaze_drive_stats
examples.nyc_taxi_locations
examples.nyc_taxi_yellow

$ pyiceberg show

You can see more examples in the PyIceberg CLI recipe.
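
A similar check can be done from Python. The sketch below assumes a catalog object that exposes list_namespaces and list_tables returning identifier tuples, as PyIceberg catalogs do; the helper summarize_catalog is hypothetical.

```python
def summarize_catalog(catalog):
    """Map each namespace name to the dotted names of its tables."""
    summary = {}
    for namespace in catalog.list_namespaces():
        # Identifiers are tuples of name parts; join them for display.
        name = ".".join(namespace)
        summary[name] = [".".join(table) for table in catalog.list_tables(namespace)]
    return summary
```

Passing it the result of load_catalog for one of your configured catalogs should produce a listing comparable to the CLI output above; an exception at this point usually indicates a problem with the catalog settings.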
