GETTING STARTED
PyIceberg is a native Python implementation of Apache Iceberg that enables access to a wide range of scientific and analytical systems in the Python ecosystem.
PyIceberg supports multiple catalog implementations out of the box with a number of options for configuring the catalog.
The easiest way to get started using PyIceberg is to install via pip:
pip install -U "pyiceberg[pyarrow]"
Some catalogs require extras. For instance, to add support for Apache Hive metastore catalogs, add the hive extra. You can find more details on installation options in the Installing PyIceberg recipe.
Catalog configurations
The recommended way to configure PyIceberg is with a YAML config file. You can also use environment variables or pass a dict of properties programmatically to the catalog API. The following examples show how to configure a catalog using each of these methods.
YAML configuration
For a standardized configuration that allows you to define multiple catalogs and access them without needing to define them in code, create a .pyiceberg.yaml file in your home directory. You can change the directory by setting the PYICEBERG_HOME environment variable.
# $HOME/.pyiceberg.yaml
catalog:
  sandbox:
    uri: https://api.www.tabular.io/ws
    warehouse: sandbox
    credential: <YOUR-CREDENTIAL>
  my_hive_catalog:
    uri: thrift://hive-metastore:9083
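The file lookup described above can be sketched in a few lines of Python. This is a minimal illustration of the stated rule (home directory by default, overridden by PYICEBERG_HOME), not PyIceberg's own implementation; the helper name config_path and its parameters are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def config_path(environ: Mapping[str, str] = os.environ,
                home: Optional[Path] = None) -> Path:
    """Return where .pyiceberg.yaml is expected (hypothetical helper)."""
    override = environ.get("PYICEBERG_HOME")
    if override:
        # PYICEBERG_HOME replaces the home directory as the lookup location.
        directory = Path(override)
    else:
        directory = home if home is not None else Path.home()
    return directory / ".pyiceberg.yaml"


# "/etc/iceberg" and "/home/me" are made-up directories for illustration.
overridden = config_path({"PYICEBERG_HOME": "/etc/iceberg"})
defaulted = config_path({}, home=Path("/home/me"))
```

Passing the environment and home directory as arguments keeps the sketch easy to exercise without touching the real environment.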
PyIceberg selects the type of catalog using the URI scheme. REST catalogs use https (or http), Hive uses thrift, and JDBC uses jdbc.
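As a rough illustration of scheme-based selection, the mapping stated above can be written as a small lookup. This is a sketch of the idea only, not PyIceberg's internal logic; the function name guess_catalog_type and the type labels on the right-hand side are assumptions made for the example.

```python
from typing import Optional

# Scheme-to-type mapping taken from the text above; the labels are illustrative.
SCHEME_TO_CATALOG_TYPE = {
    "https": "rest",
    "http": "rest",
    "thrift": "hive",
    "jdbc": "jdbc",
}


def guess_catalog_type(uri: str) -> Optional[str]:
    """Return the catalog type implied by the URI scheme, or None if unknown."""
    # Everything before the first ':' is treated as the scheme.
    scheme = uri.split(":", 1)[0].lower()
    return SCHEME_TO_CATALOG_TYPE.get(scheme)


rest_type = guess_catalog_type("https://api.www.tabular.io/ws")
hive_type = guess_catalog_type("thrift://hive-metastore:9083")
```

A URI whose scheme is not in the mapping returns None in this sketch, which a caller could treat as a signal that the catalog type has to be stated some other way.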
Each catalog name under catalog can be passed to the load_catalog function exactly as it is defined in the config file. With the above definitions, the catalogs can be loaded programmatically as follows:
from pyiceberg.catalog import load_catalog
rest = load_catalog('sandbox')
hive = load_catalog('my_hive_catalog')
If you name a catalog default, that catalog will be loaded when no name is provided.
Environment variable configuration
PyIceberg can also define catalogs via environment variables, which is useful in scheduling applications and containerized environments where loading a file may prove difficult. The environment variables must be formatted as PYICEBERG_CATALOG__<CATALOG_NAME>__<PROPERTY> and are interpreted similarly to the YAML configuration.
# REST Catalog Example
export PYICEBERG_CATALOG__SANDBOX__URI=https://api.www.tabular.io/ws
export PYICEBERG_CATALOG__SANDBOX__CREDENTIAL=<YOUR-CREDENTIAL>
export PYICEBERG_CATALOG__SANDBOX__WAREHOUSE=sandbox
# Hive Metastore Example
export PYICEBERG_CATALOG__MY_HIVE_CATALOG__URI=thrift://hive-metastore:9083
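To make the naming scheme concrete, the sketch below shows one way variables of the form PYICEBERG_CATALOG__<CATALOG_NAME>__<PROPERTY> could be folded into the same nested structure as the YAML file. It is an illustration under assumptions, not PyIceberg's actual parser: the helper catalogs_from_env is hypothetical, and lowercasing names and properties is an assumption made so the result lines up with the YAML example.

```python
from typing import Dict, Mapping

PREFIX = "PYICEBERG_CATALOG__"


def catalogs_from_env(environ: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Group PYICEBERG_CATALOG__<NAME>__<PROPERTY> variables by catalog name."""
    catalogs: Dict[str, Dict[str, str]] = {}
    for key, value in environ.items():
        if not key.startswith(PREFIX):
            continue
        # Split "<NAME>__<PROPERTY>" on the first double underscore.
        name, separator, prop = key[len(PREFIX):].partition("__")
        if not separator or not name or not prop:
            continue  # Not in the expected format; ignore it.
        # Assumption: lowercase both parts to match the YAML keys.
        catalogs.setdefault(name.lower(), {})[prop.lower()] = value
    return catalogs


example_env = {
    "PYICEBERG_CATALOG__SANDBOX__URI": "https://api.www.tabular.io/ws",
    "PYICEBERG_CATALOG__SANDBOX__WAREHOUSE": "sandbox",
    "PYICEBERG_CATALOG__MY_HIVE_CATALOG__URI": "thrift://hive-metastore:9083",
    "PATH": "/usr/bin",  # Unrelated variables are skipped.
}
parsed = catalogs_from_env(example_env)
```

With these inputs, parsed contains a sandbox entry and a my_hive_catalog entry, mirroring the two catalogs in the YAML example.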
Programmatic configuration
PyIceberg also makes it possible to programmatically configure catalogs by passing properties to the load_catalog function.
from pyiceberg.catalog import load_catalog
rest = load_catalog(
    "sandbox",
    uri="https://api.www.tabular.io/ws",
    warehouse="sandbox",
)
Testing catalog configuration
To validate your YAML or environment catalog settings, you can use the PyIceberg CLI.
$ pyiceberg list
default
examples
$ pyiceberg list examples
examples.backblaze_drive_stats
examples.nyc_taxi_locations
examples.nyc_taxi_yellow
$ pyiceberg show
You can see more examples in the PyIceberg CLI recipe.
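The same CLI check can be scripted, for example as a smoke test before a scheduled job runs. The sketch below wraps the pyiceberg list command shown above; the helper name check_catalog is hypothetical, and it assumes the CLI exits with a non-zero status when the catalog cannot be reached.

```python
import shutil
import subprocess
from typing import Optional


def check_catalog(*args: str, timeout: float = 30.0) -> Optional[bool]:
    """Run `pyiceberg list [args]`; return None if the CLI is not installed."""
    executable = shutil.which("pyiceberg")
    if executable is None:
        return None
    try:
        result = subprocess.run(
            [executable, "list", *args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat a hung connection as a failed check.
        return False
    # Assumption: a zero exit status means the catalog responded.
    return result.returncode == 0


status = check_catalog()
```

Here status is None when the CLI is missing, True when the listing succeeded, and False otherwise.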