PYICEBERG
Apache Iceberg is language- and engine-agnostic, meaning it was designed to be portable so that any language or engine can interact with Iceberg tables. PyIceberg is the official Python client of the Iceberg project and an easy way to get started with Iceberg. It provides a lightweight option to query data from an Iceberg table for further analysis using your favorite Python tools for data science and AI.
PyIceberg has full read support and integrates with other projects, including Polars, pandas, and DuckDB. Write support is underway; the best way to track its progress is to follow the repository or the documentation website.
PyIceberg is published on PyPI and can be installed with pip install. Support for the REST catalog comes out of the box, and the PyArrow extra is the easiest way to get started.
pip3 install -U "pyiceberg[pyarrow]"
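Once the package is installed, a first read through a REST catalog might look like the following minimal sketch. The helper names `rest_catalog_properties` and `read_rows`, the catalog name "default", and any URI or table name you pass in are illustrative assumptions, not part of PyIceberg; only `load_catalog`, `load_table`, `scan`, and `to_arrow` come from the library. It also assumes the PyArrow extra is installed, since `to_arrow` returns a PyArrow table.

```python
from typing import Optional


def rest_catalog_properties(uri: str, token: Optional[str] = None) -> dict:
    """Build the property dict for a REST catalog (hypothetical helper)."""
    props = {"type": "rest", "uri": uri}
    if token is not None:
        # Bearer token, only needed if the catalog requires authentication.
        props["token"] = token
    return props


def read_rows(uri: str, table_name: str, limit: int = 10):
    """Load a table through a REST catalog and return up to `limit` rows
    as a PyArrow table (hypothetical helper)."""
    # Imported lazily so this file can be loaded without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("default", **rest_catalog_properties(uri))
    table = catalog.load_table(table_name)  # e.g. "namespace.table"
    return table.scan(limit=limit).to_arrow()
```

Calling `read_rows` with the URI of your own REST catalog and a table identifier such as "namespace.table" returns an Arrow table that can be handed to other Python tools.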
Installing optional extensions (extras)
Optional packages can be installed depending on your needs to keep the installation lightweight.
Extras for FileIO (to fetch the data):
Option | Description |
---|---|
pyarrow | PyArrow filesystem backend (supports S3, HDFS, and others) |
s3fs | fsspec implementation for AWS S3 |
adlfs | fsspec implementation for Azure ADLS |
gcsfs | fsspec implementation for Google Cloud Storage |
Extras to add catalog implementations (the REST catalog is not listed because it is supported out of the box):
Option | Description |
---|---|
sql-postgres | Support for a Postgres-backed metastore |
hive | Support for Apache Hive Metastore |
glue | Support for AWS Glue |
dynamodb | Support for AWS DynamoDB |
Extras to add integration with your favorite data analysis toolkit:
Option | Description |
---|---|
pandas | Support to read directly into a pandas DataFrame |
arrow | Support to read directly into an Arrow table |
duckdb | Support to query the data using DuckDB |
ray | Support to convert the data into a Ray dataset |
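Because extras are optional, it can be handy to check which backing packages are importable in the current environment before picking a FileIO or toolkit. The sketch below uses only the standard library. The mapping from extra name to module name is an assumption based on the package names in the tables above, and `available_extras` is a hypothetical helper, not a PyIceberg function.

```python
from importlib.util import find_spec

# Extra name -> top-level module it is assumed to make importable.
EXTRA_MODULES = {
    "pyarrow": "pyarrow",
    "s3fs": "s3fs",
    "adlfs": "adlfs",
    "gcsfs": "gcsfs",
    "pandas": "pandas",
    "duckdb": "duckdb",
    "ray": "ray",
}


def available_extras(modules=EXTRA_MODULES):
    """Return {extra: True/False} depending on whether its module can be found."""
    return {
        extra: find_spec(module) is not None
        for extra, module in modules.items()
    }


# Print a short report for the current environment.
for extra, installed in available_extras().items():
    print(f"{extra:10s} {'installed' if installed else 'missing'}")
```

The check only looks for the module on the import path; it does not verify versions or that PyIceberg itself was installed with that extra.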
You can mix and match the options according to your needs. For example, to add support for ADLS and DuckDB, install both the duckdb and adlfs extras:
pip3 install -U "pyiceberg[duckdb,adlfs]"
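If the set of extras is decided in a script (for example, per environment), the same requirement string can be assembled programmatically. This is a minimal sketch using only the standard library; `install_command` is a hypothetical helper, and it only builds the command without running it.

```python
import sys


def install_command(extras):
    """Build a pip command that installs pyiceberg with the given extras."""
    unique = list(dict.fromkeys(extras))  # drop duplicates, keep order
    requirement = f"pyiceberg[{','.join(unique)}]" if unique else "pyiceberg"
    # Use the running interpreter's pip so the right environment is targeted.
    return [sys.executable, "-m", "pip", "install", "-U", requirement]


cmd = install_command(["duckdb", "adlfs"])
print(" ".join(cmd[1:]))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to perform the installation.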