PyIceberg 0.5.0

Tags: Avro, developer, gzip, lambda functions, PyArrow

September 22, 2023

Happy to announce that PyIceberg 0.5.0 has been released, and is packed with many new features! PyIceberg is a Python implementation for reading Iceberg tables into your favorite engine.

This new release is a big step in the maturity of PyIceberg. You can expect faster queries, ease-of-use improvements, and new features. This release includes:

Support for serverless environments (including AWS Lambda)
Support for schema evolution
HDFS support through PyArrow
Performance:
- Many fixes around Avro performance
- Moving the reading of Avro to Cython (20x speed improvement(!))
Support for the SQLCatalog (JDBC in Java)
Add gzip metadata support
Fix support for UUID columns
Dependencies:
- Bump Pydantic to v2 (improved performance of the JSON (de)serialization)
- Remove the upper bound of PyParsing dependency (blocking a PR in Airflow)
A lot of bugfixes!

Let’s dive into the details of some of them!

Support for serverless environments

Big shout out to Josh Wiley for working on this!

PyIceberg is an excellent candidate to run in an AWS Lambda, for example to set table properties or query a table. Until this release that did not work: PyIceberg relied on Python’s multiprocessing module, which uses OS-level APIs that are not available within a Lambda (due to security constraints). Moving to concurrent.futures fixed this.
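To illustrate the idea, here is a minimal sketch of fanning out I/O-bound work with concurrent.futures instead of multiprocessing. This is not PyIceberg’s actual code: `read_manifest` and `plan_files` are hypothetical names, and the function body is a stand-in for real I/O such as fetching a manifest file.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def read_manifest(path: str) -> int:
    # Hypothetical stand-in for I/O-bound work (e.g. reading one manifest file).
    # Here it just returns the length of the path so the sketch is runnable.
    return len(path)


def plan_files(paths: List[str], max_workers: int = 4) -> List[int]:
    # ThreadPoolExecutor runs the tasks on plain threads.
    # executor.map returns the results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(read_manifest, paths))


if __name__ == "__main__":
    print(plan_files(["a.avro", "bb.avro", "ccc.avro"]))
```

Threads are a good fit here because the work is dominated by waiting on remote storage rather than by CPU time.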

Support for schema evolution

Big thanks to Liwei Li for working on this!

PyIceberg is an excellent candidate to script your schema evolution. Many companies embed the principles of data contracts in their organization. One of them is making sure that you don’t make changes to the table schema that break downstream consumers of the data. In Iceberg, a schema change is an explicit operation on the table, and it can now also be done using PyIceberg:

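from pyiceberg.types import DoubleType, IntegerType

# `table` is a table that has already been loaded from a catalog

# Adding a column
with table.update_schema() as update:
  update.add_column("some_field", IntegerType(), "doc")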

# Renaming
with table.update_schema() as update:
  update.rename("retries", "num_retries")
  # This will rename `confirmed_by` to `exchange`
  update.rename("properties.confirmed_by", "exchange")
  
# Move a column
with table.update_schema() as update:
  update.move_first("symbol")
  update.move_after("bid", "ask")
  # This will move `created_by` before `exchange`
  update.move_before("details.created_by", "details.exchange")

# Updating a type
with table.update_schema() as update:
  # Promote a float to a double
  update.update_column("bid", field_type=DoubleType())
  # Make a field optional
  update.update_column("symbol", required=False)
  # Update the documentation
  update.update_column("symbol", doc="Name of the share on the exchange")

# Deleting a column, which might cause issues down the line
with table.update_schema(allow_incompatible_changes=True) as update:
  update.delete_column("some_field")

For the full API, please refer to the docs.
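The type update above (float to double) is one of the few widening promotions that the Iceberg spec allows: int to long, float to double, and a decimal to a decimal with the same scale and a larger precision. As a rough sketch of that rule, and not how PyIceberg implements it, the check could look like this. The function name and the string representation of types are assumptions made for illustration.

```python
import re

# Primitive widenings allowed by the Iceberg spec.
_WIDENING = {("int", "long"), ("float", "double")}

# Matches type strings such as "decimal(9, 2)".
_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")


def is_allowed_promotion(current: str, new: str) -> bool:
    """Return True if changing a column from `current` to `new` only widens it."""
    if current == new:
        # Not a change at all, so nothing can break.
        return True
    if (current, new) in _WIDENING:
        return True
    old_dec, new_dec = _DECIMAL.fullmatch(current), _DECIMAL.fullmatch(new)
    if old_dec and new_dec:
        old_precision, old_scale = map(int, old_dec.groups())
        new_precision, new_scale = map(int, new_dec.groups())
        # The scale must stay the same; only the precision may grow.
        return new_scale == old_scale and new_precision >= old_precision
    return False


if __name__ == "__main__":
    print(is_allowed_promotion("float", "double"))
    print(is_allowed_promotion("double", "float"))
```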

HDFS support

Big shout out to Luigi Cerone for working on this!

If you’re on HDFS, you can now also read in data. Check the docs for how to configure this.

Improved performance

Big shout out to Rusty Conover for working on this!

There has been a great push on performance in this release, mostly by moving the Avro binary decoder from pure Python to Cython. With the Avro reading itself, we observed a 20x speedup! If your table has many manifest entries, this will improve performance quite a bit.
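To give a feel for the kind of work the decoder does: Avro stores ints and longs as variable-length, zig-zag encoded values, so every number means a small loop over bytes. The pure-Python sketch below shows that loop. It is an illustration of the Avro encoding, not PyIceberg’s decoder, and the function name is made up for this example.

```python
from io import BytesIO
from typing import BinaryIO


def read_avro_long(stream: BinaryIO) -> int:
    """Decode one variable-length, zig-zag encoded Avro int/long."""
    shift = 0
    result = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("unexpected end of stream while reading a varint")
        byte = raw[0]
        # The low 7 bits carry data, least significant group first.
        result |= (byte & 0x7F) << shift
        # The high bit signals that another byte follows.
        if not byte & 0x80:
            break
        shift += 7
    # Undo the zig-zag mapping (0, -1, 1, -2, ... were stored as 0, 1, 2, 3, ...).
    return (result >> 1) ^ -(result & 1)


if __name__ == "__main__":
    print(read_avro_long(BytesIO(b"\x80\x01")))
```

A manifest with many entries runs a loop like this a very large number of times, which is why moving it to compiled code pays off.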

Since we now ship native code, PyIceberg is also published as prebuilt Python wheels, so you don’t have to compile the code locally.

Support for the SQLCatalog

Big shout out to Eric for working on this!

This is the equivalent of Java’s JDBC catalog. This catalog connects to a database (Postgres for now) for storing the namespaces and tables. Check out the docs if you want to learn more.
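As a rough sketch, connecting could look like the following. The catalog name, credentials, host, and database are placeholders, the two helper functions are hypothetical, and the example assumes PyIceberg is installed together with SQLAlchemy and a Postgres driver. The import is kept inside the function so the snippet can be loaded without PyIceberg present.

```python
from typing import Dict


def sql_catalog_properties(user: str, password: str, host: str, database: str) -> Dict[str, str]:
    # "type": "sql" selects the SQL catalog; the URI is a SQLAlchemy-style
    # connection string. All values passed in here are placeholders.
    return {
        "type": "sql",
        "uri": f"postgresql+psycopg2://{user}:{password}@{host}/{database}",
    }


def connect(name: str, properties: Dict[str, str]):
    # Imported lazily so that building the properties does not require PyIceberg.
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **properties)


if __name__ == "__main__":
    props = sql_catalog_properties("username", "password", "localhost", "mydatabase")
    print(props["uri"])
    # catalog = connect("default", props)  # needs a reachable Postgres instance
```

The same two properties can also go in the PyIceberg configuration file instead of being passed in code.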

Support for the UUID column

Big shout out to Jonas Jiang for working on this!

In PyIceberg the UUID type was not properly implemented. This has been fixed in the 0.5.0 release, which also adds tests to make sure that it won’t break again.
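For context, the Iceberg spec stores a UUID as a 16-byte fixed value in big-endian order. The sketch below shows that round trip with Python’s standard uuid module. It only illustrates the representation and is not PyIceberg’s implementation; the function names are made up for this example.

```python
import uuid


def uuid_to_fixed16(value: uuid.UUID) -> bytes:
    # UUID.bytes is the 16-byte big-endian representation.
    return value.bytes


def uuid_from_fixed16(raw: bytes) -> uuid.UUID:
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes for a UUID, got {len(raw)}")
    return uuid.UUID(bytes=raw)


if __name__ == "__main__":
    original = uuid.UUID("12345678-1234-5678-1234-567812345678")
    raw = uuid_to_fixed16(original)
    print(len(raw), uuid_from_fixed16(raw) == original)
```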

Dependencies

It is always important to keep the dependencies up to date for security reasons and to make sure that the latest versions of Python are supported. Dependencies in Python are always a sensitive subject, since if you integrate with other projects the dependencies need to be in line. Jarek Potiuk, Airflow committer and PMC member, has an excellent talk on this. In this release, there are two big changes:

Pydantic 2.0: We took the leap to Pydantic 2.0 because we wanted to integrate with Polars, which is already on version 2. Unfortunately, other projects that use PyIceberg, such as Datahub, are not yet on version 2. There are a lot of breaking changes, so upgrading can be non-trivial. The new version of Pydantic also comes with a major speed improvement, and we’ve seen a 33% increase in speed when parsing JSON.
pyparsing: For the 0.4.0 release we noticed a bug when upgrading to PyParsing 3.1.0, so we pinned the version. This goes against the recommendations in the above-mentioned talk, and it bit us when integrating with Airflow. 0.5.0 removes this pin, which allows us to upgrade the version in Airflow.

Give PyIceberg a try!

The list on this page covers just the highlighted features; there are many more bugfixes and small improvements. The release is available now via pip. For details, please check the docs site. Make sure to give it a try! If you run into anything, feel free to reach out in the #python channel on the Iceberg Slack.
