Work with data in Pandas

PYICEBERG

This recipe shows how to fetch data from an Iceberg table with pure Python into a Pandas dataframe.

Reading with PyIceberg

The recommended way to read from an Iceberg table is to connect to an Iceberg catalog. Refer to the PyIceberg configuration recipe to configure the catalogs in your PyIceberg environment.

The first step is to load a catalog.

from pyiceberg.catalog import load_catalog
catalog = load_catalog('sandbox')

Using that catalog, call load_table to get a table instance and then use the scan method to configure what to read from the table. In this example, both row_filter and selected_fields are used to filter down to just the desired rows and project columns. Finally, call to_pandas to run the scan and produce a Pandas dataframe.

table = catalog.load_table('examples.nyc_taxi_yellow')

df = table.scan(
  row_filter="pickup_time >= '2021-01-01T00:00:00+00:00' and pickup_time < '2021-02-01T00:00:00+00:00'",
  selected_fields=("pickup_time", "trip_distance", "tip_amount"),
).to_pandas()

This will read only the relevant files that may contain rows that match the row filter. The number of files that are being fetched from the object store is much lower than loading the entire table! And passing just the columns that are needed reduces memory usage.

Once you have a dataframe, you can use it like any other.

df
#                       pickup_time  trip_distance  tip_amount
# 0       2021-01-22 15:19:02+00:00           1.88        4.14
# 1       2021-01-22 15:19:04+00:00           2.16        2.86
# 2       2021-01-22 15:19:05+00:00           6.70        6.65
# 3       2021-01-22 15:19:06+00:00           1.05        1.86
# 4       2021-01-22 15:19:06+00:00           1.93        1.00
# ...                           ...            ...         ...
# 1369759 2021-01-26 10:34:00+00:00          15.43        0.00
# 1369760 2021-01-26 10:34:00+00:00           0.55        0.00
# 1369761 2021-01-26 10:34:05+00:00           2.29        0.00
# 1369762 2021-01-26 10:34:22+00:00          17.73        0.00
# 1369763 2021-01-26 10:35:03+00:00           2.68        0.00
# [1369764 rows x 3 columns]

Let’s check which day of the week gives us the most tips per mile:

df = df.query('tip_amount > 0')
df['day_of_week'] = df['pickup_time'].dt.day_name()
df['tip_per_mile'] = df['trip_distance'] / df['tip_amount'] 

df.groupby(['day_of_week'])['tip_per_mile'].mean()

# day_of_week
# Friday       1.484211
# Monday       1.822972
# Saturday     2.382211
# Sunday       2.000384
# Thursday     1.660562
# Tuesday      1.650521
# Wednesday    2.060437
# Name: tip_per_mile, dtype: float64

This analysis tells us that Saturday will probably give you the highest tips.