An Introduction to the Iceberg Java API - Part 1

With Iceberg’s integration into a growing number of compute engines, there are many interfaces with which you can use its various powerful features. This blog post is the first part of a series that covers the underlying Java API available for working with Iceberg tables without an engine.

Whether you’re a developer working on a compute engine, an infrastructure engineer maintaining a production Iceberg warehouse, or a data engineer working with Iceberg tables, the Iceberg Java client gives you direct, programmatic access to those tables. The easiest way to try out the Java client is to use the interactive notebook Iceberg - An Introduction to the Iceberg Java API.ipynb, which is available through the docker-compose setup provided in one of our earlier blog posts: Docker, Spark, and Iceberg: The Fastest Way to Try Iceberg! If you already have the tabulario/spark-iceberg image cached locally, make sure you pick up the latest changes by running docker-compose pull.

The Catalog Interface

A catalog in Iceberg is an inventory of Iceberg namespaces and tables. Iceberg comes with many catalog implementations, such as Hive, Glue, and DynamoDB, and a generic REST-based catalog will be included in an upcoming release. You can also plug in your own catalog implementation to inject custom logic specific to your use cases.

For this walkthrough, we will use the JdbcCatalog that comes with Iceberg. Let’s get started!

Loading a Catalog

To load a catalog, you first have to construct a properties map to configure it. The properties required vary depending on the type of catalog you’re using. We’re using a JdbcCatalog backed by a Postgres database, so our properties map needs to include the Postgres connection information.

Two properties commonly required by all catalogs are the warehouse location and the file-io implementation. We’ll use a local directory as our warehouse location and HadoopFileIO as our catalog’s file-io implementation.

Note: To learn more about the file-io abstraction in Iceberg, check out one of our earlier blog posts that provides an excellent overview: Iceberg FileIO: Cloud Native Tables.

Let’s go ahead and generate a map of catalog properties to configure our JdbcCatalog.

import java.util.HashMap;
import java.util.Map;

import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.jdbc.JdbcCatalog;
import org.apache.iceberg.hadoop.HadoopFileIO;

Map<String, String> properties = new HashMap<>();
properties.put(CatalogProperties.CATALOG_IMPL, JdbcCatalog.class.getName());
properties.put(CatalogProperties.URI, "jdbc:postgresql://postgres:5432/demo_catalog");
properties.put(JdbcCatalog.PROPERTY_PREFIX + "user", "admin");
properties.put(JdbcCatalog.PROPERTY_PREFIX + "password", "password");
properties.put(CatalogProperties.WAREHOUSE_LOCATION, "/home/iceberg/warehouse");
properties.put(CatalogProperties.FILE_IO_IMPL, HadoopFileIO.class.getName());
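Because the properties map is just a Map<String, String>, it can also be written with plain string keys, which is handy when the same settings live in a configuration file. The sketch below is a standalone illustration with no Iceberg dependency. The literal keys ("catalog-impl", "uri", "jdbc.", "warehouse", "io-impl") are my assumption of what the constants above resolve to, and the class name CatalogPropertiesSketch is hypothetical; check both against the Iceberg version you are using.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class CatalogPropertiesSketch {

    // Builds the same configuration as the snippet above, using literal keys.
    // ASSUMPTION: these strings match the values of the CatalogProperties and
    // JdbcCatalog constants; verify them against your Iceberg release.
    public static Map<String, String> jdbcCatalogProperties() {
        Map<String, String> properties = new HashMap<>();
        properties.put("catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog");
        properties.put("uri", "jdbc:postgresql://postgres:5432/demo_catalog");
        properties.put("jdbc.user", "admin");
        properties.put("jdbc.password", "password");
        properties.put("warehouse", "/home/iceberg/warehouse");
        properties.put("io-impl", "org.apache.iceberg.hadoop.HadoopFileIO");
        return properties;
    }

    public static void main(String[] args) {
        // Copy into a TreeMap so the keys print in a stable, sorted order.
        new TreeMap<>(jdbcCatalogProperties())
            .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Using the constants, as in the snippet above, is still the safer choice in application code, since a mistyped constant name fails at compile time while a mistyped string key does not.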

Next, let’s initialize the catalog, setting a name for it and passing it the properties map containing our configuration.

JdbcCatalog catalog = new JdbcCatalog();
catalog.initialize("demo", properties);

That’s it! We now have a catalog instance that includes operations such as listing, creating, renaming, and dropping tables.

Defining a Schema and a Partition Spec

In the next section, we’ll create a table, but first, we must define the table’s schema. Let’s create a simple schema with four columns: level, event_time, message, and call_stack.

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

// Every field carries a unique integer ID; the element type of the list
// column is a nested field and needs its own ID (5) as well.
Schema schema = new Schema(
    Types.NestedField.required(1, "level", Types.StringType.get()),
    Types.NestedField.required(2, "event_time", Types.TimestampType.withZone()),
    Types.NestedField.required(3, "message", Types.StringType.get()),
    Types.NestedField.optional(4, "call_stack", Types.ListType.ofRequired(5, Types.StringType.get()))
);

Additionally, let’s build a partition spec that defines an hourly partition on the event_time column.

import org.apache.iceberg.PartitionSpec;

PartitionSpec spec = PartitionSpec.builderFor(schema)
      .hour("event_time")
      .build();
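To get a feel for what an hourly partition means, it helps to see how a timestamp maps to a partition value. The sketch below is a standalone illustration that does not use Iceberg's own transform classes. It assumes, following the Iceberg table spec, that the hour transform produces the number of whole hours since 1970-01-01 00:00 UTC. The class and method names (HourPartitionSketch, hoursSinceEpoch) are hypothetical.

```java
import java.time.Instant;

public class HourPartitionSketch {

    private static final long SECONDS_PER_HOUR = 3600L;

    // Whole hours between the Unix epoch and the given instant.
    // floorDiv rounds toward negative infinity, so instants before 1970
    // still land in the hour bucket that contains them.
    public static long hoursSinceEpoch(Instant timestamp) {
        return Math.floorDiv(timestamp.getEpochSecond(), SECONDS_PER_HOUR);
    }

    public static void main(String[] args) {
        Instant first = Instant.parse("2022-03-01T10:05:00Z");
        Instant second = Instant.parse("2022-03-01T10:59:59Z");
        Instant third = Instant.parse("2022-03-01T11:00:00Z");

        // Two events in the same clock hour share a partition value.
        System.out.println(hoursSinceEpoch(first) == hoursSinceEpoch(second)); // prints true
        // An event at the top of the next hour starts a new partition.
        System.out.println(hoursSinceEpoch(second) == hoursSinceEpoch(third)); // prints false
    }
}
```

Rows whose event_time values fall in the same hour end up in the same partition, which is what lets a query filtered on a narrow time range skip the data files for every other hour.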

Creating a Table

Using our schema and partition spec, we can now create our table. We’re going to create a “webapp” namespace and create our table identifier in that namespace.

import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;

Namespace namespace = Namespace.of("webapp");
TableIdentifier name = TableIdentifier.of(namespace, "logs");

Now, let’s create our table!

catalog.createTable(name, schema, spec);

If we call the listTables method on our catalog, we can see our newly created table in the list.

import java.util.List;

List<TableIdentifier> tables = catalog.listTables(namespace);
System.out.println(tables);

output:

[webapp.logs]

Dropping a Table

As you would expect, the Catalog interface also includes a method for dropping tables. Let’s use the same table identifier object to drop the table we created in the previous section.

catalog.dropTable(name);

What’s Next

If you enjoyed this post, head over to Part 2 of the series, which covers the core Java API. Query engines commonly use that API to perform table scans, and you can also use it to develop applications that need to interact with Iceberg’s core internals. Also, if you’d like to be a part of the growing Iceberg community or just want to stop in and say hello, check out our community page to learn where to find us!