An Introduction to the Iceberg Java API – Part 1

Tags: catalog, developer, Java, partition spec, schema, table operations

April 1, 2022

With Iceberg’s integration into a growing number of compute engines, there are many interfaces through which you can use its features.
This blog post is the first part of a series that covers the underlying Java API available for working with Iceberg tables without an engine.

Whether you’re a developer working on a compute engine, an infrastructure engineer maintaining a production Iceberg warehouse, or a data engineer working with Iceberg tables, the Iceberg Java client provides valuable functionality for working with those tables. The easiest way to try out the Java client is the interactive notebook “Iceberg - An Introduction to the Iceberg Java API.ipynb”, which is available through the docker-compose setup provided in one of our earlier blog posts, “Docker, Spark, and Iceberg: The Fastest Way to Try Iceberg!”. If you already have the tabulario/spark-iceberg image cached locally, make sure you pick up the latest changes by running docker-compose pull.

The Catalog Interface

A catalog in Iceberg is an inventory of Iceberg namespaces and tables. Iceberg comes with many catalog implementations, such as REST, Hive, Glue, and DynamoDB. It’s even possible to plug in your own catalog implementation to inject custom logic specific to your use cases.
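
To make “inventory” concrete, here is a toy in-memory stand-in. It is not Iceberg’s Catalog interface: the class ToyCatalog, its method signatures, and the metadata path are all hypothetical, and the sketch assumes single-level namespaces. It only mirrors the bookkeeping idea of mapping a namespace-qualified table name to the location of that table’s current metadata.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy class for illustration only.
// This is NOT org.apache.iceberg.catalog.Catalog.
class ToyCatalog {
  // Key: "namespace.table"; value: location of the table's current metadata file.
  private final Map<String, String> tables = new TreeMap<>();

  // Returns true if the table was registered, false if the name was already taken.
  boolean createTable(String namespace, String table, String metadataLocation) {
    // putIfAbsent returns null only when the key was not present before.
    return tables.putIfAbsent(namespace + "." + table, metadataLocation) == null;
  }

  // Lists the qualified names of all tables in the given (single-level) namespace.
  List<String> listTables(String namespace) {
    List<String> result = new ArrayList<>();
    for (String key : tables.keySet()) {
      if (key.startsWith(namespace + ".")) {
        result.add(key);
      }
    }
    return result;
  }

  // Returns true if a table was removed, false if it did not exist.
  boolean dropTable(String namespace, String table) {
    return tables.remove(namespace + "." + table) != null;
  }

  public static void main(String[] args) {
    ToyCatalog catalog = new ToyCatalog();
    // The metadata path below is a made-up placeholder.
    catalog.createTable("webapp", "logs", "s3://warehouse/wh/webapp/logs/metadata/v1.json");
    System.out.println(catalog.listTables("webapp")); // [webapp.logs]
    catalog.dropTable("webapp", "logs");
    System.out.println(catalog.listTables("webapp")); // []
  }
}
```

A real catalog implementation additionally has to persist this mapping and update it atomically, which is where the different backends (REST, Hive, Glue, DynamoDB) come in.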

For this walkthrough, we will use the RestCatalog that comes with Iceberg. Let’s get started!

Loading a Catalog

To load a catalog, you first have to construct a properties map to configure it. The properties required vary depending on the type of catalog you’re using. We’re using a REST catalog where we just have to point to the service.

Two properties that most catalogs need are the warehouse location and the FileIO implementation. For storage, we’ll use a MinIO container, which is S3-compatible.

Note: To learn more about the file-io abstraction in Iceberg, check out one of our earlier blog posts that provides an excellent overview: Iceberg FileIO: Cloud Native Tables.

Let’s go ahead and generate a map of catalog properties to configure our RestCatalog.

import java.util.HashMap;
import java.util.Map;

import org.apache.iceberg.catalog.Catalog;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.rest.RESTCatalog;
import org.apache.iceberg.aws.AwsProperties;

Map<String, String> properties = new HashMap<>();
properties.put(CatalogProperties.CATALOG_IMPL, "org.apache.iceberg.rest.RESTCatalog");
properties.put(CatalogProperties.URI, "http://rest:8181");
properties.put(CatalogProperties.WAREHOUSE_LOCATION, "s3a://warehouse/wh");
properties.put(CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.aws.s3.S3FileIO");
properties.put(AwsProperties.S3FILEIO_ENDPOINT, "http://minio:9000");
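
The constants used above are plain string keys, so the same configuration can be expressed wherever only strings are available, such as a config file. The sketch below assumes the key names these constants resolved to in Iceberg releases around the time of writing (catalog-impl, uri, warehouse, io-impl, s3.endpoint); the class and method names are hypothetical, and the keys should be checked against the Iceberg version you run.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; it depends only on the JDK.
class CatalogPropertiesSketch {
  // Builds the same configuration as the snippet above, using literal keys.
  // Assumption: these strings match the CatalogProperties and AwsProperties
  // constants in your Iceberg version.
  static Map<String, String> restCatalogProperties() {
    Map<String, String> properties = new LinkedHashMap<>();
    properties.put("catalog-impl", "org.apache.iceberg.rest.RESTCatalog");
    properties.put("uri", "http://rest:8181");
    properties.put("warehouse", "s3a://warehouse/wh");
    properties.put("io-impl", "org.apache.iceberg.aws.s3.S3FileIO");
    properties.put("s3.endpoint", "http://minio:9000");
    return properties;
  }

  public static void main(String[] args) {
    // Print each key/value pair in insertion order.
    restCatalogProperties().forEach((key, value) -> System.out.println(key + " = " + value));
  }
}
```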

Next, initialize the catalog, setting a name for it and passing it the properties map containing our configuration.

RESTCatalog catalog = new RESTCatalog();
Configuration conf = new Configuration();
catalog.setConf(conf);
catalog.initialize("demo", properties);

That’s it! We now have a catalog instance that includes operations such as listing, creating, renaming, and dropping tables.

Defining a Schema and a Partition Spec

In the next section, we’ll create a table, but first, we must define the table’s schema. Let’s create a simple schema with four columns: level, event_time, message, and call_stack.

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

Schema schema = new Schema(
    Types.NestedField.required(1, "level", Types.StringType.get()),
    Types.NestedField.required(2, "event_time", Types.TimestampType.withZone()),
    Types.NestedField.required(3, "message", Types.StringType.get()),
    Types.NestedField.optional(4, "call_stack", Types.ListType.ofRequired(5, Types.StringType.get()))
);

Additionally, let’s build a partition spec that defines an hourly partition on the event_time column.

import org.apache.iceberg.PartitionSpec;

PartitionSpec spec = PartitionSpec.builderFor(schema)
      .hour("event_time")
      .build();
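
To see what an hourly partition means for the data, recall that the Iceberg table spec defines the hour transform as the number of hours from 1970-01-01 00:00:00 UTC. The following sketch reproduces that arithmetic with the JDK only. The class HourPartitionSketch and the method hourBucket are hypothetical names, and this is an illustration of the idea rather than Iceberg’s own implementation.

```java
import java.time.Instant;

// Hypothetical helper for illustration; it depends only on the JDK.
class HourPartitionSketch {
  private static final long SECONDS_PER_HOUR = 3600L;

  // Hours elapsed since 1970-01-01T00:00:00Z.
  // floorDiv rounds toward negative infinity, so pre-epoch timestamps
  // still land in the hour that contains them.
  static long hourBucket(Instant eventTime) {
    return Math.floorDiv(eventTime.getEpochSecond(), SECONDS_PER_HOUR);
  }

  public static void main(String[] args) {
    Instant a = Instant.parse("2022-04-01T10:05:00Z");
    Instant b = Instant.parse("2022-04-01T10:55:00Z");
    Instant c = Instant.parse("2022-04-01T11:00:00Z");

    // Events within the same hour share a partition value.
    System.out.println(hourBucket(a) == hourBucket(b)); // true
    // The next hour is the next partition value.
    System.out.println(hourBucket(c) - hourBucket(a)); // 1
  }
}
```

Rows whose event_time falls in the same hour therefore end up in the same partition, which lets a time-range filter skip every other hour’s files.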

Creating a Table

Using our schema and partition spec, we can now create our table. We’ll define a “webapp” namespace and a table identifier within that namespace.

import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;

Namespace namespace = Namespace.of("webapp");
TableIdentifier name = TableIdentifier.of(namespace, "logs");

Now, let’s create our table!

catalog.createTable(name, schema, spec);

If we call the listTables method on our catalog, we can see our newly created table in the list.

import java.util.List;

List<TableIdentifier> tables = catalog.listTables(namespace);
System.out.println(tables);

output:

[webapp.logs]
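
The printed form is the namespace levels and the table name joined with dots. As a rough sketch of that convention (the class IdentifierSketch and its methods are hypothetical and use only the JDK, not Iceberg’s TableIdentifier), rendering and splitting such a name could look like this:

```java
import java.util.Arrays;

// Hypothetical helper for illustration; this is NOT Iceberg's TableIdentifier.
class IdentifierSketch {
  // Joins namespace levels and the table name with dots,
  // e.g. ["webapp"] and "logs" become "webapp.logs".
  static String render(String[] namespaceLevels, String table) {
    if (namespaceLevels.length == 0) {
      return table;
    }
    return String.join(".", namespaceLevels) + "." + table;
  }

  // Everything before the last dot-separated part is treated as namespace levels.
  static String[] namespaceOf(String identifier) {
    String[] parts = identifier.split("\\.");
    return Arrays.copyOfRange(parts, 0, parts.length - 1);
  }

  // The last dot-separated part is treated as the table name.
  static String tableOf(String identifier) {
    String[] parts = identifier.split("\\.");
    return parts[parts.length - 1];
  }

  public static void main(String[] args) {
    System.out.println(render(new String[] {"webapp"}, "logs")); // webapp.logs
    System.out.println(tableOf("webapp.logs")); // logs
  }
}
```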

Dropping a Table

As you would expect, the Catalog interface also includes a method for dropping tables. Let’s use the same table identifier object to drop the table we created in the previous section.

catalog.dropTable(name);

What’s Next

If you enjoyed this post, head over to Part 2 of the series, which covers the core Java API that query engines commonly use to perform table scans. The same API can also be used to develop applications that need to interact with Iceberg’s core internals. Also, if you’d like to be a part of the growing Iceberg community or just want to stop in and say hello, check out our community page to learn where to find us!
