Hive SNAPSHOT

The SNAPSHOT procedure provides the ability to create a temporary Apache Iceberg copy of an Apache Hive table with the same underlying data. The procedure scans the Hive table to construct Iceberg metadata and creates an Iceberg table referencing the existing data files. No data is copied during this operation, which makes it ideal for testing out the migration process and experimenting with Iceberg.

It is important to note that the data files are still owned by the Hive table. Deletes and table maintenance procedures such as expire_snapshots should not be run on the Iceberg table, and they will have no effect if they are. In addition, if data files are deleted from the original Hive table, reads of the Iceberg table might fail because Iceberg does not skip missing data files. Conversely, changes to the Iceberg table will not affect the Hive table.

This procedure requires connecting to a Hive Metastore. For the Apache Spark configuration and instructions for running a metastore locally, see this chapter’s background section on Connecting to a Hive metastore.

Creating an example table

The snapshot procedure operates on an existing Hive table with Parquet, ORC, or Avro data files. To test it, create a Hive table that uses Parquet as its storage format:

CREATE DATABASE IF NOT EXISTS cookbook;
USE cookbook;

CREATE TABLE hive1 (s string) USING PARQUET;
-- Time taken: 0.091 seconds

-- Insert a row to create a data file
INSERT INTO hive1 VALUES ('hive data');
-- Time taken: 0.621 seconds

Running the snapshot procedure

There is now one Parquet data file containing the value "hive data", which we will use to create a temporary Iceberg table.

CALL system.snapshot('cookbook.hive1', 'cookbook.iceberg1');
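
The call above passes its arguments by position. As a sketch, the same procedure can also be called with named arguments, which leaves room for an optional location for the new table. The argument names source_table, table, and location are assumed here to match Iceberg's Spark procedure documentation, and the location path is a hypothetical example. This is an alternative to the call above, not an additional step.

```sql
-- Alternative form of the same call, using named arguments.
-- Run this instead of the positional call, not in addition to it.
-- The location path below is a hypothetical example.
CALL system.snapshot(
  source_table => 'cookbook.hive1',
  table => 'cookbook.iceberg1',
  location => 'file:///tmp/warehouse/iceberg1_snapshot'
);
```

Named arguments make the direction of the operation explicit, which helps avoid swapping the source and target table names.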

Now that the Iceberg table is created, you can inspect the two tables to see that they point to the same file:

SELECT input_file_name() FROM hive1;
-- file:///tmp/warehouse/hive1/part-00000-5077ce77-fa1d-46ee-ae99-0158c11fe3b6-c000.snappy.parquet

SELECT input_file_name() FROM iceberg1;
-- file:/tmp/warehouse/hive1/part-00000-5077ce77-fa1d-46ee-ae99-0158c11fe3b6-c000.snappy.parquet

SELECT * FROM iceberg1;
-- hive data
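
Another way to confirm that no data was copied is to look at the files tracked in the Iceberg table's metadata. A minimal sketch, assuming the files metadata table can be addressed as cookbook.iceberg1.files in this catalog setup (some configurations may need a catalog prefix):

```sql
-- List the data files tracked by the Iceberg table's current snapshot.
-- file_path is expected to point into the Hive table's directory,
-- not to a copy of the data.
SELECT file_path, file_format, record_count
FROM cookbook.iceberg1.files;
```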

The newly created Iceberg table is now ready for experiments and testing. Remember, changes to the Iceberg table will not be reflected in the original Hive table.
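
To see this isolation in practice, a short experiment along these lines can help. It is a sketch that reuses the hive1 and iceberg1 tables from above; the inserted value is arbitrary.

```sql
-- Write a row through the Iceberg table only
INSERT INTO iceberg1 VALUES ('iceberg data');

-- The Iceberg table should now return both rows
SELECT * FROM iceberg1;

-- The Hive table should still return only the original 'hive data' row
SELECT * FROM hive1;
```

The new row is written as a separate data file that only the Iceberg table tracks, so the Hive table's contents do not change.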

When experimentation is complete, drop the Iceberg table to clean up:

DROP TABLE iceberg1;