MIGRATING TO ICEBERG
The SNAPSHOT
procedure provides the ability to create a temporary Apache Iceberg copy of an Apache Hive table with the same underlying data. The procedure scans the Hive table to construct Iceberg metadata and creates an Iceberg table referencing the existing data files. No data is copied during this operation, which makes it ideal for testing out the migration process and experimenting with Iceberg.
It is important to note that the data is still owned by the Hive table. Deletes and table maintenance procedures such as expire_snapshots
should not be run on the Iceberg table and will not have an effect if they are. In addition, if data files are deleted from the original Hive table, reading the Iceberg table might fail because Iceberg will not ignore missing data files. Similarly, changes to the Iceberg table will not affect the Hive table.
This procedure requires connecting to a Hive Metastore. For the Apache Spark configuration and instructions for running a metastore locally, see this chapter’s background section on Connecting to a Hive metastore.
Creating an example table
The snapshot procedure operates on an existing Hive table with Parquet, ORC, or Avro data files. To test it, create a Hive table using Parquet as the storage file format
CREATE DATABASE IF NOT EXISTS cookbook
USE cookbook;
CREATE TABLE hive1 (s string) USING PARQUET
-- Time taken: 0.091 seconds
-- Insert a row to create a data file
INSERT INTO hive1 values ('hive data')
-- Time taken: 0.621 seconds
Running the snapshot procedure
There is now one parquet data file with the value “hive data"
, which we will use to create a temporary Iceberg table.
CALL system.snapshot('cookbook.hive1', 'cookbook.iceberg1')
Now that the iceberg table is created, you can inspect the two tables to see that they point to the same file:
SELECT input_file_name() FROM hive1
-- file:///tmp/warehouse/hive1/part-00000-5077ce77-fa1d-46ee-ae99-0158c11fe3b6-c000.snappy.parquet
SELECT input_file_name() FROM iceberg1
-- file:/tmp/warehouse/hive1/part-00000-5077ce77-fa1d-46ee-ae99-0158c11fe3b6-c000.snappy.parquet
SELECT * FROM iceberg1
-- hive data
The newly created Iceberg table is now ready for experiments and testing. Remember, changes to the Iceberg table will not be reflected in the original Hive table.
When experimentation is complete, drop the Iceberg table to clean up:
DROP TABLE iceberg1