
In this blog, we’ll show you how to take the first step in Tabular: creating a table and writing data to it. If you don’t already have a Tabular account, you can sign up for free. You’ll also need Spark; if you don’t have it installed, there are instructions in the connect workflow.

Getting started

If you haven’t already, download the MovieLens data and unzip it into a directory. In this example, I unzipped it into ml-25m under my Data directory:

shawngordon@Shawns-Air ml-25m % pwd
/Users/shawngordon/Downloads/Data/ml-25m
shawngordon@Shawns-Air ml-25m % ls -la
total 2283344
drwxrwxr-x@ 9 shawngordon  staff        288 Nov 21  2019 .
drwxr-xr-x@ 3 shawngordon  staff         96 Mar 14 06:24 ..
-rw-rw-r--@ 1 shawngordon  staff      10460 Nov 21  2019 README.txt
-rw-rw-r--@ 1 shawngordon  staff  435164157 Nov 21  2019 genome-scores.csv
-rw-rw-r--@ 1 shawngordon  staff      18103 Nov 21  2019 genome-tags.csv
-rw-rw-r--@ 1 shawngordon  staff    1368578 Nov 21  2019 links.csv
-rw-rw-r--@ 1 shawngordon  staff    3038099 Nov 21  2019 movies.csv
-rw-rw-r--@ 1 shawngordon  staff  678260987 Nov 21  2019 ratings.csv
-rw-rw-r--@ 1 shawngordon  staff   38810332 Nov 21  2019 tags.csv
shawngordon@Shawns-Air ml-25m %

Click on the “Connect Apache Spark…” menu item to get set up in a Spark environment. We will work with the included default database.

In a terminal window, launch Spark by pasting the command we just copied:

bin/spark-sql  \
  --repositories https://tabular-repository-public.s3.amazonaws.com/releases/  \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0,io.tabular:tabular-client-runtime:1.2.6 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.defaultCatalog=personal \
  --conf spark.sql.catalog.personal=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.personal.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
  --conf spark.sql.catalog.personal.uri=https://api.tabular.io/ws/ \
  --conf spark.sql.catalog.personal.credential=<tabular credential> \
  --conf spark.sql.catalog.personal.warehouse=personal
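
Once the shell comes up, a quick way to confirm the catalog connection is to list what’s available. The exact namespaces and tables you see will depend on your account:

-- List namespaces and tables in the connected warehouse
SHOW NAMESPACES;
SHOW TABLES IN default;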

Working with the data

We will start by creating a temporary view on top of the movies.csv file so that we can load it into our Iceberg table. Modify the path to match the location of your data files.

CREATE TEMPORARY VIEW movies (
   movieId bigint,
   title string,
   genres string)
USING csv
OPTIONS (
   path '/Users/shawngordon/Downloads/Data/ml-25m/movies.csv',
   header true
);
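
As a quick sanity check, you can count the rows the view reads from the CSV; the total should match the number of data rows in your movies.csv:

-- Sanity-check the temporary view before loading the table
SELECT count(*) FROM movies;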

Create and load the Iceberg table by selecting all of the data from the movies view:

CREATE TABLE default.movies AS SELECT * from movies;

Oops! We wanted snake_case column names to accommodate engines like Hive, but we used the column header from the CSV file, movieId, so we want to change it. Iceberg makes schema evolution easy; we just issue the command:

ALTER TABLE default.movies RENAME COLUMN movieId TO movie_id;
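
To confirm the rename took effect, you can describe the table; the output should now show movie_id in place of movieId:

-- Verify the schema after the rename
DESCRIBE TABLE default.movies;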

Back in Tabular, we can see that the table has been created and includes our updated column name.

Let’s run a quick SELECT to see the results:

spark-sql> select * from default.movies limit 10;
1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
2	Jumanji (1995)	Adventure|Children|Fantasy
3	Grumpier Old Men (1995)	Comedy|Romance
4	Waiting to Exhale (1995)	Comedy|Drama|Romance
5	Father of the Bride Part II (1995)	Comedy
6	Heat (1995)	Action|Crime|Thriller
7	Sabrina (1995)	Comedy|Romance
8	Tom and Huck (1995)	Adventure|Children
9	Sudden Death (1995)	Action
10	GoldenEye (1995)	Action|Adventure|Thriller
Time taken: 0.596 seconds, Fetched 10 row(s)

Now we will load the ratings.csv file. This one has a bit of a trick: the rating time is stored in the CSV as epoch seconds, so we’ll convert it into a proper timestamp column and partition the table by day:

CREATE TEMPORARY VIEW ratings (
   userId bigint,
   movieId bigint,
   rating float,
   timestamp bigint)
USING csv         
OPTIONS (
   path '/Users/shawngordon/Downloads/Data/ml-25m/ratings.csv',
   header true
);
CREATE TABLE default.ratings
PARTITIONED BY (days(rated_at)) AS
SELECT
    userId as user_id,
    movieId as movie_id,
    rating,
    timestamp_seconds(ratings.timestamp) as rated_at
FROM ratings;
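
To see what the conversion produces, timestamp_seconds turns epoch seconds into a timestamp in the session time zone. The epoch value below is just an illustrative example:

-- Illustrative only; the result depends on spark.sql.session.timeZone
SELECT timestamp_seconds(1147880044) AS rated_at;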

For our final query, we’ll join the movies and ratings tables to find the ten highest-rated movies with more than 100 ratings. Note that we qualify the table names with default so that Spark resolves our Iceberg tables rather than the temporary views of the same names:

SELECT
    m.movie_id,
    title,
    avg(rating) as avg_rating,
    count(1) as rating_count
FROM default.movies m
JOIN default.ratings r ON
    m.movie_id = r.movie_id
GROUP BY m.movie_id, title
HAVING rating_count > 100
ORDER BY avg_rating DESC
LIMIT 10;
171011	Planet Earth II (2016)	4.483096085409253	1124
159817	Planet Earth (2006)	4.464796794504865	1747
318	Shawshank Redemption, The (1994)	4.413576004516335	81482
170705	Band of Brothers (2001)	4.398598820058997	1356
171495	Cosmos	4.3267148014440435	277
858	Godfather, The (1972)	4.324336165187245	52498
179135	Blue Planet II (2017)	4.289833080424886	659
50	Usual Suspects, The (1995)	4.284353213163313	55366
198185	Twin Peaks (1989)	4.267361111111111	288
1221	Godfather: Part II, The (1974)	4.2617585117585115	34188
Time taken: 18.718 seconds, Fetched 10 row(s)
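
Because we partitioned ratings by day, we can also peek at the partition layout using Iceberg’s standard metadata tables:

-- Inspect the partition layout of the ratings table
SELECT * FROM default.ratings.partitions;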

Securing the data

We’ve created some tables, loaded data, and done some table manipulation, but what about data security? As any Spark user knows, Spark itself provides essentially no access controls, but Tabular gives you granular control. By navigating to the Access controls tab, we can see the privileges that Tabular grants on the table by default.

If we want to limit this role to read-only access, we just drag the slider all the way to the left and apply the change.

Now let’s try:

INSERT INTO default.movies VALUES (0, 'Shawn\'s Home Videos', 'Horror');

Look what happens: Spark returns an error because Tabular has blocked the operation.
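
Reads are unaffected; with the role limited to read access, a query like the following still succeeds while the INSERT above is rejected:

-- Reads still work under the read-only grant
SELECT count(1) FROM default.movies;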

Summary

In just a few steps, you were able to create and load tables in Tabular using Spark and then run various queries on the data. We also looked at how you can lock down your data in ways that haven’t previously been possible with Spark alone. Now that you have an environment loaded, try out different roles, permissions, and credentials to see what’s possible. This brief introduction illustrated some Tabular concepts and how quickly you can get started. Our documentation includes many other examples and tips, as does our YouTube channel.