Connecting to Athena PySpark

GETTING STARTED

Amazon Athena is a managed compute service that allows you to use SQL or PySpark to query data in Amazon S3 or other data sources without having to provision and manage any infrastructure. In this recipe, you’ll learn how to use Athena PySpark to query data in Apache Iceberg tables.

Create a workgroup

From the Athena page of the AWS console, create a new workgroup by following these steps:

  1. Under the Administration section, choose Workgroups.
  2. Click Create Workgroup.
  3. Give your workgroup a name and, optionally, a description.
  4. Under Analytics engine, choose Apache Spark.
  5. Open the IAM configuration section and either choose an existing service role or allow the wizard to create one.
  6. Open the Calculation results settings section and either choose an existing S3 bucket or allow the wizard to create one.
  7. Click the Create Workgroup button.

This will take you to the detail page for your newly created workgroup. At the top of the page there will be a success banner with a Create Notebook button, which you can click to begin the next phase of the journey.

[Screenshot: Amazon Athena workgroup created message]
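
If you prefer to create the workgroup programmatically rather than through the console, the following is a minimal boto3 sketch of the same step. The workgroup name, IAM role ARN, region, and S3 output location are placeholders you would replace with your own values, and the exact configuration options you need may differ.

import boto3

# A hedged sketch of creating a Spark-enabled Athena workgroup with boto3.
# The name, execution role ARN, and output bucket below are placeholders.
athena = boto3.client("athena", region_name="us-east-1")

athena.create_work_group(
    Name="iceberg-pyspark",  # hypothetical workgroup name
    Description="PySpark workgroup for querying Iceberg tables",
    Configuration={
        # Select the Spark engine rather than the default SQL engine
        "EngineVersion": {"SelectedEngineVersion": "PySpark engine version 3"},
        # Service role the workgroup assumes (placeholder ARN)
        "ExecutionRole": "arn:aws:iam::123456789012:role/athena-spark-role",
        # Bucket for calculation results (placeholder bucket)
        "ResultConfiguration": {"OutputLocation": "s3://my-athena-results-bucket/"},
    },
)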

Create and configure a notebook

Now follow these steps to create and configure your PySpark notebook:

  1. Give your notebook a unique name.
  2. For the workgroup, choose the name of the workgroup you just created.
  3. Open the Apache Spark properties section and choose the Edit in JSON tab.
  4. Enter the following JSON configuration in the space provided, replacing “sandbox” with the name of your catalog and replacing the credentials as appropriate. For more information on these properties, check out the Spark configuration recipe.
{
    "spark.sql.catalog.sandbox": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.sandbox.type": "rest",
    "spark.sql.catalog.sandbox.credential": "<rest-credential>",
    "spark.sql.catalog.sandbox.uri": "https://api.dev.tabular.io/ws/",
    "spark.sql.catalog.sandbox.warehouse": "sandbox",
    "spark.sql.defaultCatalog": "sandbox",
    "spark.sql.extensions":  "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
}
  5. Click Create to launch your notebook.

Verify catalog connectivity

Once the notebook is up and running, you can use all your favorite Python packages to work with your data. To quickly verify that everything is configured correctly, run a command like the following to count the rows of an Iceberg table in your catalog.

spark.sql("select count(*) from examples.nyc_taxi_yellow").show()

If this returns a result, then you are all set and ready to start exploring your data with PySpark.
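
The same tables are also available through the DataFrame API. The short sketch below assumes the examples.nyc_taxi_yellow table from the query above exists in your catalog; swap in any table you have access to.

# Load the Iceberg table as a DataFrame and take a quick look at it
df = spark.table("examples.nyc_taxi_yellow")
df.printSchema()    # inspect the table's columns and types
df.limit(5).show()  # preview a handful of rows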