March 22, 2024

How to Create Iceberg Tables in Snowflake

By Rangaswamy M R

Snowflake-managed Iceberg tables deliver performance on par with Snowflake native tables while storing the data in public cloud storage. They are ideal when the data already lives in a data lake and you do not intend to load it into Snowflake, but still need Snowflake's features and performance.

In this blog, we will review the steps to create Snowflake-managed Iceberg tables with AWS S3 as external storage and read them from a Spark or Databricks environment.

To learn more about Iceberg tables in Snowflake, read our article:

What are Iceberg Tables in Snowflake and when to use them?

What are Iceberg Tables in Snowflake?

Iceberg tables in the Snowflake Data Cloud are a newer type of table where the actual data is stored outside Snowflake in a public cloud object storage location (Amazon S3, Google Cloud Storage, or Azure Storage) in the Apache Iceberg table format.

Snowflake accesses this data using two new objects, an external volume and a catalog integration, and uses its native query semantics along with the Iceberg specification and libraries to read data from and write data to the cloud object storage.

Snowflake supports two types of Iceberg tables, depending on where the catalog is managed:

  1. Snowflake-Managed Iceberg Tables – Snowflake manages the metadata and catalog for these tables. They support all Snowflake features with read and write access.

  2. Externally Managed Iceberg Tables – An external system, such as AWS Glue, manages the metadata and catalog. These tables support read-only access from Snowflake.
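
For reference, an externally managed table is defined with a catalog integration instead of the Snowflake catalog. Below is a minimal sketch assuming an AWS Glue catalog; the integration name, Glue database, account ID, role ARN, and table names are placeholders, and the rest of this post focuses on Snowflake-managed tables.

-- Minimal sketch of an externally managed Iceberg table (AWS Glue catalog).
-- All names, IDs, and ARNs below are placeholders.
CREATE CATALOG INTEGRATION glue_catalog_int
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'my_glue_database'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-role'
  GLUE_CATALOG_ID = '123456789012'
  ENABLED = TRUE;

-- Read-only Iceberg table whose metadata is managed in the Glue catalog
CREATE ICEBERG TABLE ext_customers
  EXTERNAL_VOLUME = 'aws_s3_iceberg_vol'
  CATALOG = 'glue_catalog_int'
  CATALOG_TABLE_NAME = 'customers';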

How to Create Iceberg Tables in Snowflake and Access Them From Spark/Databricks

Follow these steps to create Iceberg tables in Snowflake:

  1. Create user, role, and namespace

  2. Create external volume in Snowflake

  3. Create Iceberg tables in Snowflake

  4. Access Iceberg tables from Spark/Databricks

Step 1: Create User, Role and Namespace

Create the users, roles, and namespace required to create Iceberg tables.

-- Create warehouse, database, and schema used to create Iceberg tables
USE ROLE SYSADMIN;
CREATE WAREHOUSE iceberg_wh;
CREATE DATABASE iceberg_db;
CREATE SCHEMA iceberg;

USE ROLE USERADMIN;
CREATE ROLE iceberg_role;

USE ROLE SECURITYADMIN;
GRANT ALL ON DATABASE iceberg_db TO ROLE iceberg_role;
GRANT ALL ON SCHEMA iceberg_db.iceberg TO ROLE iceberg_role;
GRANT ALL ON WAREHOUSE iceberg_wh TO ROLE iceberg_role;

-- Create user iceberg_user with iceberg_role
USE ROLE USERADMIN;
CREATE OR REPLACE USER iceberg_user
    PASSWORD='********',
    LOGIN_NAME='iceberg_user',
    MUST_CHANGE_PASSWORD=FALSE,
    DISABLED=FALSE,
    DEFAULT_WAREHOUSE='ICEBERG_WH',
    DEFAULT_NAMESPACE='ICEBERG_DB.ICEBERG',
    DEFAULT_ROLE='ICEBERG_ROLE';

-- Grant the role to the user (DEFAULT_ROLE alone does not grant it)
GRANT ROLE iceberg_role TO USER iceberg_user;

Step 2: Create External Volume in Snowflake

  a. As a prerequisite, create an AWS IAM role (snowflake-role-for-iceberg) with the below permissions for the S3 bucket where the data will be stored (a sample permissions policy is shown after the list).

    1. s3:DeleteObject

    2. s3:DeleteObjectVersion

    3. s3:GetBucketLocation

    4. s3:GetObject

    5. s3:GetObjectVersion

    6. s3:ListBucket

    7. s3:PutObject
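
For illustration, a sample permissions policy for this role might look like the following. This is a minimal sketch; the bucket name iceberg-tables is an assumption chosen to match the STORAGE_BASE_URL used in the next step.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion"
      ],
      "Resource": "arn:aws:s3:::iceberg-tables/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::iceberg-tables"
    }
  ]
}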

  b. Create the external volume in Snowflake using SQL.

-- Create the external volume using the ACCOUNTADMIN role
USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE EXTERNAL VOLUME aws_s3_iceberg_vol
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'aws-s3-us-east-1'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://iceberg-tables/'
            STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::01234534343:role/snowflake-role-for-iceberg'
            ENCRYPTION=(TYPE='AWS_SSE_S3')
         )
      );

  c. Update the trust relationship of the IAM role created in step a with the correct external ID and principal.

    i. Execute the following command in Snowflake to get the STORAGE_AWS_EXTERNAL_ID and STORAGE_AWS_IAM_USER_ARN of the external volume created in step 2.b.

DESC EXTERNAL VOLUME aws_s3_iceberg_vol;

    ii. In the AWS console, update the trust relationship of the IAM role with the policy shown below, replacing <snowflake_user_arn> with the STORAGE_AWS_IAM_USER_ARN and <snowflake_external_id> with the STORAGE_AWS_EXTERNAL_ID recorded in step 2.c.i.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<snowflake_user_arn>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<snowflake_external_id>"
        }
      }
    }
  ]
}

  d. Grant usage on the external volume to the role used to create Iceberg tables.

USE ROLE SECURITYADMIN;
GRANT USAGE ON EXTERNAL VOLUME aws_s3_iceberg_vol TO ROLE iceberg_role;

Step 3: Create Iceberg Tables in Snowflake

  1. Create an Iceberg table named IB_CUSTOMERS in Snowflake and insert data from the Snowflake sample datasets.

-- Create Snowflake Iceberg table IB_CUSTOMERS
USE ROLE iceberg_role;
USE WAREHOUSE iceberg_wh;
USE DATABASE iceberg_db;
USE SCHEMA iceberg;

CREATE OR REPLACE ICEBERG TABLE ib_customers (
    c_custkey INTEGER,
    c_name STRING,
    c_address STRING,
    c_nationkey INTEGER,
    c_phone STRING,
    c_acctbal INTEGER,
    c_mktsegment STRING,
    c_comment STRING
)
    CATALOG='SNOWFLAKE'
    EXTERNAL_VOLUME='aws_s3_iceberg_vol'
    BASE_LOCATION='ib_customers';

-- Insert 10K records from the Snowflake sample data
INSERT INTO ib_customers
  SELECT * FROM snowflake_sample_data.tpch_sf1.customer WHERE c_custkey BETWEEN 60001 AND 70000;

  2. Validate the metadata and data files created in the S3 bucket, as shown below.
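
For example, the Snowflake system function SYSTEM$GET_ICEBERG_TABLE_INFORMATION returns the location of the table's current metadata file, which sits under the BASE_LOCATION folder in the bucket next to the data files; this is a quick way to confirm where Snowflake is writing.

-- Returns JSON that includes the current metadata file location, e.g.
-- s3://iceberg-tables/ib_customers/metadata/<version>.metadata.json
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('iceberg_db.iceberg.ib_customers');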

Step 4a: Access Snowflake Iceberg Tables From the PySpark Console

  1. Create environment variables with the required parameters.

import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ['SNOWFLAKE_CATALOG_URI'] = "jdbc:snowflake://eressas-vkb12345.snowflakecomputing.com"
os.environ['SNOWFLAKE_ROLE'] = "ICEBERG_ROLE"
os.environ['SNOWFLAKE_USERNAME'] = "ICEBERG_USER"
os.environ['SNOWFLAKE_PASSWORD'] = "*********"
os.environ['PACKAGES'] = "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,net.snowflake:snowflake-jdbc:3.14.2,software.amazon.awssdk:bundle:2.20.160,software.amazon.awssdk:url-connection-client:2.20.160"
os.environ['AWS_REGION'] = "us-east-1"
os.environ['AWS_ACCESS_KEY_ID'] = "**********"
os.environ['AWS_SECRET_ACCESS_KEY'] = "**********"

  2. Create a Spark session with the required packages.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('iceberg_lab')\
    .config('spark.jars.packages', os.environ['PACKAGES'])\
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')\
    .getOrCreate()

  3. Set the Spark configurations for the Snowflake catalog and S3 access.

spark.conf.set("spark.sql.defaultCatalog", "snowflake_catalog")
spark.conf.set("spark.sql.catalog.snowflake_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.snowflake_catalog.catalog-impl", "org.apache.iceberg.snowflake.SnowflakeCatalog")
spark.conf.set("spark.sql.catalog.snowflake_catalog.uri", os.environ['SNOWFLAKE_CATALOG_URI'])
spark.conf.set("spark.sql.catalog.snowflake_catalog.jdbc.role", os.environ['SNOWFLAKE_ROLE'])
spark.conf.set("spark.sql.catalog.snowflake_catalog.jdbc.user", os.environ['SNOWFLAKE_USERNAME'])
spark.conf.set("spark.sql.catalog.snowflake_catalog.jdbc.password", os.environ['SNOWFLAKE_PASSWORD'])
spark.conf.set("spark.sql.iceberg.vectorization.enabled", "false")
spark.conf.set("spark.sql.catalog.snowflake_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
spark.conf.set("spark.hadoop.fs.s3a.access.key", os.environ['AWS_ACCESS_KEY_ID'])
spark.conf.set("spark.hadoop.fs.s3a.secret.key", os.environ['AWS_SECRET_ACCESS_KEY'])
spark.conf.set("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
spark.conf.set("spark.hadoop.fs.s3a.endpoint.region", os.environ['AWS_REGION'])

  4. Access the Iceberg tables using Spark and Spark SQL.

# read with spark.sql
df = spark.sql("select * from iceberg_db.iceberg.ib_customers")
df.show()

# read with spark.table
df = spark.table("iceberg_db.iceberg.ib_customers")
df.show()

Step 4b: Access Iceberg Tables From Databricks

To access Snowflake Iceberg tables from Databricks, add the same packages and configurations listed in Step 4a using the Databricks UI.

  1. Edit the compute cluster and add the Spark configurations and environment variables listed below.

spark.sql.catalog.snowflake_catalog.uri jdbc:snowflake://xxxxxx-xxxxxxx.snowflakecomputing.com
spark.sql.catalog.snowflake_catalog org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.snowflake_catalog.jdbc.password *******
spark.databricks.cluster.profile singleNode
spark.sql.catalog.snowflake_catalog.jdbc.user ICEBERG_USER
spark.sql.iceberg.vectorization.enabled false
spark.sql.catalog.snowflake_catalog.catalog-impl org.apache.iceberg.snowflake.SnowflakeCatalog
spark.sql.catalog.snowflake_catalog.io-impl org.apache.iceberg.aws.s3.S3FileIO
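
For the environment variables, here is a minimal sketch mirroring Step 4a, assuming static AWS keys are used for S3 access; the values are placeholders, and clusters that authenticate to S3 with an instance profile would not need them.

AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=**********
AWS_SECRET_ACCESS_KEY=**********
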
  2. Update the catalog and run SQL queries in the Databricks notebook.

%sql
USE CATALOG snowflake_catalog;
SHOW NAMESPACES;

%sql
select * from iceberg_db.iceberg.ib_customers;

df_cust = spark.table("iceberg_db.iceberg.ib_customers")
df_cust.show()

Closing

The steps to create Snowflake-managed Iceberg tables are quite similar to those for creating an external table, with the external volume being the new prerequisite object. We use Snowflake syntax in Snowflake, and Spark or Spark SQL syntax in the Spark/Databricks environment, to create and access the Iceberg tables.

At this point, we cannot modify the data of Snowflake-managed Iceberg tables from Spark, and time-travel queries are not supported through the Snowflake catalog in Spark. However, as a workaround, we can read data from a specific table snapshot by pointing Spark directly at a metadata file, as shown below.

spark.read.format("iceberg").load("s3a://bucket_name/ib_customers/metadata/v1704352930730000000.metadata.json").show()

If your organization is interested in creating Iceberg tables in Snowflake, the Snowflake experts at phData can help! Contact us today for advice, best practices, actionable strategies, and more!

FAQs

How are Iceberg tables different from external tables in Snowflake?

While external tables are read-only tables created to access data files stored outside Snowflake, Iceberg tables are read/write-capable tables that store data in public cloud storage using an open table format. Iceberg tables support features like time travel, schema evolution, hidden partitioning, and multiple query engines.
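
For instance, here is a minimal sketch of a Snowflake Time Travel query against the Iceberg table created earlier, assuming the query falls within the table's data retention period.

-- Query the Iceberg table as it existed five minutes ago
SELECT COUNT(*) FROM iceberg_db.iceberg.ib_customers AT(OFFSET => -60*5);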
