Snowflake-managed Iceberg tables deliver performance on par with Snowflake native tables while storing the data in public cloud storage. They are ideal for situations where the data already lives in a data lake and you do not intend to load it into Snowflake, but you still need Snowflake's features and performance.
In this blog, we will review the steps to create Snowflake-managed Iceberg tables with AWS S3 as external storage and read them from a Spark or Databricks environment.
To learn more about Iceberg tables in Snowflake, read our article:
What are Iceberg Tables in Snowflake?
Iceberg tables in the Snowflake Data Cloud are a new type of table whose data is stored outside Snowflake, in a public cloud object storage location (Amazon S3, Google Cloud Storage, or Azure Storage), in the Apache Iceberg table format.
Snowflake accesses this data using new objects called external volumes and catalog integrations, and it uses its native query semantics together with the Iceberg specification and libraries to read data from and write data to the cloud object storage.
Snowflake supports two types of Iceberg tables, depending on where the catalog is managed:
Snowflake Managed Iceberg Tables – Snowflake manages the metadata and catalog for these tables. These tables can support all Snowflake features with read and write access.
Externally Managed Iceberg Tables – An external system, such as AWS Glue, manages the metadata and catalog. These tables support read-only access from Snowflake.
How to Create Iceberg Tables in Snowflake From Spark/Databricks Integration
Follow the steps below to create Iceberg tables in Snowflake:
Create users, roles, and namespace
Create external volume in Snowflake
Create Iceberg tables in Snowflake
Access Iceberg tables from Spark/Databricks
Step 1: Create User, Role and Namespace
Create the users, roles, and namespace required to create Iceberg tables.
-- Create warehouse, database, and schema used to create iceberg tables
USE ROLE SYSADMIN;
CREATE WAREHOUSE iceberg_wh;
CREATE DATABASE iceberg_db;
CREATE SCHEMA iceberg;
USE ROLE USERADMIN;
CREATE ROLE iceberg_role;
USE ROLE SECURITYADMIN;
GRANT ALL ON DATABASE iceberg_db TO ROLE iceberg_role;
GRANT ALL ON SCHEMA iceberg_db.iceberg TO ROLE iceberg_role;
GRANT ALL ON WAREHOUSE iceberg_wh TO ROLE iceberg_role;
-- Create user iceberg_user with iceberg_role
USE ROLE USERADMIN;
CREATE OR REPLACE USER iceberg_user
PASSWORD='********',
LOGIN_NAME='iceberg_user',
MUST_CHANGE_PASSWORD=FALSE,
DISABLED=FALSE,
DEFAULT_WAREHOUSE='ICEBERG_WH',
DEFAULT_NAMESPACE='ICEBERG_DB.ICEBERG',
DEFAULT_ROLE='ICEBERG_ROLE';
-- Grant the role to the new user so the default role can actually be assumed at login
GRANT ROLE iceberg_role TO USER iceberg_user;
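As a quick check that the new user and role work together, you can connect from Python with the Snowflake connector. This is a minimal sketch, assuming the snowflake-connector-python package is installed and the account identifier placeholder is replaced with your own.
# Minimal sketch: verify that iceberg_user can authenticate and assume iceberg_role.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",   # placeholder, e.g. orgname-accountname
    user="ICEBERG_USER",
    password="********",
    role="ICEBERG_ROLE",
    warehouse="ICEBERG_WH",
    database="ICEBERG_DB",
    schema="ICEBERG",
)

cur = conn.cursor()
cur.execute("SELECT CURRENT_USER(), CURRENT_ROLE(), CURRENT_WAREHOUSE()")
print(cur.fetchone())
cur.close()
conn.close()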
Step 2: Create External Volume in Snowflake
As a prerequisite, create an AWS IAM role (snowflake-role-for-iceberg) with the following permissions on the S3 bucket where the data will be stored:
s3:DeleteObject
s3:DeleteObjectVersion
s3:GetBucketLocation
s3:GetObject
s3:GetObjectVersion
s3:ListBucket
s3:PutObject
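For readers who prefer to script this prerequisite, a rough boto3 sketch of the same IAM setup is shown below. The temporary trust policy and the inline policy name are assumptions; the trust relationship is tightened later, once the external volume exists and Snowflake's principal and external ID are known.
# Rough sketch: create the IAM role and attach the S3 permissions listed above.
# The trust policy here is a temporary placeholder and is replaced in Step 2.
import json
import boto3

iam = boto3.client("iam")
account_id = boto3.client("sts").get_caller_identity()["Account"]

temporary_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="snowflake-role-for-iceberg",
    AssumeRolePolicyDocument=json.dumps(temporary_trust_policy),
)

s3_permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject",
                "s3:DeleteObjectVersion",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:PutObject",
            ],
            "Resource": "arn:aws:s3:::iceberg-tables/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
            "Resource": "arn:aws:s3:::iceberg-tables",
        },
    ],
}

iam.put_role_policy(
    RoleName="snowflake-role-for-iceberg",
    PolicyName="snowflake-iceberg-s3-access",
    PolicyDocument=json.dumps(s3_permissions_policy),
)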
Create the external volume in Snowflake using SQL.
-- create external volume using ACCOUNTADMIN role
USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE EXTERNAL VOLUME aws_s3_iceberg_vol
STORAGE_LOCATIONS =
(
(
NAME = 'aws-s3-us-east-1'
STORAGE_PROVIDER = 'S3'
STORAGE_BASE_URL = 's3://iceberg-tables/'
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::01234534343:role/snowflake-role-for-iceberg'
ENCRYPTION=(TYPE='AWS_SSE_S3')
)
);
Update the trust relationship of the IAM role created in the prerequisite above with the correct external ID and principal.
Execute the following command in Snowflake to get the STORAGE_AWS_EXTERNAL_ID and STORAGE_AWS_IAM_USER_ARN of the external volume you just created.
DESC EXTERNAL VOLUME aws_s3_iceberg_vol;
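These two values can also be pulled out programmatically. Below is a minimal sketch with the Snowflake Python connector (the account and admin user are placeholders); it scans the DESC output for the JSON-formatted storage location properties rather than assuming an exact row layout.
# Minimal sketch: extract STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID
# from the DESC EXTERNAL VOLUME output.
import json
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",  # placeholder
    user="<admin_user>",             # placeholder: a user with access to the volume
    password="********",
    role="ACCOUNTADMIN",
)

cur = conn.cursor()
cur.execute("DESC EXTERNAL VOLUME aws_s3_iceberg_vol")
for row in cur.fetchall():
    for value in row:
        # The storage location details are returned as a JSON string
        if isinstance(value, str) and "STORAGE_AWS_IAM_USER_ARN" in value:
            props = json.loads(value)
            print("STORAGE_AWS_IAM_USER_ARN:", props["STORAGE_AWS_IAM_USER_ARN"])
            print("STORAGE_AWS_EXTERNAL_ID :", props["STORAGE_AWS_EXTERNAL_ID"])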
In the AWS console, update the trust relationship of the IAM role with the policy shown below, replacing <snowflake_user_arn> with the STORAGE_AWS_IAM_USER_ARN and <snowflake_external_id> with the STORAGE_AWS_EXTERNAL_ID recorded from the DESC EXTERNAL VOLUME output.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<snowflake_user_arn>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<snowflake_external_id>"
        }
      }
    }
  ]
}
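If you would rather script this step than use the console, the same trust policy can be applied with boto3. This is a rough sketch, assuming locally configured AWS credentials and that the placeholders are replaced with the values returned by DESC EXTERNAL VOLUME.
# Rough sketch: apply the trust policy to the IAM role programmatically.
import json
import boto3

snowflake_user_arn = "<STORAGE_AWS_IAM_USER_ARN>"    # placeholder
snowflake_external_id = "<STORAGE_AWS_EXTERNAL_ID>"  # placeholder

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {"AWS": snowflake_user_arn},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": snowflake_external_id}},
        }
    ],
}

iam = boto3.client("iam")
iam.update_assume_role_policy(
    RoleName="snowflake-role-for-iceberg",
    PolicyDocument=json.dumps(trust_policy),
)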
Grant usage on external volume to the role used to create Iceberg tables.
USE ROLE SECURITYADMIN;
GRANT USAGE ON EXTERNAL VOLUME aws_s3_iceberg_vol TO ROLE iceberg_role;
Step 3: Create Iceberg Tables in Snowflake
Create an Iceberg table named IB_CUSTOMERS in Snowflake and insert data from the Snowflake sample datasets.
--Create Snowflake Iceberg Table IB_CUSTOMERS
USE ROLE iceberg_role;
USE DATABASE iceberg_db;
USE SCHEMA iceberg;
CREATE OR REPLACE ICEBERG TABLE ib_customers (
c_custkey INTEGER,
c_name STRING,
c_address STRING,
c_nationkey INTEGER,
c_phone STRING,
c_acctbal INTEGER,
c_mktsegment STRING,
c_comment STRING
)
CATALOG='SNOWFLAKE'
EXTERNAL_VOLUME='aws_s3_iceberg_vol'
BASE_LOCATION='ib_customers';
--Insert 10K records from Snowflake Sample data.
INSERT INTO ib_customers
SELECT * FROM snowflake_sample_data.tpch_sf1.customer WHERE c_custkey BETWEEN 60001 AND 70000;
Validate the metadata and data files created in the S3 bucket.
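A quick way to validate this from Python is to list the table's BASE_LOCATION prefix with boto3. This is a small sketch, assuming the iceberg-tables bucket from the external volume definition and locally configured AWS credentials.
# Small sketch: confirm that Snowflake wrote Parquet data files and Iceberg metadata
# under the table's BASE_LOCATION in the S3 bucket backing the external volume.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="iceberg-tables", Prefix="ib_customers/")

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])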
Step 4a: Access Snowflake Iceberg Tables From PySpark Console
Create environment variables with the required parameters.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ['SNOWFLAKE_CATALOG_URI'] = "jdbc:snowflake://eressas-vkb12345.snowflakecomputing.com"
os.environ['SNOWFLAKE_ROLE'] = "ICEBERG_ROLE"
os.environ['SNOWFLAKE_USERNAME'] = "ICEBERG_USER"
os.environ['SNOWFLAKE_PASSWORD'] = "*********"
os.environ['PACKAGES'] = "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,net.snowflake:snowflake-jdbc:3.14.2,software.amazon.awssdk:bundle:2.20.160,software.amazon.awssdk:url-connection-client:2.20.160"
os.environ['AWS_REGION'] = "us-east-1"
os.environ['AWS_ACCESS_KEY_ID'] = "**********"
os.environ['AWS_SECRET_ACCESS_KEY'] = "**********"
Create a Spark session with the required packages.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('iceberg_lab')\
.config('spark.jars.packages', os.environ['PACKAGES'])\
.config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')\
.getOrCreate()
Set the Spark configurations for the Snowflake catalog and S3 access.
spark.conf.set("spark.sql.defaultCatalog", "snowflake_catalog")
spark.conf.set("spark.sql.catalog.snowflake_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.snowflake_catalog.catalog-impl", "org.apache.iceberg.snowflake.SnowflakeCatalog")
spark.conf.set("spark.sql.catalog.snowflake_catalog.uri", os.environ['SNOWFLAKE_CATALOG_URI'])
spark.conf.set("spark.sql.catalog.snowflake_catalog.jdbc.role", "ICEBERG_ROLE")
spark.conf.set("spark.sql.catalog.snowflake_catalog.jdbc.user", "ICEBERG_USER")
spark.conf.set("spark.sql.catalog.snowflake_catalog.jdbc.password", os.environ['SNOWFLAKE_PASSWORD'])
spark.conf.set("spark.sql.iceberg.vectorization.enabled", "false")
spark.conf.set("spark.sql.catalog.snowflake_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
spark.conf.set("spark.hadoop.fs.s3a.access.key", os.environ['AWS_ACCESS_KEY_ID'])
spark.conf.set("spark.hadoop.fs.s3a.secret.key", os.environ['AWS_SECRET_ACCESS_KEY'])
spark.conf.set("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
spark.conf.set("spark.hadoop.fs.s3a.endpoint.region", os.environ['AWS_REGION'])
Access the Iceberg table using Spark SQL and the Spark DataFrame API.
# read with spark.sql
df = spark.sql("select * from iceberg_db.iceberg.customer")
df.show()
# read with spark.table
df = spark.table("iceberg_db.iceberg.ib_customers")
df.show()
Step 4b: Access Iceberg Tables From Databricks
To access Snowflake Iceberg tables from Databricks, add the same packages and configurations from step 4a through the Databricks UI.
Edit the compute cluster to add the Spark configurations listed below, along with the AWS environment variables from step 4a (AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY).
spark.sql.catalog.snowflake_catalog.uri jdbc:snowflake://xxxxxx-xxxxxxx.snowflakecomputing.com
spark.sql.catalog.snowflake_catalog org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.snowflake_catalog.jdbc.password *******
spark.databricks.cluster.profile singleNode
spark.sql.catalog.snowflake_catalog.jdbc.user ICEBERG_USER
spark.sql.iceberg.vectorization.enabled false
spark.sql.catalog.snowflake_catalog.catalog-impl org.apache.iceberg.snowflake.SnowflakeCatalog
spark.sql.catalog.snowflake_catalog.io-impl org.apache.iceberg.aws.s3.S3FileIO
Update the catalog and run SQL queries in the Databricks notebook.
%sql
USE CATALOG snowflake_catalog;
SHOW NAMESPACES;
%sql
select * from iceberg_db.iceberg.ib_customers;
df_cust = spark.table("iceberg_db.iceberg.ib_customers")
df_cust.show()
Closing
The steps to create Snowflake-managed Iceberg tables are quite similar to those for creating an external table, with the external volume being the new prerequisite object. We use Snowflake SQL syntax to create the tables in Snowflake, and Spark or Spark SQL syntax to access them from the Spark/Databricks environment.
At this point, we cannot modify the data of Snowflake-managed Iceberg tables from Spark, and Spark does not support time-travel queries against them. However, as a workaround, we can read data from a specific table snapshot using the Spark command below.
spark.read.format("iceberg").load("s3a://bucket_name/ib_customers/metadata/v1704352930730000000.metadata.json").show()
If your organization is interested in creating iceberg tables in Snowflake, the Snowflake experts at phData can help! Contact us today for advice, best practices, actionable strategies, and more!
FAQs
What is the difference between external tables and Iceberg tables in Snowflake?
While external tables are read-only tables created to access data files stored outside Snowflake, Iceberg tables are read/write-capable tables that store data in public cloud storage using an open table format. Iceberg tables also support features such as time travel, schema evolution, hidden partitioning, and multiple query engines.