As the concept of the lakehouse evolves, the table format wars are heating up. Several open table formats, such as Delta Lake, Hudi, and Iceberg, exist in the industry. As organizations adopt these formats, interoperability often becomes a problem. Sometimes different business units within the same organization adopt different formats, so there will always be a case where one unit needs to consume another's format. Organizations do not always want to ingest (or convert) these formats into their preferred format before consuming them, but they also do not want to compromise performance.
Recently, Databricks and Snowflake have pushed Delta Lake and Iceberg, respectively, as the de facto formats. However, many organizations do not use Databricks or Snowflake for all of their compute needs. Snowflake previously offered a way to read Delta tables as external tables, but that feature never reached GA. Snowflake does, however, support loading Parquet files directly using the COPY command or reading them directly as external tables.
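For reference, a minimal, hypothetical example of that direct Parquet load into a standard table might look like the following (the stage and table names are placeholders):
-- Load Parquet files from an existing external stage into a standard table,
-- mapping Parquet columns to table columns by name (illustrative names only).
COPY INTO customer_raw
FROM @my_parquet_stage
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;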
What is the Difference Between External Tables, Iceberg, and Delta Lake?
In Snowflake, external tables are pointers to data files (Parquet, CSV, JSON, etc.) in object storage (S3, Azure Blob Storage) that are read directly when queried.
Delta Lake is an open table format created and standardized by Databricks. It uses Parquet as the underlying data files and a Delta log (JSON files) to identify the latest, active records.
Iceberg is another open table format, developed as an Apache project, designed to work with Parquet, Avro, or ORC data files. It uses manifest files and manifest lists to track data changes. Snowflake supports Iceberg only with Parquet data files.
Recently, Snowflake announced two different features to handle Delta Lake and Parquet files.
Delta Direct – enables organizations to consume Delta Lake tables as Iceberg tables, with performance close to that of regular Iceberg tables.
Parquet Ingest – ingests Parquet files directly into Iceberg tables without rewriting them, as long as the Parquet files are in an Iceberg-compatible format.
The two features differ in how the data is loaded or ingested.
When do you use Delta Direct vs Parquet Ingest?
Delta Direct – If an existing process, such as Databricks or EMR, writes Delta Lake files to object storage and you want to read them as an Iceberg table without modification, use Delta Direct. In this case, the table is unmanaged because an external process controls the catalog, and the table is read-only.
Parquet Ingest – Use this method to ingest or load Parquet files into Snowflake and manage them as an Iceberg table. It works similarly to loading CSV or Parquet files into Snowflake with the COPY command, but instead of parsing the file contents, Snowflake simply copies the files to the Iceberg table’s storage location (external volume). This table is read-write. Do not use this method if the underlying format is Delta Lake.
Both options require different types of objects within Snowflake.
Delta Direct
External volume for the Iceberg table. This volume points to the source location where the Delta Lake table resides; there is no “target,” as the table is read-only.
Catalog Integration object
Iceberg table creation
Parquet Ingest
External volume for the Iceberg table. In this case, the volume acts as the target.
Storage integration for the external stage.
An external stage for Parquet files. This is the source location from which Parquet files will be read.
No catalog integration is required as Snowflake maintains the catalog.
Let’s figure out how to configure the required objects for Delta Direct and Parquet Ingest.
Delta Direct Configuration:
The following configurations are required for the Delta Direct setup.
External Volume: Configuring an external volume varies by cloud service provider. This process establishes the authentication and permissions needed to access external object storage. For example, an external volume backed by an S3 bucket requires the following:
Creation of the S3 bucket, if it doesn’t already exist
An IAM policy on the S3 bucket with the necessary permissions
An IAM role with the policy from the previous step attached
Creation of the external volume itself
Follow the Snowflake documentation for detailed steps for each cloud provider. Once the policy and authentication are configured, the external volume can be created, as shown below.
CREATE OR REPLACE EXTERNAL VOLUME iceberg_delta_external_volume
STORAGE_LOCATIONS =
(
(
NAME = 's3-external-volume'
STORAGE_PROVIDER = 'S3'
STORAGE_BASE_URL = 's3://<bucket>/<path>/'
STORAGE_AWS_ROLE_ARN = '<iam_role_arn>'
STORAGE_AWS_EXTERNAL_ID = 'iceberg_table_external_id'
)
);
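After creating the volume, DESC EXTERNAL VOLUME returns the IAM user ARN and external ID that Snowflake uses, which you need when finishing the trust relationship on the IAM role (see the Snowflake documentation for the exact trust policy):
-- Retrieve STORAGE_AWS_IAM_USER_ARN and the external ID for the IAM role trust policy
DESC EXTERNAL VOLUME iceberg_delta_external_volume;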
Catalog Integration object: Snowflake offers different types of catalog integrations (Glue, Open Catalog, REST, etc.). The type of integration needed for Delta Direct is the “Object Storage” integration.
CREATE CATALOG INTEGRATION iceberg_delta_integration
CATALOG_SOURCE = OBJECT_STORE
TABLE_FORMAT = DELTA
ENABLED = TRUE;
Now that we have an external volume configured and catalog integration set up, an Iceberg table can be created directly on top of external object storage, as shown below.
CREATE ICEBERG TABLE iceberg_delta_table
CATALOG = iceberg_delta_integration
EXTERNAL_VOLUME = iceberg_delta_external_volume
BASE_LOCATION = 'relative/path/from/ext/vol/';
The base location is the relative path where the Delta files reside.
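Once created, the table can be queried like any other Snowflake table. Because an external process writes the Delta log, the table generally needs to be refreshed to pick up new commits; a minimal sketch, using the table created above:
-- Query the Delta Lake data through the Iceberg table
SELECT COUNT(*) FROM iceberg_delta_table;
-- Refresh the table metadata to pick up commits written by the external process
ALTER ICEBERG TABLE iceberg_delta_table REFRESH;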
Privileges:
Only account admins or a role with the CREATE EXTERNAL VOLUME privilege can create an external volume. Similarly, only the ACCOUNTADMIN role or any role with the CREATE INTEGRATION privilege can create the catalog integration. However, once these objects are set up, USAGE can be granted to SYSADMIN or developer roles, which can then create Iceberg tables, as sketched below.
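A minimal sketch of those grants, assuming the objects created above and SYSADMIN as the receiving role (the database and schema names are hypothetical):
-- Allow SYSADMIN to use the external volume and catalog integration
GRANT USAGE ON EXTERNAL VOLUME iceberg_delta_external_volume TO ROLE SYSADMIN;
GRANT USAGE ON INTEGRATION iceberg_delta_integration TO ROLE SYSADMIN;
-- Allow SYSADMIN to create Iceberg tables in a target schema
GRANT CREATE ICEBERG TABLE ON SCHEMA demo_db.demo_schema TO ROLE SYSADMIN;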
Parquet Ingest:
To ingest Parquet files into a Snowflake Iceberg table, an external volume must first be configured, following the same steps as in the Delta Direct section: create the S3 bucket if it doesn’t already exist, attach an IAM policy and role, and then create the external volume. In this case, the bucket is read-write, and Snowflake will write data into it. Once the policy and authentication are configured, the external volume can be created, as shown below.
CREATE OR REPLACE EXTERNAL VOLUME iceberg_external_volume
STORAGE_LOCATIONS =
(
(
NAME = 's3-external-volume'
STORAGE_PROVIDER = 'S3'
STORAGE_BASE_URL = 's3://<bucket>/<path>/'
STORAGE_AWS_ROLE_ARN = '<iam_role_arn>'
STORAGE_AWS_EXTERNAL_ID = 'iceberg_table_external_id'
)
);
An Iceberg table can be created once the external volume is in place, as shown below. In this case, the catalog is set to “SNOWFLAKE,” indicating that it is a Snowflake-managed Iceberg table.
CREATE ICEBERG TABLE demo_iceberg_table (id int, name string, address string, zip string )
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_external_volume'
BASE_LOCATION = 'relative/path/to/iceberg/table';
Since this is a Snowflake-managed Iceberg table, data must be ingested into it. The standard COPY command, which parses the file contents and loads the data into the table in Iceberg format, can be used. However, a more efficient approach is a new COPY option that directly copies the existing Parquet files (assuming they contain Iceberg-compatible data types) from an external stage location. The external stage must be configured with a storage integration; refer to Snowflake’s documentation for setting up an external stage with a storage integration for S3.
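A minimal sketch of that setup, using placeholder bucket paths and role ARN; the file format enables the vectorized scanner, which the Snowflake documentation calls for when loading Parquet this way:
-- Storage integration granting Snowflake access to the source bucket (placeholders)
CREATE STORAGE INTEGRATION parquet_s3_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = '<iam_role_arn>'
STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket>/<path>/');
-- Parquet file format used by the COPY command below
CREATE FILE FORMAT parquet_format
TYPE = PARQUET
USE_VECTORIZED_SCANNER = TRUE;
-- External stage pointing at the source Parquet files
CREATE STAGE external_parquet_stage
URL = 's3://<bucket>/<path>/'
STORAGE_INTEGRATION = parquet_s3_integration
FILE_FORMAT = (FORMAT_NAME = 'parquet_format');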
Once all of these are configured, the files can be loaded into the Iceberg table using the command below.
COPY INTO demo_iceberg_table
FROM @external_parquet_stage
FILE_FORMAT = (FORMAT_NAME = 'parquet_format')
LOAD_MODE = ADD_FILES_COPY
PURGE = TRUE
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE;
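With LOAD_MODE = ADD_FILES_COPY, Snowflake copies the Parquet files into the Iceberg table’s storage location rather than parsing and rewriting them; PURGE = TRUE removes the source files from the stage after a successful load, and MATCH_BY_COLUMN_NAME maps the Parquet columns to the table columns. A quick sanity check after the load:
-- Verify the ingested rows; the table is fully read-write from here on
SELECT COUNT(*) FROM demo_iceberg_table;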
Why not create an external table directly on the Parquet files instead of using the COPY command to load the data? While external tables work well, loading the data into the Iceberg table significantly improves performance, making queries nearly as fast as those against regular tables.
The performance can be summarized in one line as:
External tables over S3 < Iceberg over Delta (unmanaged Iceberg) < Snowflake-managed Iceberg
How do we Choose Between These Options?
If you have external Parquet files in an Iceberg-compatible format and need high performance, use the COPY command with the ADD_FILES_COPY option to load the data. If your data is stored in Delta Lake on object storage and the files are Iceberg-compatible, access them directly using the Delta Direct feature.
Suppose you have external Parquet files in object storage, partitioned into different folder structures, and your queries only access a subset of the data at a time. In that case, external tables are the best choice, as in the sketch below.
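A minimal sketch of that external-table approach, assuming the same Parquet stage as above and an illustrative folder layout where the first path segment under the stage is a date (for example, 2024-01-01/file.parquet); the column names are placeholders:
-- External table over partitioned Parquet files (column and partition names are illustrative)
CREATE EXTERNAL TABLE demo_external_table (
event_date DATE AS TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 1), 'YYYY-MM-DD'),
customer_id VARCHAR AS (VALUE:customer_id::VARCHAR)
)
PARTITION BY (event_date)
LOCATION = @external_parquet_stage
FILE_FORMAT = (TYPE = PARQUET);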
Closing
This blog explored various methods for loading Parquet files from external locations into Snowflake. At phData, our experts have extensive experience with Lakehouse architectures and have successfully executed numerous platform migrations to Snowflake. Feel free to reach out if you have any questions!