For many large enterprises on Azure, Snowflake is the backbone of their data platform, and Apache Iceberg is increasingly evaluated as the open table format for building modern data lakes on Azure ADLS Gen2. As these platforms grow, this question will inevitably come up:
How do we stay flexible for the future?
This is why Apache Iceberg has become such a hot topic. I recently set up an Iceberg table on Azure, and my first thought was:
Wait, where did my folders go?
It was a great reminder that moving to an open table format like Iceberg changes the rules. You gain freedom from being locked into one vendor, but you also take on new responsibilities. Some responsibilities that Snowflake previously abstracted, such as file layout and cost visibility, now become explicit.
This guide is for architects and leads who are ready to move past the basic “how-to” steps. We won’t just list features; we will show you the practical design choices that determine if your data lake is easy to manage or a costly headache.
At a Glance
Adopting Apache Iceberg on Azure ADLS Gen2 is a strategic move to mitigate proprietary lock-in and ensure long-term data flexibility. While Snowflake continues to provide the high-performance “brain” (catalog and compute), the physical storage of your data becomes your responsibility.
Key Architectural Takeaways:
Data Ownership: Your data resides in your Azure account in an open format, accessible by any Iceberg-compatible tool (Spark, Trino, etc.).
Metadata-Driven Logic: Apache Iceberg uses hidden partitioning to boost query performance, while Snowflake writes UUID-suffixed folders to prevent file conflicts.
The Cost Shift: Storage fees move from Snowflake to Azure. This makes cross-region egress a visible line item that must be managed.
Active Maintenance: To avoid “small file” performance lag, architects must intentionally stay on top of compaction and file sizing strategies.
Why Open Table Formats Matter Now
In the past, choosing a data platform meant locking your data into that vendor’s private storage. Today, organizations want to store their data in an open, standard format such as Apache Iceberg and then use whatever tool is best for the job, whether that’s Snowflake for analytics, Spark for data engineering, or a new AI engine.
Apache Iceberg has become the go-to choice because it brings database-like reliability directly to cloud object storage. It allows you to:
Decouple Storage from Compute: You own the data in your Azure ADLS Gen2 account, while Snowflake provides the high-performance engine to query it.
Gain “Database-like” Features: You get ACID transactions (no more messy partial writes), schema evolution (change your columns without a rewrite), and Time Travel (view data as it existed yesterday).
Enable Multi-Engine Access: Since the format is open, different teams can use various tools on the same dataset without needing to move or duplicate files.
Now that Snowflake supports Iceberg natively, you get the best of both worlds: the openness of a data lake with the governance and speed of a world-class data warehouse.
Modern Data Lake Architecture: Snowflake + Azure ADLS Gen2
For organizations already running Snowflake on Azure, adopting Apache Iceberg is less about adding a new system and more about extending an existing, well-understood data platform.
Figure 1 illustrates this relationship clearly: Iceberg data and metadata reside in ADLS Gen2, while Snowflake provides the execution and governance layer through a dedicated account-level integration. This separation allows enterprises to retain ownership of their data in cloud storage while continuing to rely on Snowflake for performance, security, and platform maturity.
The integration setup is simple. After creating the ADLS Gen2 storage account and container, a Snowflake external volume is defined with the storage location and Azure tenant information. This step generates a service principal and a consent flow, after which the service principal is granted the Storage Blob Data Contributor role on the storage container.
With that single trust setup in place, Snowflake can securely read and write Iceberg data without ongoing manual intervention.
Figure 2 summarizes this flow across both platforms, showing the minimal set of actions required on the Azure side and the Snowflake side. The simplicity of this setup is deliberate: it ensures teams can move quickly past configuration and focus instead on the design and operational considerations that truly shape long-term success with Iceberg.
-- create external volume using accountadmin privileges with storage URL + Azure tenant id.
CREATE OR REPLACE EXTERNAL VOLUME azure_ext_volume_for_iceberg
STORAGE_LOCATIONS =
(
(
NAME = 'azure-ext-volume'
STORAGE_PROVIDER = 'AZURE'
STORAGE_BASE_URL = 'azure://myicebergdemostorage001.blob.core.windows.net/myicebergcontainer/'
AZURE_TENANT_ID = '12345-b3ca-4df0-ae1d-8b93e421da'
)
);
-- describe the external volume to get the consent url
DESC EXTERNAL VOLUME azure_ext_volume_for_iceberg;
-- after the consent URL step & role assignment shown in Figure 2, run the verification function
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('azure_ext_volume_for_iceberg');
-- ready to create an Iceberg table using the external volume
create or replace iceberg table customer_iceberg(
CUSTOMER_ID BIGINT,
SALUTATION STRING,
PREFERRED_CUST_FLAG BOOLEAN,
REGISTRATION_TIME TIMESTAMP,
....
)
catalog = 'SNOWFLAKE'
external_volume = 'azure_ext_volume_for_iceberg'
base_location = 'myicebergcontainer/csvdata/customer_data';
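Once created, the table accepts standard DML and queries like any other Snowflake table. A minimal sketch (the staging source table name is illustrative, not part of the setup above):

```sql
-- load sample rows from an existing native table (source name is illustrative)
INSERT INTO customer_iceberg (CUSTOMER_ID, SALUTATION, PREFERRED_CUST_FLAG, REGISTRATION_TIME)
SELECT customer_id, salutation, preferred_cust_flag, registration_time
FROM staging.customer_raw;

-- query it like any other table; Snowflake writes Parquet + Iceberg metadata to ADLS Gen2
SELECT COUNT(*) FROM customer_iceberg WHERE PREFERRED_CUST_FLAG = TRUE;
```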
Design Responsibilities with Iceberg
With Iceberg, Snowflake continues to provide compute, governance, and a managed catalog, but some decisions that were previously invisible now become explicit. This is not a change in capability, but a change in responsibility. Because data is stored in object storage and organized through Iceberg metadata, design choices such as file sizing, compaction behavior, and partitioning strategy start to play a more visible role in how the platform behaves over time.
For many teams, this shift is unexpected.
Snowflake native tables abstract most storage-level considerations, allowing performance to feel automatic. Iceberg introduces a more transparent model, where Snowflake optimizes execution while architects retain influence over how data is physically laid out and evolves.
When approached intentionally, this transparency becomes an advantage: it enables better cost awareness, predictable performance, and long-term flexibility. It does, however, require a different way of thinking about table design from the start. The rest of this guide focuses on those design choices.
Iceberg Directory Structure in Azure ADLS Gen2
When Snowflake creates an Iceberg table backed by ADLS Gen2, it deliberately writes data and metadata into a directory that includes a short, system-generated identifier appended to the table path. This identifier is not arbitrary. It exists to avoid conflicts in shared storage environments, where multiple Snowflake accounts may point to the same container, or where tables with the same logical name may be created, dropped, and recreated over time.
By ensuring each Iceberg table instance has a unique physical location, Snowflake preserves metadata integrity and prevents accidental overlap between independent table lifecycles.
An important behavior to be aware of is how Snowflake handles table drops in this model. Dropping an Iceberg table removes the metadata reference from Snowflake, but it does not delete the underlying data and metadata files from ADLS Gen2.
If the table is later undropped, Snowflake simply re-establishes the reference to the same directory, allowing access to resume without rewriting data. This approach aligns with Iceberg’s open, storage-first philosophy and reinforces the need for teams to treat storage locations as long-lived assets, managed intentionally rather than assumed to be ephemeral.
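This drop/undrop behavior uses the standard Snowflake commands; a minimal sketch:

```sql
-- removes the Snowflake reference only; files remain in ADLS Gen2
DROP ICEBERG TABLE customer_iceberg_tbl;

-- re-attaches the table to the same storage directory, no data rewrite
UNDROP ICEBERG TABLE customer_iceberg_tbl;
```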
Iceberg Table Metadata Visibility in Snowflake
Once an Iceberg table is created in Snowflake, it is registered as an Iceberg table type while still supporting familiar Snowflake capabilities such as Time Travel. From an operational and architectural standpoint, Snowflake exposes rich metadata about Iceberg tables to help teams understand how the tables behave beneath the surface.
You can invoke the system procedure SYSTEM$GET_ICEBERG_TABLE_INFORMATION to inspect the current Iceberg metadata file and related table state, providing visibility into the active snapshot and versioning details managed by the Snowflake-managed catalog.
In addition, standard Snowflake commands such as SHOW PARAMETERS, DESCRIBE TABLE, and SHOW ICEBERG TABLES expose extended properties specific to Iceberg tables. These include column definitions, retention duration, change tracking settings, whether the table is an Iceberg table, partition parameters, and high-level indicators such as data volume and data size.Â
Together, these properties influence how Iceberg tables evolve, how long historical data is retained, and how storage and metadata grow over time. They become increasingly important for architectural and operational decisions, which are explored in the sections that follow.
-- describe the Iceberg table to show its columns and data types
desc iceberg table customer_iceberg_tbl;
-- show the retention period, data volume, and data size
show tables like 'customer_iceberg_tbl';
-- fetch the latest Iceberg metadata file the table currently points to
select SYSTEM$GET_ICEBERG_TABLE_INFORMATION('mydb.public.customer_iceberg_tbl');
Cost Considerations for Iceberg Tables in Snowflake
When Iceberg tables are backed by external object storage, cost behavior is more transparent than with Snowflake native tables. In scenarios where the Snowflake account and the underlying storage are hosted with different cloud providers or regions, data movement between Snowflake compute and the storage layer can incur measurable data transfer charges.
Unlike native tables, where these costs are largely abstracted away, Iceberg makes this interaction visible because data is explicitly read from and written to external storage. This does not mean Iceberg is inherently more expensive, but it does mean architects must be aware of where compute is running relative to where data resides, especially for read-heavy workloads or cross-cloud deployments.
Figure 5 illustrates this clearly by showing data transfer activity categorized as data lake traffic, including the source cloud, target cloud, region, and bytes transferred. This level of visibility enables teams to reason about costs with precision and make informed architectural decisions, such as aligning compute and storage locations, optimizing file layouts, or adjusting workload patterns.
Understanding this interaction early is essential because file sizing, compaction strategy, and query behavior directly influence how often and how much data is moved, which is why cost awareness naturally precedes deeper design discussions.
When Snowflake compute and Iceberg storage reside in the same cloud and region, there is no cross-cloud egress cost, and Snowflake continues to bill only for virtual warehouse (compute) usage and standard cloud services when working with Iceberg tables.
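The transfer activity shown in Figure 5 can also be inspected in SQL via the SNOWFLAKE.ACCOUNT_USAGE.DATA_TRANSFER_HISTORY view; a sketch (column names as documented at the time of writing, and the view has up to a few hours of latency):

```sql
-- summarize the last 30 days of transfer activity by source/target cloud and region
SELECT source_cloud, source_region, target_cloud, target_region,
       transfer_type, SUM(bytes_transferred) AS total_bytes
FROM SNOWFLAKE.ACCOUNT_USAGE.DATA_TRANSFER_HISTORY
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2, 3, 4, 5
ORDER BY total_bytes DESC;
```

A sustained non-zero row where source and target clouds differ is the signal to revisit compute/storage placement.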
Iceberg Table Compaction in Snowflake
For Snowflake-managed Iceberg tables, compaction is handled as part of the table lifecycle to keep both data and metadata efficient over time. Data compaction combines small data files into larger, more efficient files, helping manage storage growth and maintain consistent query performance.Â
In most scenarios, this process has little impact on compute costs, but Snowflake allows architects to disable data compaction at the account, database, schema, or table level when workloads are infrequently queried or when compaction behavior needs to be tightly controlled.
-- ENABLE_DATA_COMPACTION defaults to TRUE
CREATE OR REPLACE ICEBERG TABLE customer (customer_id INT, ....)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'my_external_volume'
BASE_LOCATION = 'my_iceberg_table'
ENABLE_DATA_COMPACTION = FALSE;
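The parameter can also be adjusted after creation. A hedged sketch: the table-level ALTER is straightforward, and the schema-level form below is an assumption based on Snowflake's usual account > database > schema > table parameter hierarchy — verify it against your account before relying on it:

```sql
-- re-enable data compaction for a single table
ALTER ICEBERG TABLE customer SET ENABLE_DATA_COMPACTION = TRUE;

-- assumption: the same parameter set at schema scope, inherited by
-- Iceberg tables in that schema unless overridden at the table level
ALTER SCHEMA mydb.public SET ENABLE_DATA_COMPACTION = FALSE;
```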
In addition to data compaction, Snowflake automatically performs manifest compaction to optimize the Iceberg metadata layer by consolidating smaller manifest files. This reduces metadata overhead and improves query planning efficiency, and it is always enabled by design.
Together, these mechanisms ensure that Iceberg tables remain performant and manageable while still giving teams the flexibility they need where it matters most.
It is important to recognize that if data compaction behavior is altered and table churn is high, small files can accumulate in the storage location over time, increasing both metadata volume and storage footprint unless compaction is intentionally managed as part of the overall design.
Iceberg File Size Strategy in Snowflake
File size plays a foundational role in how Iceberg tables behave over time, influencing query performance, compaction efficiency, and cross-engine interoperability. Snowflake allows architects to define a target file size for both Snowflake-managed and externally managed Iceberg tables, either explicitly or using the AUTO option.
For Snowflake-managed tables, AUTO enables Snowflake to dynamically select and adjust file sizes based on table characteristics such as data volume, DML patterns, ingestion workload, and clustering behavior, starting conservatively and evolving over time for optimal performance.
Once a target file size is set as shown in Figure 6, Snowflake applies it immediately to new writes and gradually aligns existing files through asynchronous maintenance. Because file size directly affects read efficiency, write behavior, and compaction, it should be treated as a deliberate architectural decision made early, rather than a reactive optimization later.
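As a sketch of how that target is expressed via the TARGET_FILE_SIZE parameter (the exact set of accepted values, such as 'AUTO' or fixed sizes like '64MB', may vary by Snowflake release, so check current documentation):

```sql
-- set an explicit target file size at creation time
CREATE OR REPLACE ICEBERG TABLE customer_sized (customer_id INT)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'azure_ext_volume_for_iceberg'
BASE_LOCATION = 'customer_sized'
TARGET_FILE_SIZE = '64MB';

-- or let Snowflake pick and adapt the size dynamically
ALTER ICEBERG TABLE customer_sized SET TARGET_FILE_SIZE = 'AUTO';
```

New writes honor the target immediately; existing files converge through asynchronous maintenance, as described above.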
Snapshot Expiry and Time Travel in Iceberg Tables
Snapshot expiry is handled automatically for Iceberg tables to manage long-term storage and metadata growth. Based on predefined retention policies, Snowflake systematically removes older snapshots along with any metadata files that are no longer referenced by the table’s active history.
This process is always enabled and cannot be disabled, ensuring that table history remains meaningful and storage usage does not grow indefinitely. From an architectural perspective, snapshot expiry provides a predictable balance between historical access and sustainable storage management, without requiring manual intervention.
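Within the retention window, Time Travel on Iceberg tables uses the familiar Snowflake AT | BEFORE syntax; a minimal sketch against the table shown earlier:

```sql
-- read the table as it existed one hour ago (offset in seconds)
SELECT COUNT(*)
FROM customer_iceberg_tbl AT(OFFSET => -60*60);

-- or as of a specific point in time (timestamp value is illustrative)
SELECT *
FROM customer_iceberg_tbl AT(TIMESTAMP => '2024-01-15 08:00:00'::TIMESTAMP_LTZ)
LIMIT 10;
```

Once a snapshot has been expired by the retention policy, queries against that point in time will fail, which is the trade-off behind predictable storage growth.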
Partitioning Strategy for Iceberg Tables in Snowflake
Partitioning in Snowflake-managed Iceberg tables is logical rather than directory-based. When a table is partitioned, Snowflake records the partition information in Iceberg metadata manifest files instead of encoding it into the storage path.
As a result, data is not physically organized into rigid folder structures, and partition values remain attributes of the data itself rather than assumptions derived from directory names.
This metadata-driven approach allows Snowflake to efficiently prune data during query execution by reading manifest information and skipping irrelevant files, while keeping the storage layout simple and flexible.
Because partition values are stored in Parquet files, the data remains interoperable with other Iceberg-compatible engines, such as Spark or Trino. More importantly, this design enables partition evolution over time, allowing partition strategies to change as data access patterns evolve, without forcing disruptive storage rewrites or directory restructuring.
create or replace iceberg table customer_partition (
C_CUSTKEY BIGINT,
C_NAME STRING,
C_ADDRESS STRING,
C_NATIONKEY INT,
...
)
partition by (C_NATIONKEY)
catalog = 'SNOWFLAKE'
external_volume = 'azure_ext_volume_for_iceberg'
base_location = 'customer_partition';
Key Takeaways for Running Iceberg Tables in Snowflake
Adopting Apache Iceberg with Snowflake on Azure is not just about changing a CREATE statement; it’s about shifting your mindset from a closed warehouse to an open ecosystem. As we’ve seen, the behavioral nuances (file size strategy, compaction behavior, snapshot expiry, partitioning, directory layout, and cost visibility) all work together to shape how Iceberg tables perform, scale, and evolve.
When these aspects are understood upfront, Iceberg becomes a powerful enabler, delivering open data ownership, predictable performance, and long-term flexibility without sacrificing the operational maturity enterprises expect from Snowflake.
At the same time, it is important to acknowledge that this guide only covers a subset of the architectural and operational considerations involved in running Iceberg at scale. Topics such as workload patterns, ingestion strategies, multi-engine access, monitoring, governance alignment, and long-term platform evolution each deserve deeper discussion and cannot be fully addressed in a single guide.
You don’t have to navigate the Iceberg alone.
At phData, we specialize in these high-stakes architectural shifts. We’ve seen where the “standard implementation” falls short at enterprise scale and how to tune these systems for both multi-engine performance and financial predictability.
Whether you are struggling with orphan file cleanup, configuring cross-cloud egress, or designing a future-proof partitioning strategy, our technical team is ready to help.




