Designing Efficient Snowflake External Tables for Cost Optimization

Many enterprises, large or small, are storing data in cloud object storage like AWS S3, Azure ADLS Gen2, or Google Bucket because it offers scalable and cost-effective solutions for managing vast amounts of data.

Suppose you have to analyze such extensive data without loading them into the Snowflake Data Cloud warehouse platform. In that case, Snowflake offers a schema-level object called an external table that can reference this data set and can be queried without loading them into the Snowflake table.

The external tables in Snowflake support six different types of data file formats, like CSV, Parquet, ORC, Avro, JSON & XML, and they can be integrated with three major cloud providers (AWS, Azure & GCP) using schema-level stage objects or account level integration objects.

We all know that the snowflake-supported tables are more cost-efficient and leverage micropartition-based storage, reducing the storage cost and making queries faster, eventually speeding up data analysis work and saving tons of virtual warehouse costs.

However, there are business reasons why you need to start your data analysis with external tables. Querying external tables with large data files is slower and also compute-intensive.

In this blog, we will cover some of the external table design considerations, best practices, and dos & don’ts that we have learned working with our customers that save tons of their computing costs and also make their data analysis faster.

How To Inspect Your External Table Structure & Attributes

You can use the Snowsight WebUI to inspect the external table construct and its associated stage and file format. The Table Details tab (as shown in Figure 1) shows the stage location, file format, cloud provider, and other information about the Snowflake external table.

The Columns tab (as shown in Figure 2) displays the column names, data types, tags, and masking policies. However, these two views have some limitations. They do not display whether the external table columns are of type column, virtual, or partitioned.

To understand the table design of an external table, you can run the desc external table SQL statement. This will return a result set (as shown in Figure 3) that shows the column names, data types, and other information about the table, including the type of each column (column, virtual, or partitioned). The result set also shows the expressions used to derive the values of the virtual columns.

				
					--run show command
show external tables like 'orders%';

--run desc command
desc external tables orders_ext type='column'; 
desc external tables orders_ext type='stage';

The desc external table statement also takes an optional parameter called type. If you specify the value stage for this parameter, the statement will return information about the stage associated with the external table.

Querying External Tables With Limit Clause

To illustrate the performance of queries on external tables, we created two external tables called orders and order_items. These tables have millions of records, so we can demonstrate the difference in query performance when the tables are recreated using the best practices recommended in this article.

The order table has about 44 million records spread across 1468 partition files in the stage location. The order-item table, on the other hand, has about 182 million records spread across 1516 partition files. Figure 4 shows sample records from the order and order-item tables and their partition statistics.

It is common practice to run a select statement with a limit clause to fetch the first ten records of a table before analyzing and profiling it. This allows us to get a general overview of the data before we run any aggregated or analytical queries.

If you run a select statement with a limit clause on the order external table, you will see an interesting result, mainly when you review the query profile page.

In Figure 5, a select query with a limit clause scanned 34 partition files out of 1468. It read around 2,210 records and then displayed the first ten records.

We added the metadata$filename column to our query to see which files the records were coming from. The query results showed that all the records came from a single file. So, if all the records come from a single file, why did Snowflake scan 34 partition files?

Snowflake does not store record-level metadata for each file. This makes it difficult for the Snowflake query engine to estimate how many records are in each file. As a result, the query engine reads and scans more files than necessary to get the results back to the user.

If you rerun the same query, assuming the result cache is not enabled, you will see different results in the query profile each time. It is impossible to predict whether Snowflake will scan 34 partition files one time and 60 partition files the next session.

No parameter or configuration allows you to control which files Snowflake reads from your external table definition. However, you can specify a single file in the location parameter or add a pattern parameter to scan specific files.

Another important thing to note is that if the file sizes in your stage location are not uniform, even a simple select query can take much longer to run. In Figure 6, there are 316 partition files, but Snowflake had to scan 60 of them to get the requested ten records.

This is because the file sizes are not evenly distributed. Some files are much larger than others, and Snowflake has to scan all the files to find the necessary records.

Based on these observations, it is better to have smaller files with uniform record distribution in the stage location. This will help queries with limit clauses to run faster by scanning fewer files. Smaller file sizes make copy commands more efficient and easier to debug, saving computing and analysis costs.

External Table Query Performance For Partitioned Data

Storing large amounts of data in partitioned folders is a good idea. The way the folders are partitioned depends on the data domain. For example, if the data is time-series, the folders can be partitioned by date. The folders can be partitioned by source if the data is from different sources.

In Figure 7, the order and order-item external tables are mapped to a stage location where the data files are stored in partitioned folders. The data in these tables comes from two different regions, EU and APAC. Therefore, the data is stored in separate partition folders for each region. The folders are further partitioned by year, month, and day.

To create the order external table, the following SQL statement is executed. It has 7 virtual columns that are derived from the implicit value column. The values in these columns are cast to specific data types using the double-colon notation.

				
					--create external table
CREATE OR REPLACE EXTERNAL TABLE SANDBOX.SALES.ORDERS_EXT (
    order_id_pk number as (value:c1::number),
    customer_id_fk number as (value:c2::number),
    status varchar(1) as (value:c3::varchar(1)),
    total_price number(10,2) as (value:c4::number(10,2)),
    order_dt date as (value:c5::date),
    order_priority varchar(20) as (value:c6::varchar(20)),
    shipping_priority number as (value:c7::number)
)
location=@SANDBOX.SALES.STAGE_LOCATION//entity=orders
auto_refresh=false
file_format=SANDBOX.SALES.CSV_FF;

Let’s run a simple aggregation operation by having two filter clauses in the SQL query. The query calculates the total sales price for the year 2022 and month 01.

Based on the query profile in Figure 8, the Snowflake query engine reads all 1468 partition files from the stage location. It scanned a total of 44.7 million records, applied the filters to narrow down the results to 3.74 million records, and then aggregated the data. The total query execution time was 15 seconds, and 88% was spent on external scans. This is an expensive operation and an inefficient way to run the query.

If we add a third filter to our aggregate query to find the total sales for a particular day, the Snowflake query engine will still scan all stage files. This is because the query engine does not have any way of knowing which files contain the data that it needs.

As shown in Figure 9, the query engine scanned all 1468 partition files, even though only a few contained the data it needed. This resulted in a total query execution time of 15 seconds, and 88% of that time was spent on external scans.

Design External Table With Partition Columns

Unlike standard tables, which allow us to add a cluster key, external tables will enable us to add a partition key. This can make external table queries more performant, especially when the data is filtered using the partitioned columns.

In our case, the data in the stage location is already partitioned by region/year/month/day. As shown below, we can use the filename metadata column to add a partition column to our external table definition. We can then use the partition-by parameter to tell the Snowflake query engine to scan only the files that match the partition criteria.

				
					CREATE OR REPLACE EXTERNAL TABLE SANDBOX.SALES.PARTITIONED_ORDERS_EXT (
    order_id_pk number as (value:c1::number) ,
    customer_id_fk number as (value:c2::number) ,
    status varchar(1) as (value:c3::varchar(1)) ,
    total_price number(10,2) as (value:c4::number(10,2)) ,
    order_dt date as (value:c5::date) ,
    order_priority varchar(20) as (value:c6::varchar(20)),
    shipping_priority number as (value:c7::number),
    entity_partition varchar AS  split_part(metadata$filename, '/', 3),
    source_partition varchar AS  split_part(metadata$filename, '/', 4 ),
    year_partition varchar AS  split_part(metadata$filename, '/', 5 ),
    month_partition varchar AS  split_part(metadata$filename, '/', 6 ),
    day_partition varchar AS  split_part(metadata$filename, '/', 7 )
) 
partition by (entity_partition, source_partition, year_partition, month_partition, day_partition)
location=@SANDBOX.SALES.STAGE_LOCATION/entity=orders
auto_refresh=false
file_format=SANDBOX.SALES.CSV_FF;

After recreating the external table with partition columns, the partition columns will also appear along with the virtual columns when you run the describe external SQL statement, as shown in Figure 10.

Now that the order external table is partition-enabled, the same total sales aggregation query is much more efficient. As shown in Figure 11, the query only scanned four files, and the overall query execution time was 383 milliseconds. Only 124k records were read from those 4 partitioned files, and the result was produced.

Figure 12 compares the query performance of the order table when it does not have a partition column and when it does have a partition column. The results show that the query performance is 97% faster when the table has a partition column.

Multiple External Tables & Query Performance

In the real world, we often need to join multiple tables, whether they are external or internal, to complete our debugging or data analysis tasks. We will see how the query performance of joining two or more external tables changes and whether we need to make any changes to our WHERE clause to apply filter criteria.

Let’s review the DDL statement for the order-item table. We start with the external order-item table without defining any partition columns. Here is the DDL statement to create the order-item external table:

				
					CREATE OR REPLACE EXTERNAL TABLE SANDBOX.SALES.ORDER_ITEMS_EXT (
    order_id_pk number as (value:c1::number),
    product_id  number as (value:c2::number),
    item_number  number as (value:c3::number),
    quantity  number(4) as (value:c4:: number(4)),
    price  number(7,2) as (value:c5::number(7,2)), 
    discount number(3,2) as (value:c6::number(3,2)),
    tax number(3,2) as (value:c7::number(3,2)),
    return_flag varchar(1) as (value:c8::varchar(1)),
    status varchar(1) as (value:c9::varchar(1)),
    order_dt date as (value:c10::date),
    ship_dt date as (value:c11::date),
    commit_dt date as (value:c12::date),
    receipt_dt date as (value:c13::date),
    ship_instruction varchar as (value:c14::varchar),
    ship_mode varchar as (value:c15::varchar),
    comment varchar as (value:c16::varchar)
) 
location=@SANDBOX.SALES.DEV_AZURE_STAGE_LOCATION/entity=order_items
auto_refresh=false
file_format=SANDBOX.SALES.CSV_FF;

The order and order-item entities have a one-to-many relationship, with the order-id being the common key. The order table has approximately 44 million records, and the order-item table has approximately 182 million records.

Let’s execute a query that joins these two external tables. The intent of this query is to find the total sales value for the year 2022 and month 01 and calculate the total sales value, average sales value, and total tax value.

As shown in Figure 14, the query profile shows that it scanned 2,984 partition files and read almost all of the approximately 226 million records (44 million + 182 million). It then applied the filter criteria before applying the aggregation operation. The query took around 2 minutes to compute the result.

Another simulation was conducted where the same query was executed by joining the partitioned and non-partitioned order-item tables. Figure 15 shows a significant improvement, with the query finishing in 1 minute and 48 seconds and scanning only 1,638 files.

Let’s add partition columns to the order-item table, as shown below, and then rerun the join query.

				
					CREATE OR REPLACE EXTERNAL TABLE SANDBOX.SALES.PARTITIONED_ORDER_ITEMS_EXT (
    order_id_pk number as (value:c1::number),
    product_id  number as (value:c2::number),
    item_number  number as (value:c3::number),
    quantity  number(4) as (value:c4:: number(4)),
    price  number(7,2) as (value:c5::number(7,2)), 
    discount number(3,2) as (value:c6::number(3,2)),
    tax number(3,2) as (value:c7::number(3,2)),
    return_flag varchar(1) as (value:c8::varchar(1)),
    status varchar(1) as (value:c9::varchar(1)),
    order_dt date as (value:c10::date),
    ship_dt date as (value:c11::date),
    commit_dt date as (value:c12::date),
    receipt_dt date as (value:c13::date),
    ship_instruction varchar as (value:c14::varchar),
    ship_mode varchar as (value:c15::varchar),
    comment varchar as (value:c16::varchar),
    entity_partition varchar AS  split_part(metadata$filename, '/', 3),
    source_partition varchar AS  split_part(metadata$filename, '/', 4 ),
    year_partition varchar AS  split_part(metadata$filename, '/', 5 ),
    month_partition varchar AS  split_part(metadata$filename, '/', 6 ),
    day_partition varchar AS  split_part(metadata$filename, '/', 7 )
) 
partition by (entity_partition, source_partition, year_partition, month_partition, day_partition)
location=@SANDBOX.SALES.DEV_AZURE_STAGE_LOCATION/entity=order_items
auto_refresh=false
file_format=SANDBOX.SALES.CSV_FF;

As shown in Figure 16, the query execution time was reduced to 14 seconds, and it only scanned 248 partitioned files.

As shown in Figure 17, which compares the query profiles of the two different scenarios, partitioning your external tables with appropriate partition keys can significantly reduce the compute cost and query execution time. This can help you to achieve your analytical workload or data validation activities much faster.

Result Cache & Queries With External Tables

Result caching is a powerful feature in Snowflake that stores the results of previous queries in memory. This can significantly improve the performance of subsequent queries that access the same data, as the data does not need to be reread from external storage.

There are two ways to validate whether a query uses a result cache. We can check the query profile screen (Figure 18) or the query_history view in the account_usage schema (Figure 19).

The result scan table function is another built-in function allowing you to fetch data from the cache (in memory) to extend your data analysis on the precomputed data set.

4 Tips For Better External Table Design

We have simulated some important scenarios to see how queries perform and how efficiently the Snowflake query engine scans the stage files for external tables. Here are the five tips to remember when loading data files into a stage location and creating external tables on those stages.

By following these best practices, you can improve the performance of your external tables and get the insights you need faster.

If you manage a large amount of data, store the external files in partition folders in the stage location.
Avoid creating huge data files, and make sure the records are distributed evenly among the data files.
When designing external tables, use partition columns and partition key parameters to enable partitioning, making the external data scan process more efficient.
Queries that refer to external tables can use result caching, so make sure the caching flag is not disabled for your snowflake session. For very large datasets, use the result_scan feature to save scan costs.

If you are interested in exploring more opportunities to improve the configuration, security, performance, and efficiency of your Snowflake environment, you can also check phData’s advisory tool kit.

Conclusion

In this blog post, we discussed improving the query performance for external tables in Snowflake. By properly partitioning the data files in the stage location and using partition columns and partition key parameters to design the external tables, you can significantly improve the performance of queries and save on data scanning costs. This can help you to speed up your data analysis work and get the insights you need faster.

Ready to supercharge your data efficiency and cost savings? Start implementing these Snowflake External Table design strategies today!

Designing Efficient Snowflake External Tables for Cost Optimization

How To Inspect Your External Table Structure & Attributes

Querying External Tables With Limit Clause

External Table Query Performance For Partitioned Data

Design External Table With Partition Columns

Multiple External Tables & Query Performance

Result Cache & Queries With External Tables

4 Tips For Better External Table Design

Conclusion

More to explore

Alteryx Server 2024.2: New Features and How to Update

From Pipelines to Loops: How Fivetran + Census Reflects a Shift in Data Architecture

Snowflake Query Tagging Best Practices

Join our team

Partners

Resources

Software

Accelerate and automate your data projects with the phData Toolkit

Industries

Solutions

Company

Technology Partners

Other Technology Partners

Check out our latest insights

Alteryx Server 2024.2: New Features and How to Update

From Pipelines to Loops: How Fivetran + Census Reflects a Shift in Data Architecture

Data Engineering

Consulting, Migrations, Data Pipelines, DataOps

Change Management, Enablement & Learning

COE, Coaching, PMO

Data Science and Machine Learning Services

MLOps Enablement, Prototyping, Model Development and Deployment

Strategy Services

Data, Analytics, and AI Strategy, Architecture and Assessments

Reporting, Analytics, and Visualization Services

Self-Service, Integrated Analytics, Dashboards, Automation

Elastic Operations

Data Platforms, Data Pipelines, and Machine Learning