November 16, 2021

How Do I Optimize My Snowpipe Data Load?

By Deepa Ganiger

Let’s talk about optimizing Snowpipe pipelines. For those unfamiliar, Snowpipe is the Snowflake Data Cloud’s solution for near real-time or continuous data ingestion. 

Although Snowflake controls most of Snowpipe's behavior in terms of data loading and compute (the serverless virtual warehouse), some tweaks can be applied to ensure data ingestion and transformation are optimal. 

In this blog, we will talk about how to optimize various factors like file sizing, frequency of data load, and more!

What is Snowpipe?

Snowpipe is a serverless, scalable, and optimized data ingestion utility provided by Snowflake for continuously loading data into Snowflake tables. Snowpipe is especially useful when external applications continuously land data in external storage locations like S3 or Azure Blob that needs to be loaded into Snowflake as it arrives. Snowpipe works with both external and internal stages; however, the automation depends on where the files land.

How The Snowpipe Process Flow Works

Before getting started, it’s important to understand how the Snowpipe process flows. In this section, we’ve detailed the steps required to ingest data into Snowflake using Snowpipe (with an AWS S3 bucket as the example).

[Diagram: Snowpipe process flow]

1. Configure authentication to the storage system (S3 example is given below)

-- Create a storage integration that delegates access to the S3 bucket(s)
-- through an AWS IAM role
create storage integration s3_int
  type = external_stage
  storage_provider = s3
  storage_aws_role_arn = 'arn:aws:iam::001234567890:role/myrole'
  enabled = true
  storage_allowed_locations = ('s3://mybucket1/path1/', 's3://mybucket2/path2/');
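
After the integration is created, you can (as a quick sketch) describe it to retrieve the values Snowflake expects you to place in the IAM role's trust policy:

-- Returns the integration's properties; the STORAGE_AWS_IAM_USER_ARN and
-- STORAGE_AWS_EXTERNAL_ID values go into the IAM role's trust relationship
desc integration s3_int;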

2. Create stage object

-- Create an external stage that points at the bucket path and uses the
-- storage integration created in step 1
create stage mystage
  url = 's3://mybucket/load/files'
  storage_integration = s3_int;
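
As an optional sanity check, listing the stage confirms that Snowflake can reach the bucket through the integration (the stage name matches the example above):

-- Lists the files visible through the stage; an error here usually points to
-- a problem with the IAM role or the storage integration
list @mystage;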

3. Create pipe object

-- Create a pipe that auto-ingests JSON files from the stage into the target table
create pipe snowpipe_db.public.mypipe auto_ingest = true as
  copy into snowpipe_db.public.mytable
  from @snowpipe_db.public.mystage
  file_format = (type = 'JSON');
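
Once the pipe exists, its state can be checked at any point with SYSTEM$PIPE_STATUS (the pipe name below matches the example above):

-- Returns a JSON document including the pipe's execution state and the
-- number of files currently pending in the ingest queue
select system$pipe_status('snowpipe_db.public.mypipe');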

4. Auto Ingest using Event Notification approach 

Snowpipe can be configured to load continuously into tables by enabling auto ingest and configuring a cloud provider event notification to initiate the data load (see the example after the diagram below). This method works only with external stages. The different auto ingestion flows supported by Snowpipe are listed below:

  • AWS S3 → SNS → SQS → Event Notification → Snowpipe → Snowflake table
  • Azure Blob → Event Grid → Event Notification → Snowpipe → Snowflake table
  • Google Cloud Storage bucket → Pub/Sub → Event Notification → Snowpipe → Snowflake table
[Diagram: Snowpipe auto ingest methods]
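
For the S3 flow, the bucket's event notification is pointed at a Snowflake-managed SQS queue. A quick way to find its ARN is to inspect the pipe itself; the sketch below assumes the pipe created earlier:

-- The notification_channel column contains the ARN of the Snowflake-managed
-- SQS queue to use when configuring the bucket's event notification
show pipes like 'mypipe' in schema snowpipe_db.public;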

5. REST API approach  

Snowflake also provides a REST API option to trigger Snowpipe data loads. This option is useful when loads need to be invoked on demand, or when data must be loaded from an internal stage using Snowpipe. The API can be called from any tool or programming language that supports REST calls.

How to Optimize Snowpipe Data Load

Snowpipe resource consumption is based on various factors like file size, file type, transformation logic, and the internal load queue. The latency of the data load is also determined by these factors; however, it is difficult to estimate beforehand. 

The following are some of the items to consider when designing optimal data load solutions using Snowpipe.

File Sizing

Snowpipe is typically used to load data that arrives continuously. File sizing plays an important role in Snowpipe’s performance. The general recommendation for data loading is files of 100-250 MB compressed; however, if data is arriving continuously, try to stage it within roughly one-minute intervals.

An optimal size between 10-100MB compressed may be a good balance when loading through Snowpipe. 

Staging tiny files too frequently will increase the queue size on Snowflake, thereby increasing cost. Snowflake manages the files in an internal load queue and charges based on the number of files queued: approximately 0.06 credits per 1,000 files. If data lands via tools like Kafka or other streaming services, adjust their buffer parameters so that files are buffered and written out at specific intervals rather than dropped continuously.
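
To check whether small, frequent files are inflating the queue cost, per-pipe file counts and Snowpipe credit usage can be reviewed in the ACCOUNT_USAGE share; the one-day window below is only an illustrative choice:

-- Files loaded, bytes loaded, and Snowpipe credits consumed per pipe over the
-- last day (ACCOUNT_USAGE views can lag by a few hours)
select pipe_name,
       sum(files_inserted) as files_loaded,
       sum(bytes_inserted) as bytes_loaded,
       sum(credits_used)   as credits_used
from snowflake.account_usage.pipe_usage_history
where start_time >= dateadd('day', -1, current_timestamp())
group by pipe_name
order by credits_used desc;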

Staging large files will result in Snowpipe taking a longer time to load the data, thereby impacting performance. 

How Many AWS S3 Buckets/ Azure Blobs Do I Need?

The decision of single vs. multiple buckets should be driven by requirements around data security and governance rather than by Snowpipe itself. Because Snowpipe uses the COPY command to load data into tables, the COPY syntax can be adjusted to load from specific prefixes or folders (see the sketch below). 

Following AWS guidelines, Snowflake designates no more than one SQS queue per S3 bucket, and this queue may be shared by multiple buckets within the same AWS account. This means that regardless of whether you use a single bucket or multiple buckets, the event notifications eventually reach a single SQS queue, and Snowpipe loads the data based on the path specified in the COPY command. The SQS queue is managed by Snowflake and is not configured by the customer. 
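
As a sketch of that pattern (the pipe, table, and prefix names here are hypothetical), a single stage over one bucket can feed multiple pipes, each scoped to its own prefix:

-- Two pipes sharing one bucket/stage, each loading from a different prefix
create pipe snowpipe_db.public.orders_pipe auto_ingest = true as
  copy into snowpipe_db.public.orders
  from @snowpipe_db.public.mystage/orders/
  file_format = (type = 'JSON');

create pipe snowpipe_db.public.customers_pipe auto_ingest = true as
  copy into snowpipe_db.public.customers
  from @snowpipe_db.public.mystage/customers/
  file_format = (type = 'JSON');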

Can Snowpipe Help in Historical Data Load?

Snowpipe is generally used to automate incremental loads; however, it can also be used for historical data loads, depending on the number of files involved. Historical data here means files that already exist in S3 or Blob storage before Snowpipe is configured. When auto-ingest is enabled, Snowpipe only receives notifications for files that land in S3 after configuration, not for files that already exist.

The following Snowflake command can be used to trigger the Snowpipe for historical data:

ALTER PIPE [ IF EXISTS ] <name> REFRESH { [ PREFIX = '<path>' ] [ MODIFIED_AFTER = <start_time> ] }

The MODIFIED_AFTER option specifies the oldest file modification date to consider when copying files to the Snowpipe ingest queue for loading into the table. This may be an optimal way to load historical data into Snowflake, but it can be expensive if there are millions of files to load. If that is the case for you, an alternative such as a regular COPY INTO command or a third-party ETL tool may be a better fit.
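
As a hypothetical example against the pipe created earlier (the prefix and timestamp are placeholders), a targeted backfill could look like this:

-- Queue existing files under the given prefix (relative to the stage in the
-- pipe definition) that were modified after the given timestamp; note that
-- REFRESH only considers files staged within the last 7 days
alter pipe snowpipe_db.public.mypipe refresh
  prefix = '2021/11/'
  modified_after = '2021-11-10T00:00:00-08:00';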

Should I Use Snowpipe Auto Ingest or REST API?

Snowpipe’s auto ingest feature, driven by cloud event notifications, is the default option when files arrive continuously. The REST API can be a better fit when data arrives sporadically, or when pre-processing must be done with an ETL or ELT tool; the tool can push files to a stage and then call the REST API to load the data. It is also a good option when no external stage is available. 

Snowpipe auto ingest is more scalable than the REST API because it automatically monitors the queue size and scales as needed. However, both options require validation and error handling after the data is loaded into tables. Snowpipe’s COPY_HISTORY provides information about the data load, and with the REST API, insertReport can be called to verify the load afterward. In both cases, compare the results with the source file counts and reprocess any files that had data load errors.
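
A post-load check with the COPY_HISTORY table function might look like the sketch below; the table name and one-hour window are illustrative:

-- Files loaded into the target table in the last hour, with load status and
-- any first error message, for comparison against the source file list
select file_name, status, row_count, row_parsed, first_error_message
from table(information_schema.copy_history(
       table_name => 'MYTABLE',
       start_time => dateadd('hour', -1, current_timestamp())));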

Should I Purge Files After Data Load?

Snowflake generally recommends purging files after they have been loaded with the COPY command. Keeping a large number of files in S3 and referencing the file location (especially with regex or wildcard patterns) will impact performance. In the case of Snowpipe with auto-ingest, however, the SQS notifications carry the file names, and Snowpipe uses them directly for the data load. 

PRO TIP: For operational maintenance, it is always recommended to clean up the S3 files after the data load. Files can be cleaned up either with scheduled tasks or with S3 lifecycle policies.

Key Points to Consider When Using Snowpipe:

  1. Ensure appropriate permissions are defined in the S3 policy for data access.
  2. Compare the stage reference in the pipe definition with existing pipes. Verify that the directory paths for the same S3 bucket do not overlap.
  3. Ensure that files are neither too large nor too small; aim for an optimal size of 100-250 MB compressed.
  4. Define the post-load validation process to ensure all files are successfully loaded into target tables.

Conclusion

This blog summarized several opportunities to tune a Snowpipe data pipeline. The list is not exhaustive and will continue to grow as Snowflake updates Snowpipe and adds more data ingestion features. 

As a Snowflake Elite Partner, phData’s Data Architects and Engineers have extensive experience building successful data pipelines in Snowflake. If you have a use case for near real-time data ingestion using Snowflake, we’d love to help!
