January 3, 2022

How to Calculate Continuous Data Ingestion Costs Using Snowpipe in Snowflake

By Hiresh Roy

Snowpipe is a continuous data ingestion utility provided by the Snowflake Data Cloud that allows users to initiate loads of any size, charging their account based on actual compute resource usage.

With this pricing model, you only pay for what you use. The trouble is, it can be difficult for users to estimate Snowpipe credit consumption for their continuous data ingestion workloads.

In this blog, we will walk you through the Snowpipe credit computation so you and your team can easily estimate your continuous data ingestion costs and total cost of ownership (TCO). Additionally, we'll throw in a few best practices to help you optimize your credit consumption.

PRO TIP: If you’re looking for best practices around How to Optimize Snowpipe Data Load, we highly recommend you check out our other Snowpipe blog!

How are Snowpipe Charges Calculated?

In the Snowpipe serverless compute model, Snowflake provides and manages the compute resources, which automatically scale based on the workload. Your continuous ingestion workload may load a very wide table with no transformation, or a narrow table that reorders columns and applies transformations while executing a copy command. Snowflake tracks the resource consumption of loads for your pipe object with per-second/per-core granularity as Snowpipe actively queues and processes data files. Per-core refers to the physical CPU cores in a compute server. The recorded utilization is then translated into familiar Snowflake credits.

Besides serverless resource consumption, an overhead of 0.06 credits per 1,000 files notified is included in the utilization costs charged for Snowpipe. This overhead is charged regardless of whether the event notifications or REST API calls result in data being loaded.
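As a quick illustration (assuming a pipe that receives one file every minute, i.e., 1,440 notifications per day), the file overhead alone works out to a fraction of a credit per day:

-- Illustrative only: file-notification overhead for 1,440 files in a day
select 1440/1000 * 0.06 as overhead_credits_per_day;   -- ≈ 0.0864 credits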

Continuous Data Ingestion Workload & Key Metrics

To understand Snowpipe credit consumption, we will walk through the different workloads shown in the following table:

Workload table

We have taken the most commonly used file formats, CSV (uncompressed and compressed) and Parquet, for measuring Snowpipe credit consumption.

We will consider a structured data set with approximately 40 columns for our simulation. Scenario-1 will place a single 1MB CSV data file into the stage location, first with no compression and then with GZIP compression.

Scenario-2 will load 5MB of delta data using GZIP compression for CSV and Snappy for Parquet, and we will review the performance and credit consumption.

The last scenario, scenario-3, will load 10MB of delta data with the CSV/GZIP and Parquet/Snappy file format and compression combinations, and again review the performance and credit consumption.

To understand and validate the average credit cost, load latency, and overall load performance per scenario per workload, the continuous data loading will run for a 30-60 minute duration.

Snowpipe Auto Ingest Flow for Continuous Data Load

The data integration pipeline will continuously load data to the cloud storage location where you have set up your Snowflake integration object. The notification service (Azure Event Grid) will notify the pipe object, which will then execute a copy command to load the data.

A simple diagram showcasing how the Snowpipe auto ingest flow for continuous data load works.

Code Set

1. Integration object with cloud storage (Azure Blob)
				
create storage integration azure_int
  type = external_stage
  storage_provider = azure
  enabled = true
  azure_tenant_id = '<tenant_id>'
  storage_allowed_locations = ('azure://myaccount.blob.core.windows.net/mycontainer/path1/');
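As an optional check (not part of the original setup steps), you can describe the integration to confirm it was created and to retrieve the Azure consent details Snowflake needs to be granted access to the storage account:

-- Shows the integration's properties, including the Azure consent URL and multi-tenant app name
desc storage integration azure_int;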
				
			
2. Create external stage object
				
create stage my_blob_stage
  url = 'azure://blob_location/delta/'
  storage_integration = azure_int;
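Before wiring up the pipe, a quick LIST against the stage is a handy sanity check that the integration and URL resolve to the expected files:

-- Verify that Snowflake can see the files in the external location
list @my_blob_stage;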
				
			
3. Create pipe object
				
create pipe stage.raw.mypipe
  auto_ingest = true
  as
  copy into stage.raw.my_table
  from @my_blob_stage/csv_files
  file_format = (type = 'CSV' compression = 'GZIP');
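Once the pipe exists, you can confirm it is running and watch its backlog with SYSTEM$PIPE_STATUS (an optional verification step using the pipe created above):

-- Returns a JSON document with fields such as executionState and pendingFileCount
select system$pipe_status('stage.raw.mypipe');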
				
			

Snowpipe Credit Trend & Load Latency Simulation Result

The scenario-1 workload examines the credit usage and load latency with and without compression. On successful execution of the two workloads, the pipe_usage_history and copy_history queries return the following results:

Copy History Query

				
select
    pipe_name,
    file_name,
    status,
    row_count,
    file_size,
    pipe_received_time,
    first_commit_time,
    hour(pipe_received_time) as "nth hr",
    minute(pipe_received_time) as "nth min",
    round((file_size/(1000*1024)),2) as "size(mb)",
    timediff(second, pipe_received_time, first_commit_time) as "latency(s)"
from snowflake.account_usage.copy_history;
				
			

Figure 1: 1MB CSV data file without any compression


Figure 2: 1MB CSV data file with GZIP compression
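If you prefer a single average latency number per pipe over the row-level detail above, the same view can be aggregated; this is a minimal sketch, and the one-hour lookback window is an assumption you would adjust to your own run:

select
    pipe_name,
    count(*) as files_loaded,
    avg(timediff(second, pipe_received_time, first_commit_time)) as "avg latency(s)"
from snowflake.account_usage.copy_history
where pipe_received_time > dateadd(hour, -1, current_timestamp())
group by pipe_name;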

Pipe Usage History Query

				
select
    pipe_id,
    pipe_name,
    start_time,
    end_time,
    hour(start_time) as "nth hr",
    minute(start_time) as "nth min",
    credits_used,
    bytes_inserted,
    (bytes_inserted/(1000*1024)) as "size(mb)",
    files_inserted
from snowflake.account_usage.pipe_usage_history;
				
			

Figure 3: 1MB CSV data file without any compression


Figure 4: 1MB CSV data file with GZIP compression
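To turn the raw pipe_usage_history rows into a per-file credit figure comparable to the averages discussed below, you can aggregate the same view (a minimal sketch; adjust the time filter to your own simulation window):

select
    pipe_name,
    sum(files_inserted) as total_files,
    sum(credits_used) as total_credits,
    sum(credits_used) / nullif(sum(files_inserted), 0) as credits_per_file
from snowflake.account_usage.pipe_usage_history
where start_time > dateadd(hour, -1, current_timestamp())
group by pipe_name;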

If we visualize the latency trend and credit usage for the 1MB data set across the compressed and uncompressed workloads, the results look as shown below:


Figure 5: Credit Consumption Trend (1MB CSV data without compression)


Figure 6: Credit Consumption Trend (1MB CSV data with GZIP compression)

The average credit usage for the compressed data looks slightly cheaper than for the uncompressed data set; however, the difference is marginal. The average credit consumption for 1MB files pushed every minute is around 0.0003 credits per file, roughly equivalent to the per-second usage of an X-Small user-defined virtual warehouse.
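To put that comparison in perspective, an X-Small warehouse bills 1 credit per hour, so its per-second equivalent sits in the same ballpark as the ~0.0003 credits observed per file:

-- X-Small warehouse: 1 credit/hour expressed per second
select 1/3600 as xsmall_credits_per_second;   -- ≈ 0.00028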

On the other hand, when we look into data load latency (the time difference between when the pipe receives the notification and when the data is first committed into the target table), the trend for all three scenarios is approximately 15-25 seconds. This indicates that no matter how frequently you push your data to the external stage, it will have a latency of 15-25 seconds or higher.


Figure 7: Data Load Latency

The second scenario shows that as we add more workload to the continuous data ingestion, the average credit cost also increases. The pipe credit consumption for 5MB CSV/GZIP and Parquet/Snappy averages 0.00035 and 0.0003 credits per file, respectively.


Figure 8: Credit Consumption Trend (5MB CSV data with GZIP compression)


Figure 9: Credit Consumption Trend (5MB Parquet data with Snappy compression)

The third scenario deals with a larger delta file of 10MB and runs the simulation for 60 minutes. The 10MB CSV/GZIP and 10MB Parquet/Snappy workloads give an average pipe credit consumption of 0.0004 and 0.00032 credits per file, respectively.

Figures 10 & 11: Credit Consumption Trend (10MB CSV data with GZIP compression and 10MB Parquet data with Snappy compression)

Key Points to Consider For Snowpipe Credit Consumption

  1. Snowpipe credit consumption increases as the file size grows; however, the increase is not always linear.
  2. The credit consumption depends on the file format.
  3. The credit consumption also depends on the complexity of the transformations applied during the copy command via Snowpipe.
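Putting these observations together, you can build a rough back-of-the-envelope estimate for your own workload. As a hypothetical example, assume one 5MB GZIP file lands every minute at the ~0.00035 credits-per-file average observed in scenario-2, and an illustrative price of $3 per credit (your contracted rate will differ); note that the observed per-file figure may already include the file-notification overhead, which is shown separately here only for scale:

-- Hypothetical monthly estimate: one file per minute for 30 days
select
    60*24*30                 as files_per_month,       -- 43,200 files
    60*24*30 * 0.00035       as load_credits,          -- ≈ 15.1 credits
    60*24*30 / 1000 * 0.06   as overhead_credits,      -- ≈ 2.6 credits
    60*24*30 * 0.00035 * 3   as est_monthly_load_usd;  -- ≈ $45 at $3/credit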

Conclusion

At the end of the day, Snowpipe credit consumption is very reasonable compared to other data integration solutions available today. We hope this blog helps in your data journey.
