October 28, 2022

How to Optimize Stream Performance in Snowflake

By Ahmad Aburia

Streams can be a great tool for change data capture (CDC) in an efficient, low-overhead way. In this blog, we’ll explore the different types of Streams and how to optimize performance when using them in the Snowflake Data Cloud.

What Are Streams in Snowflake?

Streams are Snowflake native objects that manage offsets to track data changes for a given object (Table or View). There are two types of Streams: Standard and Append-Only. A Standard Stream tracks all DML operations on the object, while an Append-Only Stream tracks only INSERT operations.
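
As a quick illustration, here is roughly what defining and inspecting each type looks like (the table and Stream names below are placeholders):

  -- Standard Stream: tracks INSERTs, UPDATEs, and DELETEs on the table
  CREATE OR REPLACE STREAM orders_std_stream ON TABLE orders;

  -- Append-Only Stream: tracks INSERTs only
  CREATE OR REPLACE STREAM orders_ao_stream ON TABLE orders APPEND_ONLY = TRUE;

  -- Selecting from a Stream returns the changed rows plus metadata columns
  -- (METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID)
  SELECT * FROM orders_std_stream;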

For this blog, we will look at how to use Standard Streams, Append-Only Streams, and a combination of a Stream with timestamp-based filtering to capture change data and perform Extract, Load, Transform (ELT) operations as efficiently as possible.

Before we dive in, there are a few things to consider when deciding which type of Stream to use.

Here are a few essential questions you should ask yourself before choosing a Stream type:

  1. What operations will I encounter? Will I see Deletes and Updates, or will it be Insert only?
  2. Can my table be partitioned based on updates?
  3. What is the volume of change data?

With that in mind, let’s examine each Stream type in more detail to help you select the right one for your use case.

Standard Streams

When you need to capture Deletes and Updates, the first choice for CDC will be a Standard Stream, provided that the change data volume isn’t very large (megabytes per extraction), or that the source and destination tables can be partitioned on a key that groups together the records that change within the same operations (i.e., you can get good pruning on the source table and limit your rewrites on the destination partitions).
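
As a minimal sketch (object names, keys, and columns are placeholders rather than a real implementation), consuming a Standard Stream typically means a MERGE driven by the Stream’s metadata columns:

  MERGE INTO orders_target AS t
  USING (
      -- Rows changed since the last time the Stream was consumed.
      -- An UPDATE appears as a DELETE + INSERT pair with METADATA$ISUPDATE = TRUE,
      -- so drop the DELETE half and keep the INSERT half as the new row version.
      SELECT *
      FROM orders_std_stream
      WHERE NOT (METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE)
  ) AS s
  ON t.order_id = s.order_id
  WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' THEN DELETE
  WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
      UPDATE SET t.payload = s.payload, t.updated_timestamp = s.updated_timestamp
  WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
      INSERT (order_id, payload, updated_timestamp)
      VALUES (s.order_id, s.payload, s.updated_timestamp);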

Append-Only Streams

When the data being streamed is INSERT-only, an Append-Only Stream is the best option. It is the more performant of the two, and it allows a table to be truncated without the overhead of tracking the record deletions for the next consumption.
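
For example (again with placeholder names), consumption can be a plain INSERT ... SELECT rather than a MERGE, and the staging table can be truncated afterward without bloating the Stream with delete records:

  CREATE OR REPLACE STREAM staging_ao_stream ON TABLE staging_events APPEND_ONLY = TRUE;

  -- Consuming the Stream in a DML statement advances its offset when the transaction commits
  BEGIN;
  INSERT INTO events_target (event_id, payload, loaded_at)
      SELECT event_id, payload, CURRENT_TIMESTAMP()
      FROM staging_ao_stream;
  COMMIT;

  -- The staging table can now be truncated; an Append-Only Stream does not
  -- record the deletions, so the next consumption stays small
  TRUNCATE TABLE staging_events;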

Other Considerations

Of course, it would be great if every use case fit into one of the above scenarios, but in reality, data streaming is far more complex than that. Consider a situation where you want to track updates and inserts with a large volume of data per extraction; it just doesn’t fit neatly into the aforementioned scenarios.

For the remainder of this blog, we will go through a real-life example with metrics to show how two approaches performed vastly differently under the same load.

Stream Use Case

Scenario: We are given a single source table with large JSON records being updated with no pattern that will allow for good pruning/partitioning on the source table.

Initial Approach

The initial approach was to put a standard stream on the source table to track INSERTS/UPDATES and propagate those changes downstream using dbt to flatten the JSON records into a tabular format and apply some basic transformations. (See Diagram 1)

Diagram 1

Steps:

  1. Capture changes in Standard Stream
  2. Execute the dbt job every hour to:
    1. Consume data from Stream
    2. Flatten and transform data
    3. MERGE into destination table (a simplified version of this query is sketched after the list)
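
To make step 2 concrete, here is a rough, simplified version of the kind of SQL the hourly dbt job would generate; the VARIANT column, JSON paths, keys, and object names are all assumptions for illustration:

  -- Take only the INSERT half of each update pair from the Standard Stream,
  -- flatten the JSON payload, and merge the result into the destination table
  MERGE INTO orders_flat AS t
  USING (
      SELECT
          s.raw_payload:order_id::NUMBER       AS order_id,
          s.raw_payload:customer:name::STRING  AS customer_name,
          li.value:sku::STRING                 AS sku,
          li.value:quantity::NUMBER            AS quantity,
          s.updated_timestamp
      FROM orders_std_stream s,
           LATERAL FLATTEN(input => s.raw_payload:line_items) li
      WHERE s.METADATA$ACTION = 'INSERT'
  ) AS src
  ON t.order_id = src.order_id AND t.sku = src.sku
  WHEN MATCHED THEN UPDATE SET
      t.customer_name     = src.customer_name,
      t.quantity          = src.quantity,
      t.updated_timestamp = src.updated_timestamp
  WHEN NOT MATCHED THEN INSERT (order_id, customer_name, sku, quantity, updated_timestamp)
      VALUES (src.order_id, src.customer_name, src.sku, src.quantity, src.updated_timestamp);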

The Problem(s)

There were a few issues with the initial approach. The main one was the inability to prune the source table (80+ GB): the Stream would invoke a full outer join and fully scan the source table twice to retrieve the full set of records inserted/updated between runs.

This would take 30-60 minutes to run with an X-Large warehouse.

Revised Approach

After spending some time understanding where the performance issue was coming from, it was time to find a way around those inefficient self-joins. It was determined that we should bypass Standard Stream use for this implementation altogether, given that the updates are random and large.

The new approach would involve using Timestamps to find the latest updates on the source table, a metadata table to track the last run, and a temporary table that invokes time travel to retrieve changes. (See Diagram 2)

Steps:

  1. Run task every 30 minutes to:
    1. Retrieve last_run timestamp from metadata table
    2. Select all records from the Source table WHERE UPDATED_TIMESTAMP > last_run
    3. INSERT OVERWRITE INTO temp_table
    4. Update last_run in metadata table as MAX(UPDATED_TIMESTAMP) in temp_table
  2. Execute the dbt job every hour to:
    1. Consume data from the Append-Only Stream (NOTE: Since this Stream is append-only, it contains ALL changes since the last time the data was consumed, regardless of whether the data has been deleted between runs. This invokes Time Travel to retrieve all versions of the data between consumptions, and we take only the latest version of each record found on the Stream)
    2. Flatten and transform data
    3. MERGE into destination table (see the sketch after this list)
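
For illustration only, here is a sketch of the moving parts in the revised approach; every object name, column, and the warehouse below are assumptions standing in for the real implementation:

  -- One metadata table shared across destinations, one temp table per destination
  CREATE TABLE IF NOT EXISTS etl_metadata (
      destination_table STRING,
      last_run          TIMESTAMP_NTZ
  );

  CREATE OR REPLACE TABLE orders_temp LIKE orders_source;

  -- Append-Only Stream on the temp table feeds the hourly dbt job
  CREATE OR REPLACE STREAM orders_temp_ao_stream ON TABLE orders_temp APPEND_ONLY = TRUE;

  -- Task: every 30 minutes, pull only the rows updated since the last run
  CREATE OR REPLACE TASK load_orders_temp
      WAREHOUSE = transform_wh
      SCHEDULE  = '30 MINUTE'
  AS
  BEGIN
      -- INSERT OVERWRITE truncates and reloads the temp table; the Append-Only
      -- Stream records only the inserted rows, not the truncation
      INSERT OVERWRITE INTO orders_temp
          SELECT *
          FROM orders_source
          WHERE updated_timestamp >
                (SELECT last_run FROM etl_metadata WHERE destination_table = 'ORDERS_FLAT');

      UPDATE etl_metadata
         SET last_run = (SELECT MAX(updated_timestamp) FROM orders_temp)
       WHERE destination_table = 'ORDERS_FLAT';
  END;

  ALTER TASK load_orders_temp RESUME;

  -- Downstream (hourly dbt job): the Stream can hold several versions of the same
  -- record across runs, so keep only the latest version before flattening/merging
  SELECT *
  FROM orders_temp_ao_stream
  QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_timestamp DESC) = 1;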

The Pros and Cons

The pros of this approach are significant and easily measurable:

  • From 30-50 minutes to 5-10 minutes per load, allowing faster access to fresh data
  • The warehouse was scaled back from X-Large to Small, resulting in huge cost savings 

The cons, on the other hand, are relatively limited. Maintainability is a bit more complex, given the new overhead of:

  • A new metadata table (common across all destination tables)
  • A new temp table (one temp table per destination)
  • A new task to maintain metadata table and load temp table
  • This approach may not necessarily work for DELETEs

Closing

Given the different options for streaming data in Snowflake, it’s always best to analyze your specific use case and determine the best approach for your needs. 

In this example, we were able to reduce cost and time by analyzing the performance gaps in a simplistic approach and implementing a custom solution that fit the needs of our data. 

Discover how this played out in a real-life scenario for our customer NextGen by reading our case study. Don’t miss the overall architecture detailed at the bottom of the case study.

Looking for help optimizing Snowflake performance? We’ve got you covered!

As the 2022 Snowflake Partner of the Year, phData excels at helping organizations succeed with Snowflake. Learn more about our migrations, Snowflake Elastic Platform Operations, and Snowflake data engineering services today!
