June 1, 2022

Snowpark Performance Best Practices

By Nick Pileggi

Snowpark is a powerful programming abstraction that runs data pipelines and applications inside the Snowflake Data Cloud without moving the data to another product. Snowpark makes it easy to convert code into SQL commands that are executed on Snowflake, which makes building data transformations and applications on Snowflake easier.

Because Snowflake is a SaaS product, you will want your processes to be as efficient as possible. Compute is billed for the amount of time a warehouse runs, so more efficient transformations cost less.

In this post, we’ll cover some simple best practices and tips that will make your Snowpark applications run more efficiently.  

What is Snowpark Used For?

Snowpark was created by Snowflake to provide a more programmatic way to interact with data in Snowflake. This is accomplished via a DataFrame API and the Scala or Java languages (with Python support in the future). While the majority of users will interact with Snowflake via SQL, for more advanced data transformations and applications, using the Snowpark API can be much easier.

The use of the API allows an easier experience for developers and is significantly easier to test via unit and integration tests. But as with any managed service, making sure it performs as well as it can helps increase throughput and decrease costs.
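To give a sense of what the DataFrame API looks like, here is a minimal Snowpark Scala sketch. The connection properties file, table name, and column names are illustrative assumptions, not from the original post.

import com.snowflake.snowpark.Session
import com.snowflake.snowpark.functions.{col, lit}

object SnowparkExample {
  def main(args: Array[String]): Unit = {
    // Build a session from a properties file holding the account, user,
    // role, warehouse, and database/schema (the file name is hypothetical).
    val session = Session.builder.configFile("connection.properties").create

    // No SQL runs yet -- these calls only build up a query definition.
    val openOrders = session.table("ORDERS")
      .filter(col("ORDER_STATUS") === lit("OPEN"))
      .groupBy(col("CUSTOMER_ID"))
      .count()

    // The generated SQL executes on the warehouse only when an action
    // such as show(), collect(), or a write is called.
    openOrders.show()
  }
}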

3 Snowpark Performance Best Practices

Minimizing Fields

Snowpark converts Scala code into SQL. By default, loading a table into a DataFrame via Snowpark happens lazily and includes all columns. As transformations are added to the DataFrame, Snowpark nests more SQL onto the query, and the SQL only executes when the data is finally written or collected.

If the fields are not limited, all of the fields will be copied from the source table. If the table is large or wide, this can cause quite a bit of data to be moved (internally) and reloaded into Snowflake. To remedy this, we recommend limiting your fields to only the ones you need when writing or caching data, as in the sketch below.
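A rough Snowpark Scala sketch of limiting the projection before a write; the table and column names are hypothetical.

import com.snowflake.snowpark.{SaveMode, Session}
import com.snowflake.snowpark.functions.col

// Project only the two fields the target table needs, so the SQL that
// Snowpark generates copies two columns instead of the table's full width.
def writeCustomerEmails(session: Session): Unit = {
  session.table("CUSTOMERS")
    .select(col("CUSTOMER_ID"), col("EMAIL"))
    .write
    .mode(SaveMode.Overwrite)
    .saveAsTable("CUSTOMER_EMAILS")
}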

Caching Data Only as Needed

Since Snowpark executes its commands lazily, as Spark does, there are times when the same transformation on a DataFrame is used multiple times. In that case, the Snowpark DataFrame can be cached. In the background, Snowpark creates a temporary table and loads the transformed data into it. When the base DataFrame is used later on, that temporary table is read instead of recalculating the transformations.

While this is a very useful tool, overuse can cause performance issues. If a transformation is not used multiple times, it shouldn't be cached; caching can also help when breaking a very large query apart prevents disk spillage in the warehouse. But because of the columnar nature of Snowflake, the insert into the temporary table adds to the overall processing time, so unneeded cache operations can slow down the Snowpark job.
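A minimal sketch of a case where caching pays off: one aggregation feeding two different writes. The table and column names are hypothetical, and the name of the generated count column is assumed.

import com.snowflake.snowpark.Session
import com.snowflake.snowpark.functions.{col, lit}

def buildAccountReports(session: Session): Unit = {
  val counts = session.table("EVENTS")
    .filter(col("EVENT_DATE") >= lit("2022-01-01"))
    .groupBy(col("ACCOUNT_ID"))
    .count()

  // cacheResult materializes the aggregation into a temporary table once;
  // both writes below read from that table instead of recomputing it.
  val cached = counts.cacheResult()

  // "COUNT" is the assumed name of the column produced by count() above.
  cached.filter(col("COUNT") > lit(100))
    .write.saveAsTable("ACTIVE_ACCOUNTS")
  cached.filter(col("COUNT") <= lit(100))
    .write.saveAsTable("QUIET_ACCOUNTS")
}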

Redistributing Input on UDFs

One of the great uses of Snowpark is working with Java User-Defined Functions (UDFs) and letting Snowpark create and clean them up. While the Java UDFs will execute just fine 90 percent of the time, there are occasions when a UDF takes much longer than normal. This is normally a good time to engage Snowflake support to investigate the UDF, as they have much more insight into how the code is operating on the warehouse.

But sometimes the data may be skewed, and when Snowflake distributes it to the nodes in the warehouse, the distribution may be suboptimal. In this case, adding df.sort(random()) re-sorts the data randomly, removing the skew and allowing the warehouse to parallelize the UDF calls for greater throughput.
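A sketch of what that looks like in Snowpark Scala, assuming a Java UDF named score_text has already been registered; the table, column, and UDF names are hypothetical.

import com.snowflake.snowpark.Session
import com.snowflake.snowpark.functions.{callUDF, col, random}

def scoreDocuments(session: Session): Unit = {
  // The random sort shuffles the rows so any skew in the input is broken up
  // before the UDF runs, spreading work more evenly across the warehouse.
  session.table("DOCUMENTS")
    .sort(random())
    .withColumn("SCORE", callUDF("score_text", col("BODY")))
    .write
    .saveAsTable("DOCUMENT_SCORES")
}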

Closing

Snowpark provides a powerful way to interact with Snowflake while still using developer best practices such as testing and automation. As with most SQL systems, a few things should be done to minimize the risk of slow, expensive processes and pipelines. With these three best practices, your applications should run more efficiently.

Looking for more help with Snowflake or Snowpark? As an Elite Snowflake Consulting Partner, phData excels at helping growth-hungry companies succeed with Snowflake. Learn more about our services and technologies today!

FAQs

How does Snowpark run my code?

Snowpark converts the DataFrame operations into Snowflake SQL commands that are then executed against Snowflake.

How does caching work in Snowpark?

The Snowpark API provides a caching method that inserts the DataFrame into a temporary table, which is then used in later calls.

