September 20, 2023

What is Snowpark — and Why Does it Matter? A phData Perspective

By phData Marketing

This blog was originally written by Keith Smith and updated for 2023 by Nick Goble & Dominick Rocco. 

You’ve probably heard of the Snowflake Data Cloud, but did you know that Snowflake also offers a revolutionary set of libraries and runtimes called Snowpark? 

As one of just a few fortunate partners, phData has had early access to Snowpark since its inception. Through that experience, we have seen first-hand the enormous potential Snowpark offers and we’re excited to share our perspective!

In this blog, we’ll explore what Snowpark is, how it’s evolved over the years, why it’s so important, what pain points it solves, and much more! 

What is Snowflake’s Snowpark?

Snowpark is the set of libraries and runtimes in Snowflake that securely deploy and process non-SQL code, including Python, Java, and Scala. 

On the client side, Snowpark consists of libraries, including the DataFrame API and native Snowpark machine learning (ML) APIs for model development (public preview) and deployment (private preview). 

On the server side, runtimes include Python, Java, and Scala in the warehouse model or Snowpark Container Services (private preview). In the warehouse model, developers can leverage user-defined functions (UDFs) and stored procedures (sprocs) to bring in and run custom logic.

Snowpark Container Services are available for workloads that require the use of GPUs, custom runtimes/libraries, or the hosting of long-running full-stack applications.
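
To make that concrete, here is a minimal sketch of what the client-side Snowpark Python experience looks like. The connection parameters, table, and column names below are placeholders of our own; the point is that the DataFrame calls are translated into SQL and executed inside a Snowflake virtual warehouse rather than on the client machine.

    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import col, avg

    # Hypothetical connection details -- substitute your own account information.
    connection_parameters = {
        "account": "<account_identifier>",
        "user": "<user>",
        "password": "<password>",
        "warehouse": "<warehouse>",
        "database": "<database>",
        "schema": "<schema>",
    }

    # Create a session; subsequent work is pushed down to Snowflake compute.
    session = Session.builder.configs(connection_parameters).create()

    # Build a DataFrame lazily -- no data leaves Snowflake.
    orders = session.table("ORDERS")  # hypothetical table
    summary = (
        orders
        .filter(col("ORDER_STATUS") == "SHIPPED")
        .group_by("REGION")
        .agg(avg("ORDER_AMOUNT").alias("AVG_ORDER_AMOUNT"))
    )

    # Execution happens in the virtual warehouse when results are requested.
    summary.show()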

Why Does Snowpark Matter?

As a declarative language, SQL is powerful because it lets users from all backgrounds ask questions of their data. But the complex logic behind large-scale applications and pipelines can become difficult to express and understand in SQL alone.

Complex problems are often much easier to solve using software engineering principles on top of a full-featured programming language like Python, Scala, or Java.  

Most importantly, Snowpark helps developers leverage Snowflake’s computing power to ship their code to the data rather than exporting data to run in other environments where big data is a second-class citizen. This can be a major optimization.  

In fact, we’ve seen Snowpark accelerate workloads by up to 99 percent!

Snowpark is especially powerful because it enables:

  • Custom Software Development – Teams can now create custom applications with complex logic using the functional programming paradigm of the Snowpark API.

  • Traditional DevOps and Engineering Standards – Developers can create more reliable and deployable applications by writing unit tests and leveraging CI/CD deployment pipelines that push to Snowflake.

  • Better Partner Integrations – Snowpark lets partners build software that works better on Snowflake. Applications like Dataiku, dbt, and Matillion can push complex computations down to Snowflake, so teams may already be using Snowpark through those tools without even realizing it.

  • Open-Source Libraries – Python, Java, and Scala are vastly more extensible than SQL and allow the development of rich open-source software. Snowpark lets developers run those libraries at scale on Snowflake compute to process data both internal and external to Snowflake, as sketched below.
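
As one illustration of that last point, here is a hedged sketch of registering a Python UDF that pulls an open-source package (numpy, in this invented example) from the Snowflake Anaconda channel so the library runs next to the data. It reuses the session object and the hypothetical ORDERS table from the earlier sketch.

    from snowflake.snowpark.functions import call_udf, col
    from snowflake.snowpark.types import FloatType

    def log_amount(amount: float) -> float:
        import numpy as np  # resolved on the server from the Anaconda channel
        if amount is None:
            return None
        return float(np.log1p(amount))

    # Register the function as a UDF; `packages` declares its open-source dependency.
    session.udf.register(
        func=log_amount,
        name="LOG_AMOUNT",
        return_type=FloatType(),
        input_types=[FloatType()],
        packages=["numpy"],
        replace=True,
    )

    # Apply the UDF at scale on Snowflake compute.
    orders = session.table("ORDERS")
    orders.select(call_udf("LOG_AMOUNT", col("ORDER_AMOUNT")).alias("LOG_ORDER_AMOUNT")).show()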

Who Should Use Snowpark?

Snowpark is really great, but it’s not for everyone. One of the greatest things about Snowflake is that it lets users do really big things with SQL, especially when paired with tools like dbt. But some workloads are particularly well-suited for Snowpark. We think those workloads fall into three broad categories:

  • Data Science and Machine Learning – Data scientists love Python, which makes Snowpark Python an ideal framework for machine learning development and deployment. Data scientists can use Snowpark’s DataFrame API to interact with data in Snowflake, and Snowpark UDFs are ideal for running batch training and inference on Snowflake compute. Working with data right inside Snowflake is significantly more efficient than exporting it to external environments. For one of our clients, we cut a 20-hour batch job down to minutes (see the ML inference case study below).

  • Data-Intensive Applications – Some teams develop dynamic applications that run on data. Snowpark lets those applications run directly on Snowflake compute, and it can be combined with Snowflake’s Native App and Secure Data Sharing capabilities so companies can process their customers’ data in a secure and well-governed manner. We’ve worked with one of our clients to do exactly that!

  • Complex Data Transformations – Some data cleansing and ELT workloads are complex, and SQL can inflate that complexity. A functional programming paradigm lets developers factor code for readability and reuse while providing a better framework for unit tests. On top of that, developers can bring in libraries from internal teams, third parties, or the open-source community. Snowpark Python lets all of that well-engineered code run on Snowflake compute without shipping data to an external environment (see the sketch after this list).
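
To illustrate the readability and unit-testing point in the last bullet, here is a hedged sketch of a transformation factored into a plain Python function that takes and returns a Snowpark DataFrame. The table and column names are invented, and the session object is the one from the earlier connection sketch.

    from snowflake.snowpark import DataFrame
    from snowflake.snowpark.functions import col, trim, upper

    def clean_customers(customers: DataFrame) -> DataFrame:
        """Standardize names and drop rows without an email address."""
        return (
            customers
            .with_column("FULL_NAME", upper(trim(col("FULL_NAME"))))
            .filter(col("EMAIL").is_not_null())
        )

    # In the pipeline, the transformation is pushed down to Snowflake compute.
    cleaned = clean_customers(session.table("RAW_CUSTOMERS"))
    cleaned.write.save_as_table("CLEAN_CUSTOMERS", mode="overwrite")

Because clean_customers depends only on the DataFrame API, it can be exercised in unit tests against small fixture tables, independently of the production pipeline.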

Why is Snowpark Exciting to Us?

phData has been working in data engineering since the company’s inception back in 2015, and until now we’ve had to treat data engineering and data analytics as separate worlds. We have seen customers transform their data analytics with Snowflake while transforming their data engineering and machine learning applications with Spark, Java, Scala, and Python. By offering native data engineering functionality inside Snowflake virtual warehouses, Snowpark lets a single, thriving data ecosystem grow around Snowflake.

The release of Snowpark makes our customers’ lives simpler by unifying their data lake and data warehouse workloads on one complete data platform.

Snowflake has empowered our customers to democratize their data — but they’ve remained limited to the functionality of the SQL language. Snowpark will allow for native integration with the data inside of Snowflake, whether it’s via DataFrames or other libraries and runtimes that are now available.

What are the Potential Pain Points Snowpark Solves?

After building and managing workloads at scale for the past six years, we recognize a handful of potential issues that come up when teams start developing code against large datasets:

  • Long Startup Time for Distributed Resources

    • Systems like Hadoop and Spark need a cluster of nodes up and running before any work can happen. On most platforms, spinning up that cluster takes 5-10 minutes before a single line of code can execute.

    • Snowpark solves this by tapping into Snowflake’s readily available virtual warehouses.

  • Efficient Coding Practices Over Datasets

    • Giving developers direct access to data introduces a learning curve: they need to pick up best practices for optimization, and that is especially true on large datasets.

    • Snowpark addresses this directly by providing a single layer for accessing data in Snowflake and opening up the DataFrame API. Developers don’t have to read files directly or worry about issues like the small-files problem (see the sketch after this list).

  • Managing and Troubleshooting Partitions and Garbage Collection

    • This is the most common problem with Hadoop or other file-based data lakes and tends to be the most difficult to solve.

    • Troubleshooting these issues generally requires deeper knowledge of data skew and of how code executes across compute nodes. Because Snowflake manages storage and compute under Snowpark, developers are largely freed from tuning partitions or garbage collection themselves.
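
As a rough illustration of the first two points above, the sketch below builds a query through the DataFrame API: nothing executes until results are requested, and when they are, the work runs on an already-running virtual warehouse instead of waiting for a cluster to spin up or juggling files. Table and column names are placeholders, and session comes from the earlier sketch.

    from snowflake.snowpark.functions import col, count, lit

    # Building the DataFrame is lazy -- no files to manage, no cluster to start.
    events = session.table("WEB_EVENTS")  # hypothetical table
    daily_counts = (
        events
        .filter(col("EVENT_TYPE") == "purchase")
        .group_by("EVENT_DATE")
        .agg(count(lit(1)).alias("PURCHASES"))
    )

    # Inspect the query Snowpark will push down to the warehouse.
    daily_counts.explain()

    # Execution happens here, on Snowflake's readily available compute.
    results = daily_counts.collect()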

If your data team doesn’t already have the programming skills required (Python, Java, or Scala), there can be a steep learning curve. But once the basics are in place, the way you handle and manipulate data remains largely the same.

What Has phData Done with Snowpark?

As a Snowpark Accelerated partner, we’ve been developing on Snowpark for a while now.  Here’s a snapshot of just a couple of projects we’ve completed:

ML Inference Pipeline

One of our clients was struggling with an ML inference pipeline that took over 20 hours to run. Most of that time was spent exporting a 2.5-million-record dataset from Snowflake to a Kubernetes environment for processing. Since the model itself was trained in Python, the workload was an ideal candidate for Snowpark Python.

The Snowpark version runs in just 13 minutes, which equates to a 99 percent reduction in runtime compared with the original 20-hour process!
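
We can’t share the client’s code, but a heavily simplified sketch of the pattern follows. The trained model is wrapped in a UDF (the scoring logic below is only a schematic stand-in), and inference runs over the full table on Snowflake compute instead of exporting 2.5 million records. All names and columns are hypothetical, and session is the one from the earlier sketch.

    from snowflake.snowpark.functions import call_udf, col
    from snowflake.snowpark.types import FloatType

    def score(feature_1: float, feature_2: float) -> float:
        # Schematic stand-in: a real UDF would load the trained model
        # (shipped via the `imports`/`packages` registration arguments)
        # and call its predict method.
        return 0.3 * feature_1 + 0.7 * feature_2

    session.udf.register(
        func=score,
        name="SCORE_MODEL",
        return_type=FloatType(),
        input_types=[FloatType(), FloatType()],
        replace=True,
    )

    # Batch inference over the full table -- no export, no external cluster.
    features = session.table("CUSTOMER_FEATURES")
    scored = features.with_column(
        "SCORE", call_udf("SCORE_MODEL", col("FEATURE_1"), col("FEATURE_2"))
    )
    scored.write.save_as_table("CUSTOMER_SCORES", mode="overwrite")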

Data-Intensive Matching Application

Another client of ours had developed a complex matching application on top of MapReduce, which required complex infrastructure to operate. The application also processes sensitive customer data, which added significant overhead in managing that data in a secure and isolated fashion.

We migrated that application to Snowpark, which provides the simplicity of running entirely within the Snowflake platform. The application can now scale to meet customer demand with Snowflake compute.  

Most importantly, phData’s implementation enables the use of data sharing and Data Clean Rooms to help our client process their customers’ data without ever ingesting it into their own environment.

Conclusion

In short, Snowpark is an exciting step into the data application world, and we’re proud to partner with Snowflake on the journey. As we learn more about Snowpark, we’ll keep sharing tips and tricks to help you on yours.

Interested in More Snowpark Resources?

Browse our free collection of ungated Snowpark content today!
