February 7, 2024

How Does Snowpark Work?

By Justin Delisi

The Snowflake Data Cloud is a leading cloud data platform that provides various features and services for data storage, processing, and analysis. A new feature that Snowflake offers is called Snowpark, which provides an intuitive library for querying and processing data at scale in Snowflake. 

In this blog, we’re going to explain what exactly Snowpark is, how it works, and some use cases for Snowpark.

What is Snowpark?

Snowpark is the set of libraries and runtimes in Snowflake that securely deploy and process non-SQL code, including Python, Java, and Scala. 

On the client side, Snowpark consists of libraries, including the DataFrame API and native Snowpark machine learning (ML) APIs for model development (public preview) and deployment (private preview). 

On the server side, runtimes include Python, Java, and Scala in the warehouse model or Snowpark Container Services (public preview). In the warehouse model, developers can leverage user-defined functions (UDFs) and stored procedures (sprocs) to bring in and run custom logic.

Snowpark Container Services are available for workloads that require the use of GPUs, custom runtimes/libraries, or the hosting of long-running full-stack applications.

How Does Snowpark Work?

Snowpark operates on two main levels: client-side libraries and server-side execution. Here’s a breakdown of how it works:

Client Side

Libraries

Snowflake provides Snowpark libraries that can be installed in any Python, Java, or Scala runtime, whether that’s your local development environment, a notebook, or even a production environment or CI/CD system for deployment automation.

From the library, you import the functions and classes you need to work with Snowpark. Alternatively, for Python only, Snowpark code can be written directly in a Python worksheet in Snowsight.
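For Python, for example, the client library is published on PyPI as snowflake-snowpark-python; once installed, a couple of imports pull in the core classes and functions (exactly which ones you need depends on your workload):

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, udf, sproc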

Each library contains functions to establish a connection with Snowflake. For instance, with Python, to create a new session within Snowflake, you’ll need to:

  • Create a Python dictionary (dict) containing the names and values of the parameters for connecting to Snowflake.

  • Pass this dictionary to the Session.builder.configs method to return a builder object that has these connection parameters.

  • Call the create method of the builder to establish the session.

The following example uses a dict containing connection parameters to create a new session:

from snowflake.snowpark import Session

# Connection parameters for your Snowflake account
connection_parameters = {
    "account": "<your snowflake account>",
    "user": "<your snowflake user>",
    "password": "<your snowflake password>",
    "role": "<your snowflake role>",  # optional
    "warehouse": "<your snowflake warehouse>",  # optional
    "database": "<your snowflake database>",  # optional
    "schema": "<your snowflake schema>",  # optional
}

# Build a session from the parameters and create it
new_session = Session.builder.configs(connection_parameters).create()

DataFrames

In Snowpark, the main way in which you query and process data is through a DataFrame. A DataFrame represents a relational dataset that is evaluated lazily: it only executes when a specific action is triggered. A DataFrame is like a query that must be evaluated to retrieve data.

DataFrames can be created from tables, views, streams, and stages; from the results of a SQL query; or from hardcoded values.
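As a brief sketch using the Python API (the table name and values are illustrative, and session is the Session object created earlier):

# From a table or view
df_table = session.table("sample_product_data")

# From the result of a SQL query
df_sql = session.sql("SELECT name, serial_number FROM sample_product_data")

# From hardcoded values
df_values = session.create_dataframe(
    [[1, "widget"], [2, "gadget"]],
    schema=["id", "name"],
)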

Transformations

You apply various operations to these DataFrames, such as:

  • Filtering rows based on conditions

  • Selecting specific columns

  • Transforming values, including custom user-defined functions (UDFs)

  • Aggregating data

  • Joining a DataFrame with another DataFrame (or itself)

  • Machine learning operations (with Snowpark ML)

Each method returns a new DataFrame object that has been transformed. The method does not affect the original DataFrame object. If you want to apply multiple transformations, you can chain method calls, calling each subsequent transformation method on the new DataFrame object returned by the previous method call. 

Here’s an example of a few transformations being chained together:

from snowflake.snowpark.functions import col

# Filter to a single product and keep only two columns; nothing is
# retrieved until an action (such as show) is called
df_product_info = (
    session.table("sample_product_data")
    .filter(col("id") == 1)
    .select(col("name"), col("serial_number"))
)
df_product_info.show()

These transformation methods specify how to construct the SQL statement and do not retrieve data from the Snowflake database.

Evaluation Actions

As mentioned earlier, the DataFrame is lazily evaluated, which means the SQL statement isn’t sent to the server for execution until you perform an action. An action causes the DataFrame to be evaluated and sends the corresponding SQL statement to the server for execution.

For example, in Python, the following actions trigger an evaluation of the DataFrame, as illustrated in the sketch after the list:

  • Collect – Evaluates the DataFrame and returns the resulting dataset as a list of Row objects.

  • Count – Evaluates the DataFrame and returns the number of rows.

  • Show – Evaluates the DataFrame and prints the rows to the console. This method limits the number of rows to 10 (by default).

  • Save as table – Saves the data in the DataFrame to the specified table. 
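As a quick sketch of these actions in Python, continuing with the df_product_info DataFrame from the earlier example (the target table name is hypothetical):

rows = df_product_info.collect()    # list of Row objects
num_rows = df_product_info.count()  # number of rows
df_product_info.show()              # prints up to 10 rows by default

# Save the results to a table
df_product_info.write.save_as_table("product_info_copy", mode="overwrite")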

UDFs and Stored Procedures

With Snowpark, you can create user-defined functions (UDFs) and stored procedures for your custom lambdas and functions, and you can call them to process the data in your DataFrame.

When you use the Snowpark API to create a UDF, the Snowpark library uploads the code for your function to an internal stage. When you call the UDF, the Snowpark library executes your function on the server where the data is. As a result, the data doesn’t need to be transferred to the client in order for the function to process the data.

In your custom code, you can also call code packaged in JAR files using Java or Scala, and import modules from Python files or third-party packages if using Python.
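As an illustrative sketch in Python (the column and alias names are arbitrary), a lambda can be registered as a UDF and then applied to a DataFrame column, with the work happening server-side:

from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import IntegerType

# Register a Python lambda as a UDF; the Snowpark library uploads it to an
# internal stage and runs it where the data lives
add_one = udf(lambda x: x + 1, return_type=IntegerType(), input_types=[IntegerType()])

# Apply the UDF to a column
session.table("sample_product_data").select(add_one(col("id")).alias("id_plus_one")).show()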

Machine Learning

Training machine learning (ML) models can sometimes be resource-intensive. Snowpark-optimized warehouses are a type of Snowflake virtual warehouse that can be used for workloads that require a large amount of memory and compute resources. For example, you can use them to train an ML model using custom code on a single node.

A Python stored procedure can run nested queries to load and transform the dataset, which is then loaded into the stored procedure’s memory for pre-processing and ML training. The trained model can be uploaded to a Snowflake stage and used to create UDFs that perform inference.
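A rough, hypothetical sketch of that pattern (the table, stage, and column names are placeholders, and scikit-learn is used purely as an example library):

from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc

@sproc(name="train_model", replace=True,
       packages=["snowflake-snowpark-python", "scikit-learn", "joblib"])
def train_model(session: Session) -> str:
    import joblib
    from sklearn.linear_model import LinearRegression

    # Run queries to load the training data, then pull it into memory
    pdf = session.table("training_data").to_pandas()

    # Pre-process and train inside the stored procedure
    model = LinearRegression().fit(pdf[["feature_1", "feature_2"]], pdf["label"])

    # Upload the trained model to a stage so a UDF can load it for inference
    joblib.dump(model, "/tmp/model.joblib")
    session.file.put("/tmp/model.joblib", "@model_stage", overwrite=True)
    return "Model saved to @model_stage"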

While Snowpark-optimized warehouses can be used to execute pre-processing and training logic, it may be necessary to execute nested queries in a separate warehouse to achieve better performance and resource utilization. A separate query warehouse can be tuned and scaled independently based on the dataset size.

In addition to these features, Snowflake offers Snowpark ML, a Python library and underlying infrastructure for end-to-end ML workflows in Snowflake, including components for model development and operations. With Snowpark ML, you can use familiar Python frameworks for preprocessing, feature engineering, and training.

You can deploy and manage models entirely in Snowflake without any data movement, silos, or governance tradeoffs. Snowpark ML provides much of the high-level functionality so that you don’t have to write custom UDFs and stored procedures for common ML models and pipelines.
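A minimal sketch of what that can look like with the snowflake-ml-python package (the table and column names here are made up):

from snowflake.ml.modeling.xgboost import XGBClassifier

train_df = session.table("churn_features")

clf = XGBClassifier(
    input_cols=["TENURE", "MONTHLY_CHARGES"],  # feature columns
    label_cols=["CHURNED"],                    # target column
    output_cols=["PREDICTED_CHURN"],           # prediction column
)
clf.fit(train_df)               # training runs inside Snowflake
clf.predict(train_df).show()    # inference also runs inside Snowflake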

Server Side

Execution Plan

When you trigger a Snowpark operation, the optimized SQL code and instructions are sent to the Snowflake servers where your data resides. This eliminates unnecessary data movement, ensuring optimal performance. Snowflake spins up a virtual warehouse, which is a cluster of compute nodes, to execute the code. 

The size and configuration of the warehouse are chosen to match the complexity of the tasks and the amount of data involved, and multi-cluster warehouses can scale out automatically under load. The server-side runtime receives the code and translates it into a physical execution plan, which maps the logical operations to specific Snowflake data processing mechanisms.

The execution plan leverages Snowflake’s unique architecture to access and process data in parallel across multiple nodes. It takes advantage of Snowflake’s columnar storage and automatic data partitioning to quickly scan and filter large datasets. 

Tasks are then distributed across the available nodes in the virtual warehouse for concurrent processing, with each node executing its assigned portion of the code, operating on local data partitions for optimal performance.
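From the client side, you can preview what will be sent for execution: the Snowpark Python DataFrame exposes the generated SQL and can print the query plan (a small sketch, reusing the DataFrame from earlier):

# The SQL statements that will be sent to Snowflake when an action runs
print(df_product_info.queries)

# Print the query and its execution plan
df_product_info.explain()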

Resource Management

Snowflake dynamically allocates resources (CPU, memory, network bandwidth) to the running tasks based on their needs. To ensure efficient use of the cloud infrastructure, it automatically scales resources as required to handle large workloads. 

Results

Once all the tasks are completed, the results are compiled into a DataFrame. These results are then sent back to the client-side library for further use or display to the user.

Container Services

Snowpark Container Services is a fully managed container offering designed to facilitate the deployment, management, and scaling of containerized applications within the Snowflake ecosystem. This service enables users to run containerized workloads directly within Snowflake, ensuring that data doesn’t need to be moved out of the Snowflake environment for processing. 

Unlike general-purpose container tooling such as Docker or orchestration platforms like Kubernetes, Snowpark Container Services offers an OCI runtime execution environment specifically optimized for Snowflake. This integration allows for the seamless execution of OCI images, leveraging Snowflake’s robust data platform.

As a fully managed service, Snowpark Container Services streamlines operational tasks. It handles the intricacies of container management, including security and configuration, in line with best practices. This ensures that users can focus on developing and deploying their applications without the overhead of managing the underlying infrastructure.

Snowpark Container Services can also be used as part of a Snowflake Native App to enable developers to distribute sophisticated apps that run entirely in their end-customer’s Snowflake account. For Snowflake consumers, this means they can securely install and run cutting-edge products inside their Snowflake account in a way that protects the provider’s proprietary IP.

Snowpark Use Cases

Data Science

  • Streamlining data preparation and pre-processing: Snowpark’s Python, Java, and Scala libraries allow data scientists to use familiar tools for wrangling and cleaning data directly within Snowflake, eliminating the need for separate ETL pipelines and reducing context switching.

  • Building and deploying machine learning models: Snowpark’s machine learning APIs enable the development and deployment of ML models directly within Snowflake, leveraging its massive processing power and scalability for faster training and inference.

  • Real-time analytics and insights: Snowpark’s ability to process data at scale and integrate with streaming data sources can be used for real-time analytics, fraud detection, and anomaly identification, driving faster decision-making.

Data Engineering

  • Simplified data transformation and ETL: Snowpark’s libraries empower data engineers to write complex data transformations in familiar languages directly within Snowflake, minimizing code complexity and eliminating the need for separate transformation engines.

  • Unified platform for data integration and workflows: By combining data storage, processing, and orchestration within Snowflake, Snowpark simplifies data pipelines and reduces infrastructure complexity.

  • Building data applications and microservices: Snowpark Container Services (public preview) allows data engineers to deploy custom applications and microservices within Snowflake, creating a tightly integrated platform for data processing and delivery.

Data Analytics

  • Enhanced data exploration and visualization: Snowpark’s DataFrames and familiar APIs simplify data exploration and ad-hoc analysis, enabling analysts to quickly analyze large datasets and generate insights through interactive notebooks or BI tools.

  • Faster data sharing and collaboration: Snowpark’s integration with Snowflake’s secure environment and access control features enables secure data sharing and collaboration among analysts and stakeholders, allowing for more efficient data-driven decision-making.

  • Advanced analytics and data mining: Snowpark’s support for user-defined functions and containerized applications opens up possibilities for implementing custom analytics models and data mining algorithms within Snowflake for deeper insights.

Closing

With Snowpark, Snowflake users have a powerful new tool for querying and transforming data at scale without it ever leaving their Snowflake account. Snowpark bridges the gap between the familiar world of programming languages and the power of Snowflake’s data platform. Utilizing Snowpark can take your data engineering, analytics, or data science project to the next level.

Free Generative AI Workshop

Looking for guidance on how to best leverage Snowflake’s newest AI & ML capabilities? Attend one of our free generative AI workshops for advice, best practices, and answers.

FAQs

When is Snowpark not a good fit?

Snowpark is designed for data-intensive applications, so it may not be the best choice for applications that do not require much data processing. Additionally, Snowpark is a programming model, so some level of programming expertise is required to use it.

Is Snowpark free to use?

Yes, the Snowpark API is included with a Snowflake subscription. There are no additional charges for using Snowpark.
