March 22, 2023

Performance Benefits of Snowpark for ML Workloads

By Andrew Evans

As companies continue to adopt machine learning (ML) in their workflows, the demand for scalable and efficient tools has increased. Snowpark, an innovative technology from the Snowflake Data Cloud, promises to meet this demand by allowing data scientists to develop complex data transformation logic using familiar programming languages such as Java, Scala, and Python. 

In this blog post, we will explore the performance benefits of Snowpark for ML workloads and how it can help businesses make better use of their data.

Top Use Cases of Snowpark

With Snowpark, bringing business logic to data in the cloud couldn’t be easier. Transitioning work to Snowpark allows for faster ML deployment, easier scaling, and robust data pipeline development. Listed below are three of the top use cases we’ve experienced with our customers that harness Snowpark. 

ML Applications

For data scientists, models can be developed in Python with common machine learning tools. Running the model in production becomes as simple as registering a Snowpark UDF that wraps an inference call.
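As a minimal sketch (the session, model file, and feature names here are assumptions for illustration, not from this post), registering such a UDF can look like this:

import joblib
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType

# Pretrained model, assumed to be available locally when the UDF is registered
model = joblib.load("model.joblib.gz")

@udf(name="predict_spread", replace=True, return_type=FloatType(),
     input_types=[FloatType(), FloatType()], packages=["scikit-learn"],
     session=session)
def predict_spread(home_rating: float, away_rating: float) -> float:
    # Snowpark serializes the closure, so the loaded model ships with the UDF
    return float(model.predict([[home_rating, away_rating]])[0])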

Data-Intensive Apps

Teams building data-intensive apps can deploy their logic directly in a customer’s warehouse. Data governance and security are straightforward, and performance can automatically scale with demand.

Complex Transformations

Data engineers can maintain all of their complex transformation pipelines as code, leveraging test-driven development, CI/CD best practices, and open source libraries.
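As a minimal sketch (table and column names are illustrative, not from this post), a transformation step can be a plain Python function over a Snowpark DataFrame, which makes it straightforward to unit test and to version alongside the rest of the codebase:

from snowflake.snowpark import DataFrame
from snowflake.snowpark import functions as F

def add_point_diff(games: DataFrame) -> DataFrame:
    # A pure function over a Snowpark DataFrame: easy to unit test and run in CI/CD
    return games.with_column("POINT_DIFF", F.col("HOME_SCORE") - F.col("AWAY_SCORE"))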

Snowpark vs. Local Inference

In this example, we’ll use a pre-existing model trained to predict the point-spreads of NFL games using data from the Snowflake Marketplace, specifically from ThoughtSpot’s Fantasy Football dataset. To make things super simple, we’ll use Hex for each inference mode. 

For more details on using Hex with Snowpark, check out How to Use Snowpark with Hex for Machine Learning.

With ML inference in Snowpark, inference logic can be written natively in Python, deserializing a pretrained model from a Snowflake stage. Predictions are generated in parallel, distributed using Snowflake’s efficient logic and rapidly scaled out within a Warehouse. Bringing the model to the data, as opposed to the traditional approach of bringing the data to the model, has two key benefits.

First, data never needs to be moved out of where it is natively stored, which is both faster and more secure. Second, the performance of the inference logic scales easily and rapidly: as needs change, dynamically allocated compute resources respond.

Local Inference

For local inference, we need to download the model stored in a Snowflake stage and deserialize it. Next, we’ll collect the appropriate football table for input and run predictions within our local environment. Once the predictions are finished, we need to write them back to a table within Snowflake.

We’ll also record how long each step takes.

from datetime import datetime

import joblib
import pandas as pd

t_start = datetime.now()

# Pull the input table out of Snowflake into a local pandas DataFrame
pandas_df = hex_snowpark_session.sql("""
       SELECT
       …several columns…
       FROM user_db.nfl.nfl_data_large
       """).to_pandas()
t_collect = datetime.now()

# Deserialize the pretrained XGBoost model (local path assumed; downloaded from the stage)
model_xgb = joblib.load("model.joblib.gz")
t_deserialize = datetime.now()

# Run predictions locally and attach them to the DataFrame
prediction = model_xgb.predict(pandas_df)
pandas_df["PREDICTION"] = prediction
t_predict = datetime.now()

# Write the predictions back to a Snowflake table
hex_snowpark_session.sql("USE SCHEMA USER_DB.NFL").collect()
hex_snowpark_session.write_pandas(pandas_df,
                                  table_name="predictions",
                                  database="USER_DB",
                                  schema="NFL",
                                  auto_create_table=True)
t_write = datetime.now()

# Elapsed time for each step
times = {
    '1. Collect': (t_collect - t_start).total_seconds(),
    '2. Deserialize': (t_deserialize - t_collect).total_seconds(),
    '3. Predict': (t_predict - t_deserialize).total_seconds(),
    '4. Write': (t_write - t_predict).total_seconds(),
    '5. Total': (t_write - t_start).total_seconds()
}

Snowpark

Instead of going through all of those steps to run inference locally (or on any remote compute, for that matter), which can be challenging to optimize, we can bring our model to the data with a Snowpark UDF. We wrap all of our Python execution steps into a function call that operates on rows of our table, convert it to a UDF, and upload a serialized version of the function to execute in a Snowflake-hosted Python sandbox. Simple.
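One note: the UDF below calls a read_file helper that isn’t shown here. A common pattern, sketched below as an assumption (and the likely reason cachetools appears in the package list), caches the deserialized model and reads it from the UDF’s import directory:

import os
import sys

import cachetools
import joblib

@cachetools.cached(cache={})
def read_file(filename):
    # Snowflake exposes the directory containing a UDF's imports via this option
    import_dir = sys._xoptions.get("snowflake_import_directory")
    with open(os.path.join(import_dir, filename), "rb") as f:
        return joblib.load(f)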

				
import pandas as pd
from snowflake.snowpark import functions as F
from snowflake.snowpark.types import FloatType, PandasSeriesType


def predict_xgb_nfl_pandas(data: pd.DataFrame) -> pd.Series:
    import xgboost
    import pandas as pd
    import sys
    import os

    # read_file deserializes the staged model file shipped with the UDF's imports
    model = read_file("model.joblib.gz")
    prediction = pd.Series(model.predict(data))
    return prediction


predict_xgb_nfl_pandas = F.pandas_udf(
    predict_xgb_nfl_pandas,
    name="predict_xgb_nfl_pandas",
    replace=True,
    is_permanent=True,
    stage_location="@model_stage",
    input_types=inference_types,  # Snowpark types for the feature columns, defined elsewhere
    max_batch_size=10000,
    return_type=PandasSeriesType(FloatType()),
    session=hex_snowpark_session,
    imports=["@model_stage/model.joblib.gz"],  # staged model file (path assumed)
    packages=["xgboost==1.5.0", "cachetools"]
)
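Once registered, the UDF can be applied to a Snowpark DataFrame without the data ever leaving Snowflake. A hedged usage sketch (table name and column handling are illustrative):

df = hex_snowpark_session.table("USER_DB.NFL.NFL_DATA_LARGE")

# Apply the vectorized UDF across the feature columns and persist the results
predictions = df.with_column("PREDICTION", predict_xgb_nfl_pandas(*df.columns))
predictions.write.save_as_table("USER_DB.NFL.PREDICTIONS", mode="overwrite")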

In the chart below, 750,000 predictions are produced with our XGBoost model. In the local case, time is spent copying the data over, predicting with statically allocated compute resources, and then writing the predictions back to the Warehouse.

In the Snowpark case, every step is performed within the Warehouse, and execution is far faster. If our data volume grows even larger, we can simply scale out the Warehouse it runs on.

Chart: elapsed time to produce 750,000 predictions, local inference vs. Snowpark
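Scaling out is itself a one-line operation. As a hedged example (the warehouse name is illustrative):

# Resize the warehouse that executes the UDF; Snowflake reallocates compute on demand
hex_snowpark_session.sql("ALTER WAREHOUSE inference_wh SET WAREHOUSE_SIZE = 'LARGE'").collect()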

Conclusion

Snowpark is simple to use and efficient, with auto-scaling capacity. It enables ML teams to deploy native code that runs where their data lives.

Want to learn more? We’re hitting the road with Snowflake and giving hands-on labs around the US this spring of 2023. Stay tuned to phData’s LinkedIn for more updates.

Can’t wait? Check out these blogs and reach out to our Data Science and ML team today!
