October 14, 2022

How to Train ML Models Using Snowpark for Python

By Andrew Evans

At the 2022 Snowflake Summit, Snowpark for Python was officially released to the public. For those unfamiliar with Snowpark, it enables developers to deploy machine learning in a serverless manner to Snowflake’s virtual warehouse compute engine. 

As an Elite Snowflake partner, our machine learning team got early access to this feature and has since spent a lot of time working with and training ML models using Snowpark for Python.

In this blog, we’ll discuss a few of our learnings, especially around design patterns for ML training and inference in Snowpark. Then we’ll dive a bit deeper into how to train and forecast many time-series models at once on CPG data with Snowpark Python UDFs.

Model Training and Inference with Snowpark for Python

Snowpark for Python combines the full power of SQL running in the Snowflake Data Cloud with many commonly used Python libraries. Once a model has been trained, it can be wrapped in a UDF and used to run inference in Snowpark.

This works perfectly for running large batches of inference quickly within Snowflake. Snowpark distributes the rows of data among many worker nodes and runs inference efficiently in parallel.
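For instance, a model trained locally can be captured in a UDF and then used to score rows at scale. Below is a minimal sketch of that pattern; the table and column names (TRAINING_DATA, NEW_DATA, FEATURE_1, FEATURE_2, TARGET) and the scikit-learn model are hypothetical stand-ins for illustration.

from sklearn.linear_model import LinearRegression
from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import FloatType

# Train a simple model locally on data pulled from Snowflake
# (hypothetical table and column names)
train_pd = session.table("TRAINING_DATA").to_pandas()
model = LinearRegression().fit(train_pd[["FEATURE_1", "FEATURE_2"]].values, train_pd["TARGET"].values)

# Make scikit-learn available inside the UDF's execution environment
session.add_packages("scikit-learn")

@udf(name="predict_target", input_types=[FloatType(), FloatType()], return_type=FloatType(), is_permanent=False, replace=True)
def predict_target(f1: float, f2: float) -> float:
    # The fitted model is pickled with the UDF and shipped to the warehouse
    return float(model.predict([[f1, f2]])[0])

# Scoring runs in parallel across the warehouse's worker nodes
scored = session.table("NEW_DATA").with_column("PREDICTION", predict_target(col("FEATURE_1"), col("FEATURE_2")))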

What about training a model with a UDF? Whether you’re running k-fold cross-validation, comparing many architectures at once on the same dataset, or (as we’ll get into) forecasting many stores and product lines at once, training and inference in Snowpark are pretty straightforward.

Ephemeral Models for Multiple Time-Series Forecasts

Consumer Packaged Goods (CPG) forecasting is a crucial business function. A critical consideration in CPG forecasting is the level of forecast granularity. You may need a sales forecast at both a regional and a store level, for many products. 

There needs to be a balance between the level of detail required to make good decisions and the amount of data available at that granularity.

As a rule of thumb, it’s more accurate to have several specific forecasts (one for each store and product) than to aggregate up to the bigger picture (sales for all stores). In addition, the most accurate forecast will always be based on the most recently collected data.

Transforming the Data

With Snowpark, we can quickly train and forecast for every store and product, every day. The time series for each product-store pair can be aggregated into one row using the ARRAY_AGG function.

Each array value is a string concatenation of a date and a sales amount, and the UDF will return the same format, which we can then FLATTEN and SPLIT back into rows. This isn’t the most sophisticated approach, but it’s fairly intuitive.
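As a rough sketch, the aggregation could look like this in Snowpark, assuming a hypothetical DAILY_SALES table with STORE_ID, PRODUCT_ID, SALE_DATE, and SALES columns:

from snowflake.snowpark.functions import array_agg, col, concat, lit
from snowflake.snowpark.types import StringType

# Hypothetical source table: one row per store, product, and day
sales = session.table("DAILY_SALES")

# Concatenate each date and sales amount into a "date_value" string,
# then collapse each product-store time series into a single array
ds_y = (
    sales.with_column(
        "DS_Y_STR",
        concat(col("SALE_DATE").cast(StringType()), lit("_"), col("SALES").cast(StringType())),
    )
    .group_by("STORE_ID", "PRODUCT_ID")
    .agg(array_agg(col("DS_Y_STR")).alias("DS_Y"))
)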

[Diagram: many raw rows of data condensed into far fewer aggregated rows, illustrating the transformation in Snowpark]

Defining the UDF

User-defined functions are compiled by Snowpark and can use any package available in Snowpark’s Conda repository. For this example, we need to register the two Python packages that will be used in our UDF: Pandas and Prophet.

For a deeper look at UDFs, check out this blog.

session.add_packages("pandas", "prophet")

Now we’ll define our UDF.

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import ArrayType, StringType
import pandas as pd
from prophet import Prophet

@udf(name='prophet_fit', input_types=[ArrayType(StringType())], return_type=ArrayType(StringType()), is_permanent=False, replace=True)
def prophet_fit(ds_y: list) -> list:
    # Construct a dataframe from an array of "date_value" strings
    df = pd.DataFrame({'ds_y': ds_y}).ds_y.str.split('_', expand=True)
    df.columns = ['ds', 'y']
    df.ds = pd.to_datetime(df.ds)
    df.y = df.y.astype(int)

    # Enable daily seasonality since we are dealing with daily data
    m = Prophet(daily_seasonality=True)
    # Fit the model to this product-store's history
    m.fit(df)

    # Build a future-facing dataframe covering the next 90 days
    future = m.make_future_dataframe(periods=90, freq='d')

    forecast = m.predict(future)

    # Return the forecast in the same "date_value" string format
    return (forecast.ds.astype(str) + "_" + forecast.yhat.astype(str)).tolist()

Notice that within the UDF, we are just using Pandas and Prophet. We build the dataframe from the string array and then fit the model. Since we want to forecast another 90 days into the future, we can easily build the future dataframe with a Prophet utility and then return a longer list than the one we received.
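Continuing with the hypothetical names from the aggregation sketch above, calling the UDF and unpacking its output might look something like this, where FLATTEN explodes each 90-day forecast array back into one row per day and SPLIT separates the date from the predicted value:

from snowflake.snowpark.functions import call_udf, col, split, lit
from snowflake.snowpark.types import StringType

# One UDF call per product-store pair; each returns a forecast array
forecasts = ds_y.with_column("FORECAST", call_udf("prophet_fit", col("DS_Y")))

# Explode the arrays into rows, then split the "date_value" strings
flat = forecasts.flatten(col("FORECAST"))
result = flat.select(
    col("STORE_ID"),
    col("PRODUCT_ID"),
    split(col("VALUE").cast(StringType()), lit("_"))[0].cast(StringType()).alias("DS"),
    split(col("VALUE").cast(StringType()), lit("_"))[1].cast(StringType()).alias("YHAT"),
)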

Conclusion

In summary, we can use Snowpark for Python UDFs to train ML models and generate forecasts, allowing us to quickly and efficiently produce forecasts across many stores and product lines.

For more discussion of CPG forecasting and an end-to-end example, check out our blog and this Jupyter notebook.

Want to learn more about how this can be used for your business? Contact the phData Data Science team today for questions, help, and answers.  
