September 23, 2021

Is Snowflake Good for Machine Learning?

By Dominick Rocco

You may be wondering whether the Snowflake Data Cloud will help your organization with its machine learning (ML) initiatives. The short answer is a resounding “yes!” To understand why, it helps to recognize that most ML applications follow a common lifecycle. Snowflake is in fact great for ML because it enables every phase of that lifecycle.

The ML lifecycle includes four phases: discovery, training, deployment, and monitoring. The first two phases are well supported by the Snowflake UI, SnowSQL, and the Snowflake Connector for Python. The latter two are heavily supported by Snowpark and user-defined functions (UDFs). In the rest of this post, we’ll explain how these tools contribute to each phase of the ML lifecycle.

[Figure: a circular graphic that walks through the machine learning lifecycle]

Discovery

The first step in developing any ML model is data discovery. In this phase, data scientists must gather all available data relevant to the ML application at hand. This can be a major challenge, but an enterprise data warehouse on Snowflake makes it much easier: if all of your data is already in Snowflake, gathering it becomes trivial. Centralizing data in an enterprise data warehouse with common access patterns is an important step in enabling data science and machine learning.

Once data has been gathered, data scientists will perform exploratory data analysis and data profiling to better understand the quality and value of that data. Based on the output of this analysis, data scientists might also define new features as columns in new tables. Ad-hoc analysis and feature engineering can easily be accomplished using the Snowflake UI or SnowSQL.

When more sophisticated statistical methods are necessary for data analysis or profiling, the Snowflake Connector for Python works great for extracting data to an environment where the most popular Python data science tools are available.
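As a rough sketch, assuming placeholder credentials and a hypothetical CUSTOMER_FEATURES table, pulling a sample into pandas for profiling might look like this:

```python
import snowflake.connector

# Placeholder connection details; in practice, pull these from a secrets manager.
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="ANALYTICS_WH",
    database="ML_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Pull a sample of a hypothetical feature table into pandas for profiling.
cur.execute("SELECT * FROM CUSTOMER_FEATURES LIMIT 100000")
df = cur.fetch_pandas_all()  # requires: pip install "snowflake-connector-python[pandas]"
print(df.describe(include="all"))  # quick per-column statistical profile
```

The later snippets in this post reuse this connection and cursor rather than reconnecting each time.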

Training

When it comes to model training, the most important thing Snowflake provides is access to data – and lots of it! If your organization has lots of data, Snowflake can house it all. In addition to your own data, Snowflake can also give you access to external data through its Data Marketplace. For instance, if your ML project would benefit from incorporating census or weather data, you can acquire that data in the Data Marketplace and incorporate it right into your Snowflake account. The best part of this process is that Snowflake does the heavy lifting: shared data is available to query immediately, with no pipelines to build for transferring or transforming it.
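Once a listing has been added to your account, it appears as a read-only database you can query like any other. Here’s a hypothetical sketch (reusing the cursor from the snippet above) that joins your own sales table against a weather share mounted as WEATHER_DB; all of the object and column names are assumptions:

```python
# Hypothetical join of first-party sales data against a Data Marketplace
# weather share -- no copies or pipelines required.
cur.execute("""
    SELECT s.store_id,
           s.sale_date,
           s.units_sold,
           w.precipitation_mm
    FROM   ML_DB.PUBLIC.DAILY_SALES s
    JOIN   WEATHER_DB.PUBLIC.DAILY_WEATHER w
      ON   s.sale_date = w.observation_date
     AND   s.zip_code  = w.zip_code
""")
sales_weather = cur.fetch_pandas_all()
```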

Reliably training and maintaining ML models also requires the training process to be reproducible, and training data that has since changed or been deleted is a common obstacle to reproducibility. Here, Snowflake’s time travel feature can be very handy. Time travel won’t support every use case due to its limited retention period, but for early prototyping and proof-of-concept projects, it can save a lot of headaches.
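For example, a training query can be pinned to the state of a table at a particular moment, so a model can be retrained on exactly the data it originally saw. A sketch, with an illustrative table name and timestamp:

```python
# Re-create the training set as it existed at a past point in time.
# Works only within the table's configured time travel retention period.
cur.execute("""
    SELECT *
    FROM TRAINING_DATA
    AT(TIMESTAMP => '2021-09-01 12:00:00'::TIMESTAMP_LTZ)
""")
training_df = cur.fetch_pandas_all()
```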

To see some examples of models that can be trained on data in Snowflake, check out our articles on generating document vectors using word embeddings and demand forecasting.

Deployment

Snowflake’s support for deploying ML models has greatly improved with the release of Snowpark and Java UDFs. UDFs are Java (or Scala) functions that take Snowflake data as input and produce an output value based on custom logic. Since Java and Scala support arbitrary logic and program flow, this opens the door to a wide array of functionality. Snowpark is still in public preview, so some features are still under development, but the potential is substantial. If you’re familiar with Spark, check out our Spark Developer’s Guide to Snowpark to understand the similarities and differences.

The difference between UDFs and Snowpark is somewhat subtle. Snowpark provides a mechanism for working with Snowflake tables from Java or Scala, performing SQL-like operations on them. A UDF, by contrast, is a function that operates on a single row of a Snowflake table to produce an output. Snowpark of course integrates with UDFs, so the two tools can be used together.

For ML, UDFs provide a mechanism to encapsulate models for deployment using Java or Scala libraries. Another powerful option is to use portable formats such as PMML to deploy models trained in other languages. And if pre- or post-processing is necessary to support ML deployments, both UDFs and Snowpark are great options for transforming data.
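As a rough illustration of the UDF route, the snippet below registers a hypothetical Java UDF from a model JAR uploaded to a stage, then scores rows with plain SQL. The stage, class, function, and column names are all assumptions, not a prescribed layout:

```python
# Register a Java UDF whose JAR (already uploaded to @model_stage) wraps a
# trained model; the handler points at a static Java method.
cur.execute("""
    CREATE OR REPLACE FUNCTION predict_churn(tenure FLOAT, monthly_spend FLOAT)
    RETURNS FLOAT
    LANGUAGE JAVA
    IMPORTS = ('@model_stage/churn_model.jar')
    HANDLER = 'com.example.ChurnModel.predict'
""")

# Once registered, the model scores rows like any other SQL function.
cur.execute("""
    SELECT customer_id,
           predict_churn(tenure, monthly_spend) AS churn_score
    FROM   CUSTOMERS
""")
```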

Monitoring

Writing ML predictions back to Snowflake makes it easy to follow up and close the loop on the ML lifecycle. Snowflake’s scheduled tasks provide a valuable orchestration tool for monitoring ML predictions. By scheduling tasks that leverage UDFs, or by building processes with Snowpark, you can even monitor for complex issues like data drift.
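One hedged sketch of what that might look like: a task that snapshots summary statistics of recent predictions once a day, so shifts in the score distribution can be detected later. The object names, schedule, and statistics are placeholders:

```python
# Create a daily task that appends summary statistics of recent predictions,
# building up a time series in PREDICTION_STATS that can be checked for drift.
cur.execute("""
    CREATE OR REPLACE TASK monitor_predictions
      WAREHOUSE = ANALYTICS_WH
      SCHEDULE  = 'USING CRON 0 6 * * * UTC'
    AS
      INSERT INTO PREDICTION_STATS
      SELECT CURRENT_TIMESTAMP(),
             AVG(churn_score),
             STDDEV(churn_score)
      FROM   PREDICTIONS
      WHERE  predicted_at >= DATEADD('day', -1, CURRENT_TIMESTAMP())
""")
cur.execute("ALTER TASK monitor_predictions RESUME")  # tasks are created suspended
```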

When issues are detected, any analyst or data scientist can use the Snowflake UI to dig deeper and understand what is going on. Dashboards based on ML predictions can also be created using the Snowflake Connector or integrations with popular BI tools like Tableau.

Conclusion

As you can see, Snowflake is definitely good for machine learning. With the recent release of Snowpark and Java UDFs, Snowflake is especially powerful because it supports the entire ML lifecycle. If you have data in Snowflake and use cases in mind for ML or MLOps, but aren’t sure where to start, get in touch with us at phData – we’re here to help!
