This solution was co-developed by Zac Zinda, Jennifer Prusa, and Naveen Alan Thomas.
Accurate short-term price forecasting can improve trading performance, provide a competitive advantage, and ultimately increase profitability. While third-party forecasting services exist, building the capability in-house is often preferable because it reduces both costs and dependence on an external provider.
In this blog, we’ll show you how to build a robust energy price forecasting solution within the Snowflake Data Cloud ecosystem. We’ll cover how to get the data via the Snowflake Marketplace, how to apply machine learning with Snowpark, and how to bring it all together into an automated pipeline that forecasts energy prices.
Why is Snowflake an Ideal Platform for Energy Price Forecasting?
The title of this blog states “end-to-end” with Snowflake, and we genuinely mean it. From the original data sources to the machine learning pipelines and all the way to the front-end application that an analyst could interact with, every component of this project uses native Snowflake features.
Using one platform simplifies a project and reduces the time needed to produce a data product.
Let’s break down each of the components in the remainder of this blog.
Building an End-to-End Energy Price Forecasting Solution with Snowflake
Accessing Training Data through the Snowflake Marketplace
The Snowflake Marketplace, part of the Snowflake Data Cloud, allows users to discover, try, buy, and integrate third-party data and solutions directly into their Snowflake environment.
In this solution, we utilize a data share provided by Yes Energy (a company that collects, curates, and sells data from power markets around the world) to train our forecasting model.
The dataset used in this solution is specific to Texas and contains energy prices and weather, as well as actual and forecasted energy demand, solar generation, and wind generation figures provided by ERCOT (the Electric Reliability Council of Texas, which operates Texas’s power grid).
The Snowflake Marketplace allows us to connect to this data with the click of a button, without setting up data ingestion or engineering pipelines. While this solution utilizes a single data source, the Marketplace has additional datasets that could be useful depending on your needs, including weather information from providers like AccuWeather and Weather Source, mobility data from Mapbox, and datasets on oil, gas, and hydrogen from Rystad Energy.
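To make this concrete, here is a minimal sketch of reading the shared data once the listing has been mounted, using Snowpark for Python (covered in more detail in the next section). The connection parameters, database name, and table name below are placeholders, not the actual names from the Yes Energy listing.

from snowflake.snowpark import Session

# Placeholder connection details; in practice these would come from a config file or secrets manager.
connection_parameters = {
    "account": "<your_account>",
    "user": "<your_user>",
    "password": "<your_password>",
    "role": "SYSADMIN",
    "warehouse": "ENERGY_WH_XS",
    "database": "YES_ENERGY_SHARE",  # whatever name you gave the mounted listing
    "schema": "PUBLIC",
}

session = Session.builder.configs(connection_parameters).create()

# The shared tables behave like any other Snowflake table -- no ingestion pipeline required.
ercot_df = session.table("ERCOT_PRICES_AND_WEATHER")  # placeholder table name
ercot_df.limit(10).show()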
For more information on the Snowflake Marketplace, this blog provides details on monetizing your data as a data provider (the other side of the exchange from what we are doing in this solution, but useful context even as a data consumer).
Applying Machine Learning with Snowpark
Now that we have our data from the Snowflake Marketplace, it’s time to leverage Snowpark to apply machine learning.
Python has long been the favorite programming language of data scientists. Historically, though, Python was only supported via a connector, so making predictions on our energy data with a Python model would have required moving data out of our Snowflake environment.
With the release of Snowpark for Python in 2022, it is now possible to run Python within Snowflake, where the data lives, and to use third-party Python libraries. For example, this solution uses pandas, NumPy, scikit-learn, XGBoost, and Optuna to process data, train a machine learning model, perform hyperparameter tuning, score a set of testing data, and calculate error metrics.
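As a rough sketch of what that looks like in practice, the modeling step might pull the features into pandas, tune an XGBoost regressor with Optuna, and report an error metric. The ENERGY_TRAINING table and PRICE column are illustrative names, and the session object comes from the Marketplace snippet above.

import optuna
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Pull the training data into pandas; table and column names are placeholders.
train_pdf = session.table("ENERGY_TRAINING").to_pandas()
X = train_pdf.drop(columns=["PRICE"])
y = train_pdf["PRICE"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, shuffle=False)

def objective(trial):
    # Optuna proposes hyperparameters; each candidate is scored on the hold-out set.
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = xgb.XGBRegressor(**params).fit(X_train, y_train)
    return mean_absolute_error(y_valid, model.predict(X_valid))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)

# Refit with the best hyperparameters and report the validation error.
best_model = xgb.XGBRegressor(**study.best_params).fit(X_train, y_train)
print("Validation MAE:", mean_absolute_error(y_valid, best_model.predict(X_valid)))

Wrapping this logic in a Snowpark stored procedure lets the training and tuning run entirely inside Snowflake rather than on a local machine.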
Snowpark allows you to take your complex, custom logic written in Python, Java, or Scala and deploy it as user-defined functions or stored procedures within your Snowflake environment. For a short demo on Snowpark, be sure to check out the video below.
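As a sketch of the deployment side, the trained model can be registered as a permanent UDF so SQL (and, later, the dynamic table) can call it. The function name UDF_PREDICT_PRICE matches the one used later in this post, but the two-feature signature, package list, and the @ML_MODELS stage are simplified placeholders.

from snowflake.snowpark.types import FloatType

def predict_price(feature_1: float, feature_2: float) -> float:
    # best_model is captured from the training step above and serialized with the UDF.
    import numpy as np
    return float(best_model.predict(np.array([[feature_1, feature_2]]))[0])

session.udf.register(
    func=predict_price,
    name="UDF_PREDICT_PRICE",
    return_type=FloatType(),
    input_types=[FloatType(), FloatType()],
    packages=["xgboost", "numpy"],
    stage_location="@ML_MODELS",  # placeholder stage for the packaged function
    is_permanent=True,
    replace=True,
)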
Utilizing Streamlit as a Front-End
At this point, we have all of our data processing, model training, inference, and model evaluation steps set up with Snowpark. Let’s continue by creating a front-end that lets analysts interact with the model.
Streamlit, an open-source Python package for building web apps, has grown in popularity since its launch in 2019. Where Streamlit shines is in creating interactive applications, rather than typical business intelligence dashboards and reporting.
It makes sense for use cases where you’d like to take some sort of user input, run Python code in the background, and produce an output, for example, interacting with a machine learning model and writing model outputs to a database.
With Snowflake’s acquisition of Streamlit in 2022, Streamlit applications can now be hosted within your Snowflake environment, eliminating the Docker, Kubernetes, cloud platform (AWS, GCP, or Azure), and authentication and authorization expertise that was previously needed to put an app into production.
It is also pure Python, which eliminates the need for expertise in front-end frameworks and enables the use of Pandas DataFrames and visualization packages like Plotly, Altair, and Bokeh. All of this allows data scientists to produce applications in hours or days versus the weeks or months that would typically be needed.
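To give a sense of what such an app might look like, below is a hedged sketch of a Streamlit in Snowflake page that takes a user input, calls the model UDF on staged feature data, and plots the result. The widget choices, table name, and the shortened feature list are all illustrative.

# streamlit_app.py
import streamlit as st
from snowflake.snowpark.context import get_active_session

st.title("Energy Price Forecasting")

# When the app runs inside Snowflake, an authenticated session is already available.
session = get_active_session()

rows = st.slider("Rows of staged data to score", min_value=24, max_value=720, value=168)

if st.button("Generate forecast"):
    # FEATURE_1 / FEATURE_2 stand in for the full feature list.
    preds = session.sql(
        f"""
        SELECT KEY, UDF_PREDICT_PRICE(FEATURE_1, FEATURE_2) AS PREDICTION
        FROM ENERGY_STAGING
        LIMIT {rows}
        """
    ).to_pandas()
    st.line_chart(preds.set_index("KEY")["PREDICTION"])
    st.dataframe(preds)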
To see how Streamlit can be used to create an ML model that helps forecast energy prices, check out this helpful demo below.
Additionally, our team put together a fun sports-themed blog that demonstrates what’s possible with Streamlit. This blog is especially popular around March Madness.
Deploying ML Models via Dynamic Tables
The final step of this solution is to take the ML model we created via our Streamlit application and automate it so that new predictions are generated by the model whenever new data is available.
Snowflake Dynamic Tables are a new(ish) table type that enables building and managing data pipelines with simple SQL statements. Dynamic tables are automatically refreshed as the underlying data changes, only operating on new changes since the last refresh. While this is beneficial for data engineers, data scientists can use it to perform inference on new rows of data.
Below is example code that can create a dynamic table containing predictions generated by our machine learning model, which we have deployed as a user-defined function.
Every five minutes, Snowflake checks for changes in the source table ENERGY_STAGING, and if there are new or modified rows, our model will score those rows and insert or update them in the target table accordingly. Unchanged rows will not be rescored by our model.
CREATE OR REPLACE DYNAMIC TABLE ENERGY_FORECASTING_TARGET
  TARGET_LAG = '5 minutes'
  WAREHOUSE = ENERGY_WH_XS
AS
SELECT
    SOURCE.KEY
    ,UDF_PREDICT_PRICE(
        SOURCE.FEATURE_1,
        SOURCE.FEATURE_2,
        ...
        SOURCE.FEATURE_N
    ) AS PREDICTION
FROM ENERGY_STAGING AS SOURCE;
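Once the dynamic table is refreshing on its own, downstream consumers such as the Streamlit app only need to read from it. Here is a short, illustrative Snowpark snippet, assuming KEY is an orderable, timestamp-like column:

from snowflake.snowpark.functions import col

# Read the most recent predictions from the continuously refreshed table.
latest = (
    session.table("ENERGY_FORECASTING_TARGET")
    .sort(col("KEY").desc())
    .limit(100)
    .to_pandas()
)
print(latest.head())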
Conclusion
This example of an end-to-end data product showcases the evolution of the Snowflake Data Cloud. What was once a SQL-based data warehousing tool is now so much more. Snowpark supports Python, Scala, and Java, opening the platform up for developing and productionizing machine learning models. Streamlit provides a front-end for interacting with your data.
The marketplace serves as a source of third-party data to supplement your internal datasets. Dynamic tables simplify change data capture and pipeline building.
In other words, Snowflake is not just a powerful platform for data engineers, but for data scientists and analysts as well.
If you need additional help or are curious about leveraging Snowflake as a data science platform or front end, contact our team of Snowflake experts today for help, guidance, and best practices!