Putting the T for Transformation in ELT (or ETL) is essential to any data pipeline. After extracting and loading your data into the Snowflake AI Data Cloud, you may wonder how best to transform it. Luckily, Snowflake answers this question with many features designed to transform your data for all your analytic use cases.
In this blog, we’ll explain Snowflake’s best features for transforming your data, how you can use them, and some third-party applications that utilize these features in a standardized way.
What are Snowflake’s Best Features for Data Transformation?
Views
Views are the simplest way to transform data in Snowflake. They let you create virtual tables from the results of a SQL query. But Snowflake doesn’t stop there: materialized views take it further by storing the query results, making reads faster. The trade-off is that Snowflake automatically refreshes a materialized view whenever the underlying data changes, which consumes additional compute credits.
Here is an example of a view with a simple calculated column:
CREATE VIEW summary_view AS
SELECT category, SUM(sales_amount) AS total_sales
FROM sales_data
GROUP BY category;
A view can be as complex as you need it to be, as it is simply a SELECT statement that can easily be run over and over. It is important to note that each time the view is referenced, its query runs using compute resources. If the data in a view is referenced frequently but doesn’t change often, it may be more advantageous to store the transformed data in a table or a materialized view.
However, views have some limitations. Most importantly, if the underlying tables of a view change, the view definition isn’t updated to match. This can cause views to fail and potentially require a lot of work to fix if they’re complex. A view’s definition also can’t be altered in place, so if changes are needed, the view must be recreated with CREATE OR REPLACE VIEW.
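A materialized view uses nearly identical syntax. Keep in mind that Snowflake limits materialized views to a single table (no joins) and requires Enterprise Edition or higher; this sketch aggregates the same hypothetical sales_data table as above:
CREATE MATERIALIZED VIEW summary_mv AS
SELECT category, SUM(sales_amount) AS total_sales
FROM sales_data
GROUP BY category;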
Stored Procedures
In any data warehousing solution, stored procedures encapsulate SQL logic into repeatable routines, but Snowflake has some tricks up its sleeve. In Snowflake, stored procedures can be written not only in SQL but also in JavaScript, Python, Java, and Scala (the latter three via the Snowpark API). With all these languages at your disposal, you can use stored procedures in Snowflake to transform your data in any way you need.
Here’s a very basic example of what transformations you can do with a JavaScript procedure:
CREATE OR REPLACE PROCEDURE transform_data()
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
AS
$$
var result = "";
try {
    // Create the target table
    snowflake.execute({ sqlText: `
        CREATE OR REPLACE TABLE transformed_data (
            customer_id VARCHAR,
            order_date DATE,
            order_year INT,
            order_month INT,
            order_amount DECIMAL(10,2),
            order_total DECIMAL(10,2)
        )
    ` });
    // Insert the transformed data, deriving date parts and a 5% surcharge total
    snowflake.execute({ sqlText: `
        INSERT INTO transformed_data
        SELECT
            customer_id,
            order_date,
            YEAR(order_date) AS order_year,
            MONTH(order_date) AS order_month,
            order_amount,
            order_amount * 1.05 AS order_total
        FROM raw_data
    ` });
    result = "Data transformed successfully";
} catch (err) {
    result = "Error: " + err;
}
return result;
$$;
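Once the procedure exists, you can run it on demand with a simple CALL:
CALL transform_data();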
Tasks
Paired with stored procedures, Snowflake Tasks can be a major advantage for performing data transformations. Tasks execute SQL statements, including stored procedure calls, at regular intervals or based on triggers you define (for example, after another task completes or when a stream has new data). With tasks and procedures, you can potentially run all of your data pipeline transformations within Snowflake without any other transformation tool.
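As a sketch, here is how you might schedule the procedure from the previous section to run every hour (the task and warehouse names are placeholders):
CREATE OR REPLACE TASK transform_task
WAREHOUSE = my_warehouse
SCHEDULE = '60 MINUTE'
AS
CALL transform_data();

-- Tasks are created suspended, so resume the task to start the schedule
ALTER TASK transform_task RESUME;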
Performing transformations within Snowflake can offer several distinct advantages, including:
Reduced Data Movement: The data never has to leave Snowflake, avoiding the latency and security risks that come with moving it between systems.
Data Transformation Flexibility: Engineers can build any necessary transformation using the full range of SQL queries and features Snowflake supports.
Cost Optimization: You eliminate licensing and infrastructure costs for separate transformation tooling.
Snowpark
As mentioned earlier, Snowpark is a feature in Snowflake that enables you to write custom data processing and transformation logic within the Snowflake AI Data Cloud. Developers can write procedures using Scala, Java, or Python, utilizing all the libraries and frameworks available in each language.
This enables much more complex and advanced transformations compared to traditional SQL-based approaches. Snowpark uses native APIs to interact with the data, seamlessly integrating Snowflake’s storage and computing capabilities. Additionally, Snowpark is designed to efficiently process large data sets using distributed computing over multiple compute resources within Snowflake.
Just a few of the ways you can use Snowpark for data transformations include:
Utilizing Pandas DataFrames: Snowpark allows seamless conversion between its own DataFrames and Pandas DataFrames. This lets you leverage Pandas’s extensive data cleaning and manipulation capabilities within your Snowpark environment.
User-Defined Functions (UDFs): You can create custom Python functions to perform specific transformations on your data. These UDFs can be temporary for one-time use or permanent for reuse across various transformations (see the sketch after this list).
Semi-Structured Data Handling: Snowpark can work with JSON and other semi-structured data formats. Functions like FLATTEN help you efficiently transform this data into a tabular format suitable for further analysis.
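As a minimal sketch of the UDF approach (the table and column names here are hypothetical, and an existing Snowpark session is assumed), a Python function can be registered and applied directly to a DataFrame column:
from snowflake.snowpark.functions import col, udf
from snowflake.snowpark.types import StringType

# Register a simple cleanup function as a temporary UDF
# (assumes `session` is an existing Snowpark Session)
clean_text = udf(
    lambda s: s.strip().lower() if s is not None else None,
    return_type=StringType(),
    input_types=[StringType()],
)

# Apply the UDF to a hypothetical raw_customers table
df = session.table("raw_customers")
df.select(clean_text(col("customer_name")).alias("customer_name_clean")).show()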
Working with Snowpark DataFrames in Python will feel very familiar to those with experience using pandas. Here is a sketch of joining two hypothetical DataFrames and performing some transformations in Snowpark:
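from snowflake.snowpark.functions import col, year
from snowflake.snowpark.functions import sum as sum_

# Assumes `session` is an existing Snowpark Session; the orders and
# customers tables are hypothetical
orders = session.table("orders")
customers = session.table("customers")

# Join orders to customers, then aggregate order totals by customer and year
joined = orders.join(customers, orders["customer_id"] == customers["id"])

result = (
    joined
    .with_column("order_year", year(col("order_date")))
    .group_by("customer_id", "order_year")
    .agg(sum_(col("order_amount")).alias("total_amount"))
)

# Persist the transformed result as a table
result.write.save_as_table("customer_yearly_totals", mode="overwrite")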
COPY INTO Transformations
Snowflake also allows you to perform limited transformations on data before it’s even loaded into Snowflake, using the COPY INTO command. These transformations work for both bulk loading and Snowpipe. The transformations that can be done while copying data into a table include the following (see the sketch after this list):
Column reordering
Column omission
Casting
Trimming spaces
Enforcing length
Truncating columns
NULLIF()
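Here is a sketch that combines several of these during a CSV load (the stage name, table, and file column positions are hypothetical):
COPY INTO transformed_data (customer_id, order_date, order_amount)
FROM (
    SELECT
        t.$2,                         -- column reordering: file column 2 loads first
        TO_DATE(t.$1, 'YYYY-MM-DD'),  -- casting a string to DATE
        NULLIF(t.$3, '')              -- convert empty strings to NULL
    FROM @my_stage t                  -- file columns not selected are omitted
)
FILE_FORMAT = (TYPE = 'CSV' TRIM_SPACE = TRUE);  -- trims surrounding spaces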
Data Transformation Tools
If, after seeing all the data transformation features Snowflake has, you still think you need more, you may want to look into a dedicated data transformation tool. One of Snowflake’s greatest strengths is its ecosystem of partners offering various types of tools, including for data transformations. Here are a couple of our favorites:
dbt
dbt is one of the most popular transformation tools on the market. It provides a way to build modular SQL that can be reused throughout your data pipeline, allowing you to adhere to the Don’t Repeat Yourself (DRY) principle while providing version control and automation.
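For example, a simple dbt model is just a SELECT statement in a .sql file; dbt’s ref() function wires models together so they can be reused and built in dependency order (the model and table names here are hypothetical):
-- models/stg_orders.sql
SELECT
    order_id,
    customer_id,
    order_amount
FROM {{ ref('raw_orders') }}
WHERE order_amount > 0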
Coalesce
Coalesce is a code-first, UI-driven transformation tool built exclusively for Snowflake. What’s great about Coalesce in the context of this article is that it can use virtually any feature Snowflake offers in a reusable, standardized way.
Coalesce’s top features include column-level lineage and auto-generated documentation. If you’re looking for an easy-to-use transformation tool that allows you to code it the way you want, Coalesce is for you.
Closing
With Snowflake’s suite of powerful transformation features and third-party application integrations, you can ensure your data is always prepped and ready for your analytic use cases. And Snowflake is always developing new ways to make transforming your data faster and easier.
If you need help transforming your data, contact the Snowflake experts at phData today!