In this blog post, we’re taking a journey into the realm of slowly changing dimensions (SCDs), specifically within the Snowflake Data Cloud. We’ll explore the different types of SCDs (with example scenarios for each type) and discuss how to implement the most frequently used SCDs (Type 1 and Type 2) in Snowflake.
What Are Slowly Changing Dimensions?
Slowly changing dimensions are dimension tables that store and manage data as it changes over time in a data warehouse. Implementing SCDs is one of the most critical ETL/ELT tasks because it is how the history of dimension records is tracked. For example, if we have fact and dimension tables, the fact table is linked to the dimension tables with the help of foreign keys. If any of the dimension table data changes, we have to keep track of those changes for reporting purposes.
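As a minimal sketch (the table and column names here are made up purely for illustration), a sales fact table might reference an employee dimension through a key like this. Note that Snowflake records primary and foreign key constraints but does not enforce them:
create or replace table emp_dim
(
  emp_id int primary key, // dimension key referenced by the fact table
  emp_name string,
  emp_state string
);
create or replace table sales_fact
(
  sale_id int,
  emp_id int references emp_dim(emp_id), // foreign key to the dimension
  sale_amount number(10,2),
  sale_date date
);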
What Are the Six Types of Slowly Changing Dimensions?
Type-0
Simple and no special action is required.
This type is pretty simple and doesn’t require any special action: a Type 0 dimension table holds static data, so the values remain the same forever.
Example – Date dimension table (static data)
Date | day | week_in_a_month | week_in_a_year |
---|---|---|---|
2021-01-18 | Monday | 4th week | 4th week |
2021-02-18 | Thursday | 3rd week | 8th week |
Type-1
Insert and overwrite (upsert).
In this type, all brand new records are inserted, and any change to an existing record overwrites the old value with the new one. No history is maintained in this type.
Before the change:
Emp_id | Emp_name | state |
---|---|---|
1111 | Tom | MI |
1212 | Adam | MN |
After the change:
Emp_id | Emp_name | state | Notes |
---|---|---|---|
1111 | Tom | AZ | Existing record, so the old value is overwritten |
1212 | Adam | MN | |
1818 | Frank | NY | New record, so it is inserted |
Type-2
Insert a changed record.
In this type, all brand new records are inserted. When an existing record changes, the old row is terminated by setting its end_date and flipping current_flag_status to false, and a new row is created for the changed values with an open end_date and current_flag_status set to true.
Every record carries effective date, end date, and flag status fields that identify the time period in which the record was active.
Before the change:
Emp_id | Emp_name | state | Effective_date | end_date | current_flag_status |
---|---|---|---|---|---|
1111 | Tom | MI | 2020-10-08 | 9999-12-31 | True |
1212 | Adam | MN | 2020-10-08 | 9999-12-31 | True |
After the change:
Emp_id | Emp_name | state | Effective_date | end_date | current_flag_status | Notes |
---|---|---|---|---|---|---|
1111 | Tom | MI | 2020-10-08 | 2021-10-09 | False | Old record terminated |
1111 | Tom | AZ | 2021-10-09 | 9999-12-31 | True | New record for the changed value |
1212 | Adam | MN | 2020-10-08 | 9999-12-31 | True | |
Type-3
Store the new and old values in current and previous fields.
In this method, we usually have current and previous value fields. The new value is written to the current field and the old value is moved into the previous field, so only the immediately previous value is retained as history.
Before the change:
Emp_id | Emp_name | current_state | Previous_state |
---|---|---|---|
1111 | Tom | MI | MI |
1212 | Adam | MN | MN |
After the change:
Emp_id | Emp_name | current_state | Previous_state |
---|---|---|---|
1111 | Tom | AZ | MI |
1212 | Adam | MN | MN |
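A minimal sketch of how such a change could be applied in Snowflake SQL, assuming a dimension table emp_dim_type3 and a source table emp_src holding the latest values (both names are purely illustrative):
update emp_dim_type3
set previous_state = current_state, // keep the old value
    current_state = src.emp_state   // overwrite it with the new value
from emp_src src
where emp_dim_type3.emp_id = src.emp_id
  and emp_dim_type3.current_state != src.emp_state;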
Type-4
Maintain data in separate tables (current table, history table).
In this method, we maintain the data in two separate tables: all currently active data lives in the current table, and all historical data lives in the history table, with effective and end date fields to determine the time period in which each record was active.
Current table:
Emp_id | Emp_name | state |
---|---|---|
1111 | Tom | AZ |
History table:
Emp_id | Emp_name | state | Effective_date | end_date |
---|---|---|---|---|
1111 | Tom | MI | 2020-10-08 | 2021-10-09 |
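A rough sketch of the two statements a Type 4 load could run for this change, assuming illustrative emp_current and emp_history tables (and that emp_current also carries an effective_date column so the history row can be dated):
// 1. copy the outgoing version of the row into the history table, closing it out
insert into emp_history (emp_id, emp_name, state, effective_date, end_date)
select emp_id, emp_name, state, effective_date, current_date()
from emp_current
where emp_id = 1111;
// 2. overwrite the current table with the new value
update emp_current
set state = 'AZ', effective_date = current_date()
where emp_id = 1111;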
Type-6
Type 1 + Type 2 + Type 3 (Combined approach).
This method is a combination of Type 1 (insert and overwrite), Type 2 (insert a changed record), and Type 3 (store the new and old values in current and previous fields) approaches.
Before the change:
Emp_id | Emp_name | current_state | previous_state | effective_date | end_date | current_flag_status |
---|---|---|---|---|---|---|
1111 | Tom | MI | MI | 2020-10-08 | 9999-12-31 | True |
After the change:
Emp_id | Emp_name | current_state | previous_state | effective_date | end_date | current_flag_status |
---|---|---|---|---|---|---|
1111 | Tom | AZ | MI | 2020-10-08 | 2021-10-09 | False |
1111 | Tom | AZ | AZ | 2021-10-09 | 9999-12-31 | True |
In the above example:
- We overwrote current_state from MI to AZ, which is the Type 1 approach.
- We created a new record for the changed data with an open end date, which is the Type 2 approach.
- We maintained current_state and previous_state fields to track the change, which is the Type 3 approach.
Thus, the example clearly shows that Type 6 is a combination of the Type 1, Type 2, and Type 3 approaches.
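Under the same illustrative assumptions (a dimension table emp_dim_type6 with the columns shown above), the change for employee 1111 could be applied roughly like this:
// 1. close out the old record and overwrite its current_state (Type 1 + Type 2)
update emp_dim_type6
set current_state = 'AZ',
    end_date = current_date(),
    current_flag_status = 'False'
where emp_id = 1111 and current_flag_status = 'True';
// 2. insert the new current record (Type 2 + Type 3)
insert into emp_dim_type6
  (emp_id, emp_name, current_state, previous_state, effective_date, end_date, current_flag_status)
values (1111, 'Tom', 'AZ', 'AZ', current_date(), '9999-12-31', 'True');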
How to Implement Type 1 and Type 2 SCDs in Snowflake
For the remainder of this post, we’ll walk through how to implement end-to-end Type 1 and Type 2 pipelines using tasks and streams, so the queries don’t have to be run manually.
The first step is to choose the pipeline depending on the project requirement. In this example, we have a source file in S3 that we will load into a source table. The source table is always truncated and reloaded with the latest file data. The stage SCD Type 1 table is where the Type 1 logic is maintained, and the SCD Type 2 table is where the Type 2 logic is maintained.
Create the DDLs next, followed by creating tasks.
DDLs
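// external stage pointing to the S3 location that holds the source file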
create or replace stage emp_data_stage
file_format = 'CSV'
storage_integration = PHDATA_S3_READ
url = 's3://phdata-snowflake-stage/data/user_pgardhas/test_files/';
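// source table: truncated and reloaded with the latest file on every run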
create or replace table emp_source
(
emp_id int,
emp_name string,
emp_state string
);
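// SCD Type 1 table: holds only the latest version of each employee record (no history)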
create or replace table emp_stage_scd1
(
emp_id int,
emp_name string,
emp_state string
);
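// SCD Type 2 table: holds the full history, with start_time/end_time timestamps and a current-record flag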
create or replace table emp_stage_scd2
(
emp_id int,
emp_name string,
emp_state string,
start_time TIMESTAMP_NTZ,
end_time TIMESTAMP_NTZ,
current_flag_status string
);
Next, you’ll want to create the tasks below to perform the SCD Type 1 ETL process:
- Create a task to truncate the source table before every load
- Create a task to load the file into the source table
- Create a task to perform a merge operation on the final table (insert the brand new records and update the existing records)
- Create a task to remove the file from S3 once the ETL process is completed
SCD Type 1 Pipeline (the pipeline is scheduled to run every day)
// Step1 ===> root task
create or replace task emp_src_truncate
warehouse=compute_wh
schedule='1440 minute' // one day
as
truncate table emp_source;
// Step2 ===> child_task1
create or replace task emp_file_to_src_table
warehouse=compute_wh
after emp_src_truncate
as
copy into emp_source
from @emp_data_stage
file_format=(type=CSV skip_header=1);
// Step3 ===> child_task2
create or replace task emp_scd1
warehouse=compute_wh
after emp_file_to_src_table
as
merge into emp_stage_scd1 emp
using (select * from emp_source) src
on emp.emp_id = src.emp_id
when matched and (emp.emp_name!=src.emp_name or emp.emp_state!=src.emp_state)
then
update set emp.emp_name=src.emp_name,
emp.emp_state=src.emp_state
when not matched
then
insert(emp_id,emp_name,emp_state) values(src.emp_id,src.emp_name,src.emp_state);
//Step4 ===> child_task3
create or replace task src_file_remove
warehouse=compute_wh
after emp_scd1
as
remove @emp_data_stage;
Once all the tasks are created, resume each task in child-to-parent (bottom-up) order using the ALTER TASK ... RESUME command.
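For the Type 1 pipeline, for example, that bottom-up resume order looks like this:
alter task src_file_remove resume;
alter task emp_scd1 resume;
alter task emp_file_to_src_table resume;
alter task emp_src_truncate resume; // root task resumed last, which lets the schedule kick in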
Now check that all the tasks are in the started state using the SHOW TASKS command. Run the command below to view the task history and see whether each run succeeded or failed:
select * from table(information_schema.task_history());
Next, we’ll want to create the tasks below to perform the SCD Type 2 ETL process:
- Create a stream on the landing table to capture the changed data
- Create a task to truncate the source table before every load
- Create a task to load the file to the source table
- Create a task to perform a merge operation on the landing table (insert the brand new records and update the existing records) ==> SCD Type 1 design
- Create a task to perform the merge operation on the final table (compare final table with stream table)
- Create a task to remove the file from S3 once the ETL process is completed
SCD Type 2 Pipeline
//Step1 ===> create stream
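// the stream records changes to emp_stage_scd1; an update shows up as a pair of rows:
// the old values with metadata$action = 'DELETE' and the new values with metadata$action = 'INSERT'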
create or replace stream emp_check on table emp_stage_scd1;
// Step2 ===> root task
create or replace task emp_src_truncate
warehouse=compute_wh
schedule='1440 minute' // one day
as
truncate table emp_source;
// Step3 ===> child_task1
create or replace task emp_file_to_src_table
warehouse=compute_wh
after emp_src_truncate
as
copy into emp_source
from @emp_data_stage
file_format=(type=CSV skip_header=1);
// Step4 ===> child_task2
create or replace task emp_scd1
warehouse=compute_wh
after emp_file_to_src_table
as
merge into emp_stage_scd1 emp
using (select * from emp_source) src
on emp.emp_id = src.emp_id
when matched and (emp.emp_name!=src.emp_name or emp.emp_state!=src.emp_state)
then
update set emp.emp_name=src.emp_name,
emp.emp_state=src.emp_state
when not matched
then
insert(emp_id,emp_name,emp_state) values(src.emp_id,src.emp_name,src.emp_state);
//Step5 ===> child_task3
create or replace task emp_scd2
warehouse=compute_wh
after emp_scd1
when
system$stream_has_data('emp_check')
as
merge into emp_stage_scd2 emp
using (select * from emp_check) chk
on emp.emp_id = chk.emp_id
and emp.emp_state = chk.emp_state
when matched and (chk.metadata$action='DELETE')
then
update set end_time = to_timestamp(current_timestamp),
current_flag_status = 'False'
when not matched and (chk.metadata$action='INSERT')
then
insert(emp_id,emp_name,emp_state,start_time,end_time,current_flag_status)
values (chk.emp_id,chk.emp_name,chk.emp_state,to_timestamp(current_timestamp),NULL,'True');
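// note: because a standard stream represents an update as a DELETE/INSERT pair, the matched branch
// above closes out the existing SCD Type 2 record (the DELETE row still matches on the old state),
// while the not-matched branch inserts the new current record from the INSERT row; a brand new
// employee produces only an INSERT row and simply falls into the insert branch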
//Step6 ===> child_task4
create or replace task src_file_remove
warehouse=compute_wh
after emp_scd2
as
remove @emp_data_stage;
Once the tasks are created, follow the same steps as in the SCD Type 1 pipeline: resume the tasks from child to parent and monitor them through the task history.
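To sanity-check the result after a run, a quick query against the Type 2 table (using emp_id 1111 from the earlier examples purely for illustration) shows the full history of a single employee:
select *
from emp_stage_scd2
where emp_id = 1111
order by start_time;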
Looking for more help with Snowflake?
As a Snowflake Elite Services Partner, phData is uniquely positioned to help organizations of all sizes get the most value out of their Snowflake account. From migration assistance to Snowflake administration, we’ve got you covered!