November 1, 2021

How to Implement Slowly Changing Dimensions in Snowflake

By Prashanth Kumar Gardhas

In this blog post, we’re taking a journey into the realm of slowly changing dimensions (SCDs), specifically within the Snowflake Data Cloud. We’ll explore the different types of SCDs (with example scenarios for each type) and discuss how to implement the most frequently used SCDs (Type 1 and Type 2) in Snowflake. 

What Are Slowly Changing Dimensions?

A slowly changing dimension is simply a dimension that stores and manages data over time in a data warehouse. Implementing SCDs is one of the most critical ETL/ELT tasks because it tracks the history of dimension records. For example, if we have a fact table and dimension tables, the fact table is linked to the dimension tables with the help of foreign keys. If any of the dimension table data changes, we have to keep track of those changes for reporting purposes.

What Are the Six Types of Slowly Changing Dimensions?

Type-0

Simple and no special action is required.

This type is pretty simple and doesn’t require any special action: a Type 0 dimension table holds static data, so the values remain the same forever.

Example – Date dimension table (static data) 

| Date       | day      | week_in_a_month | week_in_a_year |
|------------|----------|-----------------|----------------|
| 2021-01-18 | Monday   | 4th week        | 4th week       |
| 2021-02-18 | Thursday | 3rd week        | 8th week       |

Type-1

Insert and overwrite (upsert).

In this type, all brand new records are inserted, and any change to an existing record overwrites the old value with the new one. No history is maintained in this type.

Before the change: 

| Emp_id | Emp_name | state |
|--------|----------|-------|
| 1111   | Tom      | MI    |
| 1212   | Adam     | MN    |

After the change:

| Emp_id | Emp_name | state | Notes |
|--------|----------|-------|-------|
| 1111   | Tom      | AZ    | Existing record, so the old value is overwritten |
| 1212   | Adam     | MN    |       |
| 1818   | Frank    | NY    | New record, so inserted |

Type-2

Insert a changed record. 

In this type, all brand new records are inserted. For changed records, the existing row is terminated by setting its end_date and flipping current_flag_status to false, and a new record is created for the changed values with an open end_date and current_flag_status set to true.

Each record has effective date, end date, and flag status fields to identify the time period in which it was active.

Before the change:

| Emp_id | Emp_name | state | Effective_date | end_date   | current_flag_status |
|--------|----------|-------|----------------|------------|---------------------|
| 1111   | Tom      | MI    | 2020-10-08     | 9999-12-31 | True                |
| 1212   | Adam     | MN    | 2020-10-08     | 9999-12-31 | True                |

After the change:

| Emp_id | Emp_name | state | Effective_date | end_date   | current_flag_status | Notes |
|--------|----------|-------|----------------|------------|---------------------|-------|
| 1111   | Tom      | MI    | 2020-10-08     | 2021-10-09 | False               | Old record terminated |
| 1111   | Tom      | AZ    | 2021-10-09     | 9999-12-31 | True                |       |
| 1212   | Adam     | MN    | 2020-10-08     | 9999-12-31 | True                |       |


Type-3

Store the new and old values in current and previous fields. 

In this method, we keep current and previous value fields. The new value is written to the current field, and the old value is moved to the previous field.

Before the change:

| Emp_id | Emp_name | current_state | Previous_state |
|--------|----------|---------------|----------------|
| 1111   | Tom      | MI            | MI             |
| 1212   | Adam     | MN            | MN             |

After the change:

| Emp_id | Emp_name | current_state | Previous_state |
|--------|----------|---------------|----------------|
| 1111   | Tom      | AZ            | MI             |
| 1212   | Adam     | MN            | MN             |
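As a minimal sketch, a Type 3 update could be expressed in Snowflake SQL along these lines, assuming a hypothetical dim_employee table with the columns above and an incoming emp_source table that carries the latest values (both names are illustrative):

merge into dim_employee dim 
using (select * from emp_source) src 
on dim.emp_id = src.emp_id 
when matched and dim.current_state != src.emp_state 
then 
update set dim.previous_state = dim.current_state,   // shift the old value into the previous field 
           dim.current_state = src.emp_state         // overwrite the current field with the new value 
when not matched 
then 
insert (emp_id, emp_name, current_state, previous_state) 
values (src.emp_id, src.emp_name, src.emp_state, src.emp_state);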

Type-4

Maintain data in separate tables (current table, history table). 

In this method, we maintain the data in two separate tables: all currently active data lives in the current table, and all historical data lives in the history table, with effective and end date fields to determine the time period in which each record was active.

Current table:

| Emp_id | Emp_name | state |
|--------|----------|-------|
| 1111   | Tom      | AZ    |

History table:

| Emp_id | Emp_name | state | Effective_date | end_date   |
|--------|----------|-------|----------------|------------|
| 1111   | Tom      | MI    | 2020-10-08     | 2021-10-09 |
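As a minimal sketch, a Type 4 change could be applied in Snowflake SQL roughly as follows, assuming hypothetical emp_current and emp_history tables, an incoming emp_source table with the latest values, and that emp_current also carries an effective_date column (not shown in the example above):

// Archive the rows whose state changed, recording the period they were active. 
insert into emp_history (emp_id, emp_name, state, effective_date, end_date) 
select cur.emp_id, cur.emp_name, cur.state, cur.effective_date, current_date() 
from emp_current cur 
join emp_source src on cur.emp_id = src.emp_id 
where cur.state != src.emp_state; 
 
// Overwrite the current table with the latest values. 
update emp_current 
set state = src.emp_state, 
    effective_date = current_date() 
from emp_source src 
where emp_current.emp_id = src.emp_id 
  and emp_current.state != src.emp_state;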

Type-6

Type 1 + Type 2 + Type 3 (Combined approach).

This method is a combination of Type 1 (insert and overwrite), Type 2 (insert a changed record), and Type 3 (store the new and old values in current and previous fields) approaches.

Before the change:

| Emp_id | Emp_name | current_state | previous_state | effective_date | end_date   | current_flag_status |
|--------|----------|---------------|----------------|----------------|------------|---------------------|
| 1111   | Tom      | MI            | MI             | 2020-10-08     | 9999-12-31 | True                |

After the change:

| Emp_id | Emp_name | current_state | previous_state | effective_date | end_date   | current_flag_status |
|--------|----------|---------------|----------------|----------------|------------|---------------------|
| 1111   | Tom      | AZ            | MI             | 2020-10-08     | 2021-10-09 | False               |
| 1111   | Tom      | AZ            | AZ             | 2021-10-09     | 9999-12-31 | True                |

In the above example,

  1. We overwrote the current_state from MI to AZ, which shows the Type 1 approach was incorporated.
  2. We created a new record for the changed data with an open end date, showing the Type 2 approach was incorporated. 
  3. We maintained current_state and previous_state fields to track the changes, demonstrating the Type 3 approach was incorporated. 

Thus, the above example clearly shows Type 6 is a combination of Type 1, Type 2, and Type 3 approaches. 
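As a minimal sketch, the change above could be applied in Snowflake SQL roughly as follows, assuming a hypothetical emp_scd6 table with the columns shown (the literal values simply mirror the example):

// Type 1: overwrite current_state on all existing rows for the employee. 
update emp_scd6 
set current_state = 'AZ' 
where emp_id = 1111; 
 
// Type 2: terminate the previously active record. 
update emp_scd6 
set end_date = '2021-10-09', 
    current_flag_status = 'False' 
where emp_id = 1111 
  and current_flag_status = 'True'; 
 
// Type 2 + Type 3: insert the new active record with the new value in both state fields. 
insert into emp_scd6 
  (emp_id, emp_name, current_state, previous_state, effective_date, end_date, current_flag_status) 
values 
  (1111, 'Tom', 'AZ', 'AZ', '2021-10-09', '9999-12-31', 'True');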

How to Implement Type 1 and Type 2 SCDs in Snowflake

For the remainder of this post, we’ll walk through how to implement end-to-end Type 1 and Type 2 pipelines using tasks and streams, so that queries don’t have to be run manually.

The first step is to choose the pipeline based on the project requirements. In this example, we have a source file in S3 that we will load into a source table. The source table is always truncated and reloaded with the latest file data. The staging SCD Type 1 table is where the Type 1 logic is maintained, and the SCD Type 2 table is where the Type 2 logic is maintained.

Create the DDLs next, then create the tasks.

DDLs

				
create or replace stage emp_data_stage 
  file_format = 'CSV' 
  storage_integration = PHDATA_S3_READ 
  url = 's3://phdata-snowflake-stage/data/user_pgardhas/test_files/'; 
 
create or replace table emp_source 
( 
  emp_id int, 
  emp_name string, 
  emp_state string 
); 
 
create or replace table emp_stage_scd1 
( 
  emp_id int, 
  emp_name string, 
  emp_state string 
); 
 
create or replace table emp_stage_scd2 
( 
  emp_id int, 
  emp_name string, 
  emp_state string, 
  start_time TIMESTAMP_NTZ, 
  end_time TIMESTAMP_NTZ, 
  current_flag_status string 
); 
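
Before building the tasks, it can be worth confirming that the stage resolves to the S3 location and that the source file is visible; a quick sanity check against the stage created above:

desc stage emp_data_stage;      // confirm the stage definition and storage integration 
list @emp_data_stage;           // the source CSV file should appear here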
				
			

Next, you’ll want to create the tasks below to perform the SCD Type 1 ETL process: 

  1. Create a task to truncate the source table before every load
  2. Create a task to load the file to source table
  3. Create a task to perform a merge operation on the final table (insert the brand new records and update the existing records)
  4. Create a task to remove the file from S3 once the ETL process is completed

SCD Type-1 pipeline (The pipeline is scheduled to run every day)

				
// Step1   ===> root task 
create or replace task emp_src_truncate 
warehouse=compute_wh 
schedule='1440 minute'  	// one day 
as 
truncate table emp_source; 
 
// Step2 ===> child_task1 
create or replace task emp_file_to_src_table 
warehouse=compute_wh 
after emp_src_truncate 
as 
copy into emp_source  
from @emp_data_stage 
file_format=(type=CSV, skip_header=1); 
 
// Step3 ===> child_task2 
create or replace task emp_scd1 
warehouse=compute_wh 
after emp_file_to_src_table 
as 
merge into emp_stage_scd1 emp 
using (select * from emp_source) src 
on emp.emp_id = src.emp_id 
when matched and (emp.emp_name!=src.emp_name or emp.emp_state!=src.emp_state) 
then  
update set emp.emp_name=src.emp_name, 
emp.emp_state=src.emp_state 
when not matched  
then 
insert(emp_id,emp_name,emp_state) values(src.emp_id,src.emp_name,src.emp_state); 
 
//Step4 ===> child_task3 
create or replace task src_file_remove 
warehouse=compute_wh 
after emp_scd1 
as 
remove @emp_data_stage;
				
			

Once all the tasks are created, resume each task in child-to-parent (bottom-up) order using the “alter task resume” command, as shown below.
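
For the SCD Type 1 pipeline above, the resume order looks like this (child tasks first, root task last), followed by a quick check that everything is started:

alter task src_file_remove resume; 
alter task emp_scd1 resume; 
alter task emp_file_to_src_table resume; 
alter task emp_src_truncate resume;   // root task last 
 
show tasks;   // every task should now report state = started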

Now check to make sure all the tasks are in the started state with the help of the “show tasks” command. Run the command below to view whether each task run succeeded or failed.

				
select * from table(information_schema.task_history());
				
			

Next, we’ll want to create the tasks below to perform the SCD Type 2 ETL process: 

  1. Create a stream on the landing table to capture the changed data 
  2. Create a task to truncate the source table before every load 
  3. Create a task to load the file to the source table 
  4. Create a task to perform a merge operation on the landing table (insert the brand new records and update the existing records) ==> SCD Type 1 design 
  5. Create a task to perform a merge operation on the final table (compare the final table with the stream table) 
  6. Create a task to remove the file from S3 once the ETL process is completed

SCD Type 2 Pipeline

				
//Step1 ===> create stream 
create or replace stream emp_check on table emp_stage_scd1; 
 
// Step2   ===> root task 
create or replace task emp_src_truncate 
warehouse=compute_wh 
schedule='1440 minute'  	// one day 
as 
truncate table emp_source; 
 
// Step3 ===> child_task1 
create or replace task emp_file_to_src_table 
warehouse=compute_wh 
after emp_src_truncate 
as 
copy into emp_source  
from @emp_data_stage 
file_format=(type=CSV, skip_header=1); 
 
// Step4 ===> child_task2 
create or replace task emp_scd1 
warehouse=compute_wh 
after emp_file_to_src_table 
as 
merge into emp_stage_scd1 emp 
using (select * from emp_source) src 
on emp.emp_id = src.emp_id 
when matched and (emp.emp_name!=src.emp_name or emp.emp_state!=src.emp_state) 
then  
update set emp.emp_name=src.emp_name, 
emp.emp_state=src.emp_state 
when not matched  
then 
insert(emp_id,emp_name,emp_state) values(src.emp_id,src.emp_name,src.emp_state); 
 
//Step5 ===> child_task3 
create or replace task emp_scd2 
warehouse=compute_wh 
after emp_scd1 
when  
system$stream_has_data('emp_check') 
as 
merge into emp_stage_scd2 emp  
using (select * from emp_check) chk  
on emp.emp_id = chk.emp_id  
and emp.emp_state = chk.emp_state 
when matched and (chk.metadata$action='DELETE')  
then  
update set end_time = to_timestamp(current_timestamp),  
current_flag_status = 'False'  
when not matched and (chk.metadata$action='INSERT')  
then   
insert(emp_id,emp_name,emp_state,start_time,end_time,current_flag_status)  
values (chk.emp_id,chk.emp_name,chk.emp_state,to_timestamp(current_timestamp),NULL,'True'); 
 
//Step6 ===> child_task4 
create or replace task src_file_remove 
warehouse=compute_wh 
after emp_scd1 
as 
remove @emp_data_stage;
				
			

Once the tasks are created, follow the same resume-and-verify steps from the SCD Type 1 pipeline to complete the setup.
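
Once the pipeline has run, the SCD Type 2 table keeps both the current and the historical version of each record, which can be verified with a couple of simple queries against the tables created above:

// Current version of each employee 
select * from emp_stage_scd2 where current_flag_status = 'True'; 
 
// Closed-out historical versions, oldest first 
select * from emp_stage_scd2 where current_flag_status = 'False' order by emp_id, start_time;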

Looking for more help with Snowflake?

As a Snowflake Elite Services Partner, phData is uniquely positioned to help organizations of all sizes get the most value out of their Snowflake account. From migration assistance to Snowflake administration, we’ve got you covered!
