In this blog post, we’re taking a journey into the realm of slowly changing dimensions (SCDs), specifically within the Snowflake Data Cloud. We’ll explore the different types of SCDs (with example scenarios for each type) and discuss how to implement the most frequently used SCDs (Type 1 and Type 2) in Snowflake.
What Are Slowly Changing Dimensions?
Slowly changing dimensions are dimension tables that store and manage data as it changes over time in a data warehouse. Implementing SCDs is one of the most critical ETL/ELT tasks because it is how the history of dimension records is tracked. For example, if we have fact and dimension tables, the fact table is linked to the dimension tables with the help of foreign keys. If any of the dimension table data changes, we have to keep track of those changes for reporting purposes.
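As a minimal sketch (the table and column names here are made up purely for illustration), a sales fact table might reference an employee dimension through a key like this. Note that Snowflake records primary and foreign key constraints but does not enforce them:
create or replace table emp_dim
(
  emp_id int primary key, // dimension key referenced by the fact table
  emp_name string,
  emp_state string
);
create or replace table sales_fact
(
  sale_id int,
  emp_id int references emp_dim(emp_id), // foreign key to the dimension
  sale_amount number(10,2),
  sale_date date
);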
What Are the Six Types of Slowly Changing Dimensions?
Type-0
Simple and no special action is required.
This type is pretty simple and doesn’t require any special action: a Type 0 dimension table holds static data, so the values remain the same forever.
Example – Date dimension table (static data)
Date | day | week_in_a_month | week_in_a_year |
---|---|---|---|
2021-01-18 | Monday | 4th week | 4th week |
2021-02-18 | Thursday | 3rd week | 8th week |
Type-1
Insert and overwrite (upsert).
In this type, all brand new records are inserted, and any change to an existing record overwrites the old value with the new one. No history is maintained in this type.
Before the change:
Emp_id | Emp_name | state |
---|---|---|
1111 | Tom | MI |
1212 | Adam | MN |
After the change:
Emp_id | Emp_name | state | Notes |
---|---|---|---|
1111 | Tom | AZ | Existing record, so the old value is overwritten |
1212 | Adam | MN | |
1818 | Frank | NY | New record, so it is inserted |
Type-2
Insert a changed record.
In this type, all brand new records are inserted. When an existing record changes, the old row is terminated by setting its end_date and flipping current_flag_status to false, and a new row is created for the changed values with an open end_date and current_flag_status set to true.
Every record carries effective date, end date, and flag status fields that identify the time period in which the record was active.
Before the change:
Emp_id | Emp_name | state | Effective_date | end_date | current_flag_status |
---|---|---|---|---|---|
1111 | Tom | MI | 2020-10-08 | 9999-12-31 | True |
1212 | Adam | MN | 2020-10-08 | 9999-12-31 | True |
After the change:
Emp_id | Emp_name | state | Effective_date | end_date | current_flag_status | Notes |
---|---|---|---|---|---|---|
1111 | Tom | MI | 2020-10-08 | 2021-10-09 | False | Old record terminated |
1111 | Tom | AZ | 2021-10-09 | 9999-12-31 | True | New record for the changed value |
1212 | Adam | MN | 2020-10-08 | 9999-12-31 | True | |
Type-3
Store the new and old values in current and previous fields.
In this method, we usually have current and previous value fields. The new value is written to the current field and the old value is moved into the previous field, so only the immediately previous value is retained as history.
Before the change:
Emp_id | Emp_name | current_state | Previous_state |
---|---|---|---|
1111 | Tom | MI | MI |
1212 | Adam | MN | MN |
After the change:
Emp_id | Emp_name | current_state | Previous_state |
---|---|---|---|
1111 | Tom | AZ | MI |
1212 | Adam | MN | MN |
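A minimal sketch of how such a change could be applied in Snowflake SQL, assuming a dimension table emp_dim_type3 and a source table emp_src holding the latest values (both names are purely illustrative):
update emp_dim_type3
set previous_state = current_state, // keep the old value
    current_state = src.emp_state   // overwrite it with the new value
from emp_src src
where emp_dim_type3.emp_id = src.emp_id
  and emp_dim_type3.current_state != src.emp_state;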
Type-4
Maintain data in separate tables (current table, history table).
In this method, we maintain the data in two separate tables: all currently active data lives in the current table, and all historical data lives in the history table, with effective and end date fields to determine the time period in which each record was active.
Current table:
Emp_id | Emp_name | state |
---|---|---|
1111 | Tom | AZ |
History table:
Emp_id | Emp_name | state | Effective_date | end_date |
---|---|---|---|---|
1111 | Tom | MI | 2020-10-08 | 2021-10-09 |
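A rough sketch of the two statements a Type 4 load could run for this change, assuming illustrative emp_current and emp_history tables (and that emp_current also carries an effective_date column so the history row can be dated):
// 1. copy the outgoing version of the row into the history table, closing it out
insert into emp_history (emp_id, emp_name, state, effective_date, end_date)
select emp_id, emp_name, state, effective_date, current_date()
from emp_current
where emp_id = 1111;
// 2. overwrite the current table with the new value
update emp_current
set state = 'AZ', effective_date = current_date()
where emp_id = 1111;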
Type-6
Type 1 + Type 2 + Type 3 (Combined approach).
This method is a combination of Type 1 (insert and overwrite), Type 2 (insert a changed record), and Type 3 (store the new and old values in current and previous fields) approaches.
Before the change:
Emp_id | Emp_name | current_state | previous_state | effective_date | end_date | current_flag_status |
---|---|---|---|---|---|---|
1111 | Tom | MI | MI | 2020-10-08 | 9999-12-31 | True |
After the change:
Emp_id | Emp_name | current_state | previous_state | effective_date | end_date | current_flag_status |
---|---|---|---|---|---|---|
1111 | Tom | AZ | MI | 2020-10-08 | 2021-10-09 | False |
1111 | Tom | AZ | AZ | 2021-10-09 | 9999-12-31 | True |
In the above example:
- We overwrote current_state from MI to AZ, which is the Type 1 approach.
- We created a new record for the changed data with an open end date, which is the Type 2 approach.
- We maintained current_state and previous_state fields to track the change, which is the Type 3 approach.
Thus, the example clearly shows that Type 6 is a combination of the Type 1, Type 2, and Type 3 approaches.
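Under the same illustrative assumptions (a dimension table emp_dim_type6 with the columns shown above), the change for employee 1111 could be applied roughly like this:
// 1. close out the old record and overwrite its current_state (Type 1 + Type 2)
update emp_dim_type6
set current_state = 'AZ',
    end_date = current_date(),
    current_flag_status = 'False'
where emp_id = 1111 and current_flag_status = 'True';
// 2. insert the new current record (Type 2 + Type 3)
insert into emp_dim_type6
  (emp_id, emp_name, current_state, previous_state, effective_date, end_date, current_flag_status)
values (1111, 'Tom', 'AZ', 'AZ', current_date(), '9999-12-31', 'True');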
How to Implement Type 1 and Type 2 SCDs in Snowflake
For the remainder of this post, we’ll walk through how to implement end-to-end Type 1 and Type 2 pipelines using tasks and streams, so the queries don’t have to be run manually.
The first step is to choose the pipeline depending on the project requirement. In this example, we have a source file in S3 that we will load into a source table. The source table is always truncated and reloaded with the latest file data. The stage SCD Type 1 table is where the Type 1 logic is maintained, and the SCD Type 2 table is where the Type 2 logic is maintained.
Create the DDLs next, followed by creating tasks.
DDLs
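// external stage pointing to the S3 location that holds the source file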
create or replace stage emp_data_stage
file_format = 'CSV'
storage_integration = PHDATA_S3_READ
url = 's3://phdata-snowflake-stage/data/user_pgardhas/test_files/';
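// source table: truncated and reloaded with the latest file on every run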
create or replace table emp_source
(
emp_id int,
emp_name string,
emp_state string
);
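// SCD Type 1 table: holds only the latest version of each employee record (no history)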
create or replace table emp_stage_scd1
(
emp_id int,
emp_name string,
emp_state string
);
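// SCD Type 2 table: holds the full history, with start_time/end_time timestamps and a current-record flag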
create or replace table emp_stage_scd2
(
emp_id int,
emp_name string,
emp_state string,
start_time TIMESTAMP_NTZ,
end_time TIMESTAMP_NTZ,
current_flag_status string
);
Next, you’ll want to create the tasks below to perform the SCD Type 1 ETL process:
- Create a task to truncate the source table before every load
- Create a task to load the file into the source table
- Create a task to perform a merge operation on the final table (insert the brand new records and update the existing records)
- Create a task to remove the file from S3 once the ETL process is completed
SCD Type 1 Pipeline (the pipeline is scheduled to run every day)
// Step1 ===> root task
create or replace task emp_src_truncate
warehouse=compute_wh
schedule='1440 minute' // one day
as
truncate table emp_source;
// Step2 ===> child_task1
create or replace task emp_file_to_src_table
warehouse=compute_wh
after emp_src_truncate
as
copy into emp_source
from @emp_data_stage
file_format=(type=CSV skip_header=1);
// Step3 ===> child_task2
create or replace task emp_scd1
warehouse=compute_wh
after emp_file_to_src_table
as
merge into emp_stage_scd1 emp
using (select * from emp_source) src
on emp.emp_id = src.emp_id
when matched and (emp.emp_name!=src.emp_name or emp.emp_state!=src.emp_state)
then
update set emp.emp_name=src.emp_name,
emp.emp_state=src.emp_state
when not matched
then
insert(emp_id,emp_name,emp_state) values(src.emp_id,src.emp_name,src.emp_state);
//Step4 ===> child_task3
create or replace task src_file_remove
warehouse=compute_wh
after emp_scd1
as
remove @emp_data_stage;
Once all the tasks are created, resume each task in child-to-parent (bottom-up) order using the ALTER TASK ... RESUME command.
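For the Type 1 pipeline, for example, that bottom-up resume order looks like this:
alter task src_file_remove resume;
alter task emp_scd1 resume;
alter task emp_file_to_src_table resume;
alter task emp_src_truncate resume; // root task resumed last, which lets the schedule kick in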
Now check that all the tasks are in the started state using the SHOW TASKS command. Run the command below to view the task history and see whether each run succeeded or failed:
select * from table(information_schema.task_history());
Next, we’ll want to create the tasks below to perform the SCD Type 2 ETL process:
- Create a stream on the landing table to capture the changed data
- Create a task to truncate the source table before every load
- Create a task to load the file to the source table
- Create a task to perform a merge operation on the landing table (insert the brand new records and update the existing records) ==> SCD Type 1 design
- Create a task to perform the merge operation on the final table (compare final table with stream table)
- Create a task to remove the file from S3 once the ETL process is completed
SCD Type 2 Pipeline
//Step1 ===> create stream
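// the stream records changes to emp_stage_scd1; an update shows up as a pair of rows:
// the old values with metadata$action = 'DELETE' and the new values with metadata$action = 'INSERT'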
create or replace stream emp_check on table emp_stage_scd1;
// Step2 ===> root task
create or replace task emp_src_truncate
warehouse=compute_wh
schedule='1440 minute' // one day
as
truncate table emp_source;
// Step3 ===> child_task1
create or replace task emp_file_to_src_table
warehouse=compute_wh
after emp_src_truncate
as
copy into emp_source
from @emp_data_stage
file_format=(type=CSV skip_header=1);
// Step4 ===> child_task2
create or replace task emp_scd1
warehouse=compute_wh
after emp_file_to_src_table
as
merge into emp_stage_scd1 emp
using (select * from emp_source) src
on emp.emp_id = src.emp_id
when matched and (emp.emp_name!=src.emp_name or emp.emp_state!=src.emp_state)
then
update set emp.emp_name=src.emp_name,
emp.emp_state=src.emp_state
when not matched
then
insert(emp_id,emp_name,emp_state) values(src.emp_id,src.emp_name,src.emp_state);
//Step5 ===> child_task3
create or replace task emp_scd2
warehouse=compute_wh
after emp_scd1
when
system$stream_has_data('emp_check')
as
merge into emp_stage_scd2 emp
using (select * from emp_check) chk
on emp.emp_id = chk.emp_id
and emp.emp_state = chk.emp_state
when matched and (chk.metadata$action='DELETE')
then
update set end_time = to_timestamp(current_timestamp),
current_flag_status = 'False'
when not matched and (chk.metadata$action='INSERT')
then
insert(emp_id,emp_name,emp_state,start_time,end_time,current_flag_status)
values (chk.emp_id,chk.emp_name,chk.emp_state,to_timestamp(current_timestamp),NULL,'True');
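// note: because a standard stream represents an update as a DELETE/INSERT pair, the matched branch
// above closes out the existing SCD Type 2 record (the DELETE row still matches on the old state),
// while the not-matched branch inserts the new current record from the INSERT row; a brand new
// employee produces only an INSERT row and simply falls into the insert branch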
//Step6 ===> child_task4
create or replace task src_file_remove
warehouse=compute_wh
after emp_scd2
as
remove @emp_data_stage;
Once the tasks are created, follow the same steps as in the SCD Type 1 pipeline: resume the tasks from child to parent and monitor them through the task history.
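To sanity-check the result after a run, a quick query against the Type 2 table (using emp_id 1111 from the earlier examples purely for illustration) shows the full history of a single employee:
select *
from emp_stage_scd2
where emp_id = 1111
order by start_time;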
Looking for more help with Snowflake?
As a Snowflake Elite Services Partner, phData is uniquely positioned to help organizations of all sizes get the most value out of their Snowflake account. From migration assistance to Snowflake administration, we’ve got you covered!