December 29, 2022

phData Toolkit December 2022 Update

By Nick Goble

Happy holidays and welcome to our December update of the phData Toolkit. We hope you’ve had a fantastic holiday season, filled up on delicious food, and are as excited as us to kick off the 2023 calendar year.

The phData Toolkit continues to grow as we work with customers to accelerate their migrations, build data governance practices, and ensure quality data products are built.

One of the guiding principles of phData is automation and that’s exactly what many of the offerings in the toolkit aim to do. This automation includes things like SQL translation during a data platform migration (SQLMorph), making changes to your Snowflake information architecture (Tram), and checking for parity and data quality between platforms (Data Source Automation).

Over the course of the last month, we’ve made updates to both the phData Toolkit UI and the Toolkit CLI. Let’s dive in and take a deeper look at these.

SQL Translation Updates

Over the course of the last few months, we’ve had a focus on increasing the coverage of our SQL translations from Microsoft SQL (MSSQL) to the Snowflake Data Cloud.

We’ve seen a number of customers performing migrations from MSSQL platforms to Snowflake, and they need this translation automated as much as possible.

One of the major additions that we’ve made to the tool is adding support for the CREATE FUNCTION statement. Functions are commonly used to encapsulate a piece of logic that can be reused for every given input and are heavily used in both MSSQL and Snowflake.  

An example of this could be how net sales are calculated.  

Instead of copying this logic into each stored procedure or query, that logic is encapsulated in a function and can be reused across data manipulation language (DML) queries. For example, this would look like the following:


MSSQL:

CREATE FUNCTION sales.udfNetSale(
  @quantity INT,
  @list_price DEC(10, 2),
  @discount DEC(4, 2)
)
RETURNS DEC(10, 2)
AS
BEGIN RETURN @quantity * @list_price * (1 - @discount);
END;

Snowflake:

CREATE FUNCTION sales.udfNetSale(
  quantity INT,
  list_price DECIMAL(10, 2),
  discount DECIMAL(4, 2)
)
RETURNS DECIMAL(10, 2)
AS
BEGIN RETURN :quantity * :list_price * (1 - :discount);
END;

While this is a simpler translation (besides the conversion from SQL to JavaScript within Snowflake’s UDF) due to having feature parity between the two platforms, translations are frequently much more challenging.

For example, MSSQL allows for branching logic through the use of GOTO statements and Snowflake does not. GOTO statements were frequently used in languages like FORTRAN and BASIC and have carried over into other platforms, but they are challenging to debug and support.

While Snowflake doesn’t support these statements, we’ve added the ability to translate them where possible and flag what can’t be translated. For example:


MSSQL:

create procedure proc1 as
DECLARE @Counter int;
SET @Counter = 1;
WHILE @Counter < 10
BEGIN
  SET @Counter = @Counter + 1;
  IF @Counter = 4 GOTO Branch_One;
END
Branch_One:
  PRINT 'Jumping To Branch One.';

Snowflake:

BEGIN
LET Counter INT;
Counter := 1;
WHILE (:Counter < 10) DO
  Counter := :Counter + 1;
  IF (:Counter = 4) THEN
    -- ERROR: line 9, character 20: Snowflake does not support `GOTO` statement.
    /* GOTO Branch_One; */
  END IF;
END WHILE;
-- ERROR: line 12, character 0: Snowflake does not support `GOTO label` statement. Translated the label definition statements.
/* Branch_One:; */
RETURN 'Jumping To Branch One.';
END;

We’ve also introduced a number of fixes to our MSSQL to Snowflake translation:

  • Translate IF statement with OBJECT_ID function for DROP statement
  • Translate SQL text within EXEC statement
  • Add a warning message for CHECK OPTION clause in CREATE VIEW statement
  • Add a warning message for WITH SCHEMABINDING clause in CREATE VIEW statement
  • Add a few keywords as identifiers in Snowflake
  • Enclose Snowflake reserved keywords with double quotes
  • Translate the suser_sname() function to current_user() in Snowflake
  • Add an INFO message for the REFRESH statement
  • Translate the LOG function
  • Translate functions that return a table
  • Parse Select statement with more than one PIVOT clause
  • Parse table_hint clause
  • Translate the @@ROWCOUNT method
  • Parse the RETURN statement
  • Translate the DELETE statement with FROM clause

Project Administration Updates

Tram, our project administration tool, aims to automate the governance and change management of an information architecture within Snowflake.  But what does this actually mean?  

Traditionally, database administrators (DBAs) would run scripts that were manually generated through each environment to make changes to the database. This includes things like creating and modifying databases, schemas, and permissions.  

Tram instead leverages a git repository and takes an infrastructure-as-code (IaC) approach to make changes, similar to products like Terraform. This allows customers to follow a GitOps approach to deploying changes, where changes to environments are controlled via branches, changes can go through an approval/review process, and modifications to the environment are made programmatically via continuous integration and continuous deployment (CI/CD) processes.

Over the course of the last month, we’ve introduced some minor fixes and improvements:

  • Added a fix to not perform tag operations on Standard edition of Snowflake
  • Added a fix to avoid running SHOW TAGS command when there are no tags defined in Tram
  • Added a fix for not generating CREATE TAG statements when tag_name is not specified
  • Fixed an issue related to Schema objects not being cached correctly
  • Added a fix to remove the warning message “Requested to apply TAG for object” when there are no tags defined
  • Fixed an issue where disable user statements were not generated when the --max-drops flag is enabled

These fixes are largely due to Snowflake introducing native support for tags. Tram previously managed a separate TAG table manually so that customers could apply tags to different objects in their information architecture.

This is important to call out, as phData’s Toolkit frequently introduces functionality that customers need but that is currently a gap in the platforms they use.

Toolkit CLI Updates

One of the largest pieces of work that we’ve introduced over the last month is version 0.5.0 of the Toolkit CLI. The Toolkit CLI is a collection of tools that are built to be used either manually or within your CI/CD environment.

These tools include things like profiling data sources, validating data migrations, generating data pipelines and dbt sources, and bulk translating SQL. This provides a more programmatic approach to the tooling than the Toolkit UI.

Some of the major improvements that have been made are within the data profiling and validation components of the Toolkit CLI.  We’ve introduced significant performance improvements when trying to run a data source match against large source and target databases.

This command aims to line up the tables in both systems programmatically leveraging things like database name, schema name, and table name.  We’ve also improved this match logic by picking a best match when things don’t exactly line up.

For example, given a Hive table called db1.table1 and a Snowflake table called db1.schema1.table1, the tool will match on the Snowflake database and table names (ignoring the schema).
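The best-match behavior described above can be sketched roughly as follows. This is a hypothetical illustration in Python, not the Toolkit’s actual implementation; the function name and scoring scheme are assumptions:

```python
def match_tables(hive_tables, snowflake_tables):
    """Pair each Hive table (db.table) with the best Snowflake candidate
    (db.schema.table). Illustrative sketch only: prefer an exact
    database + table name match, fall back to a table-name-only match,
    and ignore the schema entirely."""
    matches = {}
    for hive in hive_tables:
        hive_db, hive_table = hive.lower().split(".")
        best, best_score = None, 0
        for sf in snowflake_tables:
            sf_db, _sf_schema, sf_table = sf.lower().split(".")
            # The table name must line up at minimum; a matching
            # database name makes the candidate a better match.
            score = (sf_table == hive_table) * 2 + (sf_db == hive_db)
            if score > best_score:
                best, best_score = sf, score
        if best_score >= 2:  # require at least the table name to match
            matches[hive] = best
    return matches

# db1.table1 matches db1.schema1.table1 on database and table name,
# even though Hive has no schema level to compare against.
pairs = match_tables(["db1.table1"], ["db1.schema1.table1", "db2.schema1.table1"])
```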

In the profile command, we’ve also added support to specify conditions. These conditions are leveraged to limit queries against the source or target system.  An example of this would be limiting a profile to only contain objects that have been created in the last 30 days.


					ds {
  datasources {
    ds1 {
      profile {
        conditions = [
          {
            name: "-30 days"
            condition: "WHERE ds <= DATE_SUB(CURRENT_DATE(), 30)"
            filters: {
              database {
                includePattern: ".*"
              }
            }
          }
          {
            name: "-30 days (Snowflake)"
            condition: "AT(TIMESTAMP => DATE_ADD(DAY, CURRENT_DATE(), -30))" // Snowflake Time Travel
            filters: {
              database {
                includePattern: ".*"
              }
            }
          }
        ]
      }
    }
  }
}
We’ve also added some quality of life improvements to some of the Toolkit CLI commands. Previously, if you needed to specify a template, you would have to pass the full path to the template.  

These commands will now make assumptions about where those should be located by default, although they can be overwritten. For example, instead of passing --templates tools/ds/templates/scan/dbt-project/common, you can use the shorter version --templates dbt-project.

Bringing It Together

It is our hope that this update has been both educational and informative. As we roll into the new year, data is only going to be increasingly valuable. Platforms such as Snowflake are increasingly necessary to derive business value and serve data to end consumers in a highly available and high-performance environment. 

Customers will continue to need to evaluate their existing data platform and data strategy. The phData Toolkit is an incredibly valuable tool to augment and supplement these initiatives.

Happy holidays and happy new year everyone!

phData Toolkit

If you haven’t already explored the phData Toolkit, we highly recommend checking it out!

We encourage you to spend a few minutes browsing the apps and tools available in the phData Toolkit today to set yourself up for success in 2023.

Be sure to follow this series for more updates to the phData Toolkit tools and features. 

Data Coach is our premium analytics training program with one-on-one coaching from renowned experts.

Accelerate and automate your data projects with the phData Toolkit