June 6, 2023

phData Toolkit May 2023 Update

By Nick Goble

Hello and welcome to the latest installment of our phData Toolkit blog series!  This series is meant to keep the data community up to speed on the latest and greatest project accelerators, data governance enablement, and automation that phData has to offer.

This toolkit is leveraged by our staff and customers as we continue to work together across a variety of data practices, verticals, and projects.

Our engineering team has been hard at work iterating on the toolkit to deliver a best-in-breed solution.  Just as “April showers bring May flowers,” we too have been watching the fruits of our labor blossom into an exciting update for you.

Let’s dive in.

Toolkit CLI Updates

The Toolkit CLI is an extension of our Toolkit UI and aims to provide a programmatic interface for a variety of tasks such as SQL translation, profiling of source/target databases, and generating scripts using those profiles.

When developing automation and scripting tools, one of the most time-consuming tasks is debugging.  When using a third-party library, it can be very difficult to debug an error when you don’t have documentation, good error handling, good error messages, or open source code.

One of the major improvements we’ve made over the last month is better error messages within our provisioning tool.  We now attach enough context to failed SQL commands that an error can be traced back to the specific group and template that caused it.  We’ve also improved the error messages produced when parsing stacks in the same vein.

Another issue that frequently occurs is accidental duplication.  Perhaps two different people put in a ServiceNow request without knowing the other had done the same.  The team responsible for servicing those tickets either needs a process to identify the duplicate, or it gets noticed only by coincidence.

Frequently, though, these duplicates are missed and both tickets get resolved.  When using Infrastructure-as-Code (IaC) tooling, such as the provision tool, this can lead to undefined or accidental behavior.

We’ve now added a check in our provision tool that prevents duplicate resource definitions, making provisioning behavior more deterministic and avoiding these accidents.

We’ve also greatly improved our documentation to aid in debugging and to guide users of the Toolkit CLI.  This includes Jinja syntax examples ranging from basic to complex, greatly improved type information with examples, and more.
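
As a taste of the kind of Jinja the docs now cover, here’s a minimal sketch (the variable and property names are illustrative, not taken from the documentation):

    {# Basic: substitute variables into a statement #}
    CREATE SCHEMA {{ database }}.{{ schema_name }};

    {# More complex: loop over a list with a conditional #}
    {% for grant in grants %}
    GRANT {{ grant.privilege }} ON SCHEMA {{ database }}.{{ schema_name }}
      TO ROLE {{ grant.role }}{% if grant.with_grant_option %} WITH GRANT OPTION{% endif %};
    {% endfor %}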

The other major improvement that we’ve made comes in the form of Type Templates.  When migrating from one platform to another, you run into situations where data types don’t perfectly match between the two systems.

For example, Oracle defines its VARCHARs in characters, whereas AWS Redshift defines VARCHARs in bytes.  This can lead to strings overflowing when you attempt to load your data into Redshift.  Within the type info in the data source tool, we now provide additional control over how types are mapped using Jinja templates.
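
As a hedged sketch of the idea (the template variable below, precision, is hypothetical rather than the tool’s actual context), a type template that widens Oracle character lengths into Redshift byte lengths might look like:

    {# Oracle declares VARCHAR2 lengths in characters; Redshift counts bytes. #}
    {# Multiply by 4 to leave room for multi-byte UTF-8 characters,           #}
    {# capping at Redshift's 65535-byte VARCHAR maximum.                      #}
    {% if precision is defined %}
    VARCHAR({{ [precision * 4, 65535] | min }})
    {% else %}
    VARCHAR(65535)
    {% endif %}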

We’ve also made a number of smaller updates to the CLI including:

  • Execute ALTER WAREHOUSE statements to set resource monitors separately from CREATE statements
  • Fixed property checking when provisioning multiple issues in trigger-based provisioning
  • Prevented extraneous ALTER statements when schemas do not have managed access explicitly enabled or disabled
  • Made --approve implicit when running the tool in a non-interactive shell, like a build server
  • Fixed comparison of some user properties between the stack and Snowflake
  • Fixed some minor issues when migrating from a Tram workspace to a provision workspace
  • Added source and target row counts to the grid when they are available in the profile diff

SQL Translation

SQLMorph, our SQL translation tool, is focused on accelerating projects and platform migrations by converting between different SQL dialects.  If you’re not familiar with what a dialect is, take, for example, American English and British English.  Is it a hood or a bonnet?  Is it a trunk or a boot?  Is it “favorite” or “favourite”?

These minor differences in the same language cause headaches for migrations because every SQL statement needs to be combed through and adjusted to fit the target dialect.  This is exactly what SQLMorph addresses programmatically!
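
For instance, here’s the same query written for SQL Server and for Snowflake (an illustrative pairing, not actual SQLMorph output):

SQL Server:

    SELECT TOP 10 [name] FROM dbo.customers;

Snowflake:

    SELECT name FROM dbo.customers LIMIT 10;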

Over the last month, we’ve made a handful of updates across many dialect pairings.  We’ll cover a few of these.

In the Hive Family (Hive, Impala, Databricks SparkSQL, and SparkSQL) to Snowflake translation, we’ve added support for aliasing derived columns.  While Hive doesn’t require computed columns to have an alias, Snowflake does!
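
For example (an illustrative sketch; the generated alias name below is an assumption rather than SQLMorph’s exact output):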

Hive Family:
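
    -- Hive auto-names the unaliased derived column (e.g. _c1),
    -- so no alias is required
    CREATE VIEW order_totals AS
    SELECT id, price * quantity FROM orders;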

Snowflake:
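
    -- Snowflake requires an explicit alias on the derived column;
    -- the alias name here is illustrative
    CREATE VIEW order_totals AS
    SELECT id, price * quantity AS EXPR_1 FROM orders;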

Sometimes translations are a little trickier to handle than simply adding default column names.  For example, Netezza has an AGE function, whereas Snowflake doesn’t.  The conversion of this function is not as straightforward as you might expect.
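
To illustrate (a hand-written sketch, not SQLMorph’s actual output): AGE returns the interval between two timestamps in years, months, and days, so a Snowflake rewrite has to reassemble those pieces from functions like MONTHS_BETWEEN, ADD_MONTHS, and DATEDIFF.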

Netezza:
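
    -- Returns an interval, e.g. "2 years 2 mons 16 days"
    SELECT AGE('2023-05-01'::TIMESTAMP, '2021-02-15'::TIMESTAMP);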

Snowflake:
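
    -- Rebuild the years, months, and days of the interval explicitly
    SELECT TRUNC(MONTHS_BETWEEN('2023-05-01'::DATE, '2021-02-15'::DATE) / 12) AS years,
           MOD(TRUNC(MONTHS_BETWEEN('2023-05-01'::DATE, '2021-02-15'::DATE)), 12) AS months,
           DATEDIFF(day,
                    ADD_MONTHS('2021-02-15'::DATE,
                               TRUNC(MONTHS_BETWEEN('2023-05-01'::DATE, '2021-02-15'::DATE))),
                    '2023-05-01'::DATE) AS days;

For these sample dates, that returns 2 years, 2 months, and 16 days, matching Netezza’s interval output.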

As you can probably imagine, if you had to do this across thousands of SQL files manually, you’d spend significantly more time translating your SQL than driving additional value on your new platform or focusing on critical components like data governance.  This is why we bring a conversion tool to our projects and customers.

Wrapping Up

We hope that by reading this update you’ve gained additional insight into how we at phData approach projects and development.  We always aim to bring a best-in-breed solution and to provide automation that accelerates time-to-value and speed-to-market for our customers.

If you’re interested in learning more about the Toolkit or about phData, give us a shout!

phData Toolkit

If you haven’t already explored the phData Toolkit, we highly recommend checking it out!

We encourage you to spend a few minutes browsing the apps and tools available in the phData Toolkit today to set yourself up for success in 2023. 

Be sure to follow this series for more updates on the phData Toolkit tools and features.
