June 26, 2023

phData Toolkit June 2023 Update

By Nick Goble

Welcome to the latest installment of the phData Toolkit blog series! We’re already halfway through the year (crazy, right?!) in this June episode of the blog. Summer is here, the kids are out of school, and people are enjoying the sunshine and warm weather. You might even be golfing like our Boston based employee Sunny Yan!

Between enjoying the weather and all the wonders that summer brings, our product development team has been hard at work bringing new features and functionality to our customers. The Toolkit aims to increase speed-to-market as well as stability for our customers.  Whether you’re performing a platform migration or augmenting your data stack with new technology, the Toolkit has you covered!

Over the course of the last month, we’ve made significant upgrades to our Toolkit CLI. The CLI not only allows you to leverage a significant amount of functionality offered through our UI-based tools programmatically but provides a number of additional features.

Let’s dive in!

What Is the Toolkit CLI

Some of you may be wondering, “Why would I leverage a CLI over a UI?” This is a great question and has a very nuanced answer. While many of our customers leverage our UI for tools like our SQL Translation or Privilege Audit tooling, there are limitations when it comes to using a UI. It’s very difficult (and not recommended) to leverage a UI within any sort of programmatic context.

Imagine that you needed to perform translations against thousands of files. You wouldn’t want to pay someone (or perform yourself) to manually copy/paste each file into a browser window and copy/paste the translated SQL back. A CLI such as the Toolkit CLI allows you to automate this translation against all the files you wish to translate.  A UI is great in comparison if you only want to translate a couple of SQL statements and don’t want the hassle of getting a CLI up and running (or don’t understand how to).

In other instances, a CLI makes sense if you need to manipulate a lot of data or if you have sensitivity from a data or security perspective. You can install and run a CLI within your own environment and control the data and network security components.

The Toolkit CLI specifically has a number of tools within it, including:

  • Advisor

    • Quickly and easily identify opportunities to improve the performance, security, and cost-efficiency of your Snowflake environment.

  • Provision

    • High-performance template-driven tool to provide customers the flexibility to define and apply their own information architecture in a standardized way

  • Data Source

    • Ability to automate data platform processes at scale with ease by targeting data sources with a variety of commands and templates

  • SQL Translation

    • Easily converts SQL from different dialects into Snowflake SQL

Now, let’s get into the updates!

Toolkit CLI Updates

While we regularly make updates to the specific tools within the Toolkit CLI, we also frequently make updates to the overall CLI itself. While each tool in the Toolkit is a separate command within the CLI, there are a set of commands that are overarching across the entire set of tools.

At phData, we believe in and practice following an agile development pattern. This means that we iterate on our software and incrementally release changes/updates to the platform. However, we’re still human, and sometimes we haven’t thought about every edge case, or we might even have a “bug” in our code.

When errors do happen, we want customers (or employees leveraging the toolkit at a customer) to have the ability to provide enough information back to the development team so we can triage and resolve the issue. To make this easier, we have added a diagnose command to the toolkit! This will automatically collect diagnostic information and package it into a zip file that can easily be provided back to the engineering team.

Another quality of life improvement that we’ve made to the Toolkit CLI is adding a shell command.  This will configure the toolkit CLI executable to be within your PATH environment variable and can also enable auto-completion. If you’ve ever had to configure your PATH variable or dealt with your operating system not being able to find a particular command, you’ll understand why this is so handy!

Data Source Tool Updates

The data source tool has a number of use cases, as it has the ability to profile your data sources and take the resulting JSON to perform whatever action you want to take. This includes things like:

  • Platform migration validation

  • Platform migration automation

  • Metadata collection and visualization

  • Tracking platform changes over time

  • Data profiling and quality at scale

  • Data pipeline generation and automation

  • dbt project generation

What makes the data source tool really interesting is that the metadata generated from profiling and scanning data sources can be used in a wide variety of ways. You can perform actions like diffing multiple sources to better understand table and column statistics for things like data migrations. You can also leverage the profiled tables against templates to programmatically generate code or scripts to perform updates as part of a migration. The use cases really are endless!

As we’ve iterated on the software, we’ve made the following fixes and updates:

  • Fixed the codegen –templates argument when the value is an individual file.

  • Fixed a profile condition validation regression caused by column filters.

  • Snowflake and Oracle scanners now use null as the default for length, precision, and scale instead of 0 to provide consistent diffing when no values are set.

SQL Translation Updates

SQL Translation is another major component of the Toolkit CLI.  As customers are performing platform migrations, they frequently need to translate existing stored procedures, data definition language (DDL), and data markup language (DML) to the target system. If you’re performing a lift and shift migration, this is even more relevant to you. If you were to try and translate thousands of SQL statements manually, it would be tedious, expensive, and error-prone. By programmatically performing the translation, you can focus your efforts on defining information architecture, implementing more data governance, and deriving business value faster.

As with any conversion tool, you have a source and a target. While we don’t support every single source/target pairing, we do focus on adding more coverage to combinations as we see customers using them. Lately, that has been Microsoft SQL Server (MSSQL) and Snowflake.

We’ve added support for the following in our MSSQL to Snowflake translation:

  • Translate Remove keyword as Identifier.

  • Translate CATALOG_COLLATION in CREATE DATABASE

  • Add BOM-aware file reading so that files with a BOM are read with the encoding specified.

  • Add encoding heuristics to detect when a file is incorrectly read as UTF-8 or UTF-16BE.

We’ve also introduced bug fixes for our MSSQL to Snowflake translation:

  • Add comments to the SET statement

  • Parse issue with COLUMN_INDEX in CREATE TABLE statement.

  • Parse issue with XMLNAMESPACES.

  • Parse issue with RAISEERROR statement in PROCEDURE.

  • Parse issue with standalone RAISEERROR statement.

  • Translate a number of keywords as identifiers.

  • Fix Boolean value expression and function.

Bringing It All Together

While we continue to iterate on the Toolkit, we want to hear any and all feedback that you can provide. We will be continuing to invest heavily towards bringing additional functionality to our users and customers through not only new functionality but any additional quality-of-life improvements we can introduce to the platform. We hope that by staying up-to-date on this Toolkit blog series, you have learned more about the Toolkit, the reasoning for the Toolkit, and what’s new over the course of the last month!

phData Toolkit

If you haven’t already explored the phData Toolkit, we highly recommend checking it out!

We encourage you to spend a few minutes browsing the apps and tools available in the phData Toolkit today to set yourself up for success in 2023. 

Be sure to follow: this series for more updates on the phData Toolkit tools and features.

Data Coach is our premium analytics training program with one-on-one coaching from renowned experts.

Accelerate and automate your data projects with the phData Toolkit