October 4, 2022

phData Toolkit September 2022 Update

By Nick Goble

Welcome back to another installment of your favorite monthly blog. The primary objective of this blog is to keep you up-to-date with the latest features and functionality that have been added to the phData Toolkit.

For those who are new to the phData Toolkit, this is a set of tools that are free to use and aim to accelerate and differentiate your platform migration, management, and auditing capabilities. 

As we continue to work with customers to modernize their data platform, increase their data maturity, and scale their data practice, we’ve identified areas where additional tooling is necessary. 

This includes functionality such as pairing infrastructure-as-code capabilities with ITSM tools, translating SQL dialects, visual auditing and best practice recommendations for your Snowflake information architecture, and our latest tool: Data Source.

Our primary focus over the last month has been on building out this Data Source tool, so let’s jump in and learn all about it!

What is the phData Data Source Tool?

Brand new to the phData Toolkit, our Data Source tool uses simple but powerful commands to automate your data platform processes at scale. Some of the use cases the Data Source tool is built to solve include:

  • Platform migration validation
  • Platform migration automation
  • Metadata collection and visualization
  • Tracking platform changes over time
  • Data profiling and quality at scale
  • Data pipeline generation and automation
  • dbt project generation

A diagram showing how the phData Data Source Tool works

The Data Source tool has the ability to connect to a number of platforms leveraging JDBC connections and SchemaCrawler to scan and profile information about:

  • Entire databases 
  • Individual tables
  • Each column within a table
  • Column data types and other metadata (like a primary key)
  • Column metrics like count, null count, min/max, and any other aggregation

Once these data points have been generated, we output profile information in JSON, YAML, or CSV file formats. From these profiles, you’re able to generate comparisons between two systems (for example, a migration source and target) or feed the output into additional scripts.
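To make the idea of a column profile concrete, here is a minimal Python sketch of the kind of metrics described above: data type, primary key flag, null count, and min/max per column. It is not the Data Source tool's implementation or output format. It uses an in-memory SQLite database instead of a JDBC connection, and the `profile_table` function, the `orders` table, and the JSON layout are all assumptions made for illustration.

```python
import json
import sqlite3


def profile_table(conn, table):
    """Return simple per-column metrics for one table (illustrative only)."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    profile = {"table": table, "row_count": row_count, "columns": {}}
    for _cid, name, col_type, _notnull, _default, pk in columns:
        # COUNT(column) skips NULLs, so the null count is the difference.
        non_null, minimum, maximum = conn.execute(
            f'SELECT COUNT("{name}"), MIN("{name}"), MAX("{name}") FROM "{table}"'
        ).fetchone()
        profile["columns"][name] = {
            "type": col_type,
            "primary_key": bool(pk),
            "null_count": row_count - non_null,
            "min": minimum,
            "max": maximum,
        }
    return profile


def demo():
    """Build a tiny hypothetical table and profile it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, region TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10.5, "east"), (2, None, "west"), (3, 7.25, None)],
    )
    return profile_table(conn, "orders")


if __name__ == "__main__":
    # Emit the profile as JSON, one of the formats mentioned above.
    print(json.dumps(demo(), indent=2))
```

Writing the result with `json.dumps` keeps the profile easy to store and to diff later; a YAML or CSV writer could be swapped in without changing the profiling logic.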

For example, you could profile your source and target systems to generate dbt scripts to replicate existing transformations. This is an experimental command that we’re still working on, which you can read more about here.

If you want to visualize the results of the differences between your source and target systems, the output of the tool can be viewed on your local machine as an HTML file. This allows you to get a user-friendly visualization that looks like the following:

A screenshot from the Data Source tool in the phData Toolkit that shows the differences between your source and target systems.

How Has the Data Source Tool Been Used Thus Far?

Cool, so we have a new tool. What does that mean for our customers?  

One of the most frequent tasks in platform migrations is ensuring that data has been accurately and fully replicated between your source and target systems. This includes row- and column-level questions such as “Do I have all the records I should have?” and “Do I have all the columns from my source?”

Each migration has multiple stages that require validation:

  • Build out of tables in new system
  • Enable loading of new data into new system
  • Copy historical data into new system
  • Connect users/consumers to the new system

These stages can vary in length from days to weeks to months. As teams continue to work on “keeping the lights on” within the old system, new columns may get added to support business use cases, and these additional columns may be missed by the migration team.

Developers may also overlook a column during the migration, which adds work and complexity later when those fields have to be brought into the new system.

This is where the Data Source tool comes in!

The Data Source tool can proactively profile both systems and provide metadata about the state of your migration. This allows you to answer questions like:

  • Am I missing data for the previous day? If so, how much?
  • How has my data changed over time?
  • Which columns have been migrated from my source to my target? Which columns do I still need to migrate?
  • How do I visualize the state of my migration when I have thousands of tables to migrate?
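As a rough illustration of how questions like these could be answered from two profiles, the Python sketch below compares a source profile with a target profile and reports missing columns, the row count gap, and columns whose metrics differ. The `compare_profiles` function and the dictionary shape are hypothetical and chosen for this example; they are not the Data Source tool's actual commands or file format.

```python
def compare_profiles(source, target):
    """Summarise differences between two table profiles (hypothetical shape)."""
    src_cols = source["columns"]
    tgt_cols = target["columns"]
    shared = sorted(set(src_cols) & set(tgt_cols))
    return {
        # Columns that exist in the source but were never migrated.
        "missing_in_target": sorted(set(src_cols) - set(tgt_cols)),
        # Columns that only exist in the target.
        "extra_in_target": sorted(set(tgt_cols) - set(src_cols)),
        # Positive value means the target is missing rows.
        "row_count_delta": source["row_count"] - target["row_count"],
        # Shared columns whose recorded metrics do not match.
        "metric_mismatches": {
            name: {"source": src_cols[name], "target": tgt_cols[name]}
            for name in shared
            if src_cols[name] != tgt_cols[name]
        },
    }


# Made-up profiles for one table in each system.
source_profile = {
    "row_count": 3,
    "columns": {
        "id": {"null_count": 0},
        "amount": {"null_count": 1},
        "region": {"null_count": 1},
    },
}
target_profile = {
    "row_count": 2,
    "columns": {
        "id": {"null_count": 0},
        "amount": {"null_count": 0},
    },
}

report = compare_profiles(source_profile, target_profile)

if __name__ == "__main__":
    print(report)
```

Running the same comparison on profiles captured on different days would give a simple view of how a migration is progressing over time.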

Customers are already using this tool to validate their migration efforts. Check it out today!

Interested In Learning More?

We hope this introduction to the Data Source tool (within the phData Toolkit) will help you understand the value that this kind of tooling brings to your migration and validation efforts. 

This free-to-use tool, in conjunction with other phData tools, significantly accelerates your platform migration by enabling you to quickly identify gaps, consistently measure your migration status, and redirect the time and energy that would have gone into manual validation toward driving business value.

phData Toolkit

If you haven’t already explored the phData Toolkit, we highly recommend checking it out!

We encourage you to spend a few minutes browsing the apps and tools available in the phData Toolkit today to set yourself up for success in 2022. 

Be sure to follow this series for more updates to the phData Toolkit tools and features. 
