February 2, 2024

Top 5 Use Cases of phData’s Data Source Tool

By Justin Delisi

Founded in 2014 by three leading cloud engineers, phData focuses on solving real-world data engineering, operations, and advanced analytics problems with the best cloud platforms and products. 

With these years of experience, we’re constantly looking for ways to leverage our knowledge to make ourselves more efficient. This search for efficiency led us to create the Data Source tool, which is part of the phData Toolkit. 

In this blog, we’ll explore the phData Toolkit, why the Data Source tool is an integral part of it, and the top five use cases for the Data Source tool.

What is the phData Toolkit?

While working on many data engineering projects, phData found patterns of issues that would come up regularly when migrating clients’ data. Instead of reinventing the wheel with each new project, we built simple tools to help speed up our engagements and provide even better service. 

This evolved into the phData Toolkit, a collection of high-quality data applications to help you migrate, validate, optimize, and secure your data. Its tools range in capability from translating SQL from one dialect to another to identifying opportunities to make your Snowflake Data Cloud account more efficient and secure; in short, the Toolkit is an automation powerhouse!

What is the Data Source Tool?

The Data Source (DS) tool is one of the cornerstones of the Toolkit, enabling users to collect, compare, analyze, and act on data source metadata and profile metrics. It's a lightweight command-line interface (CLI) that takes a low-code approach to profiling data sources, so anyone can access the power of the tool. 

The DS tool can collect metadata and data quality metrics and compare them between data sources for almost any platform, but the most commonly supported platforms include:

  • Snowflake

  • Hive/Impala

  • Oracle

  • SQL Server

Top Use Cases for the Data Source Tool

Data Migration Validation

When migrating from one platform to another, it is essential to verify that the data did not change during the migration. This can be a painstaking process, especially when many objects are being moved. Not only should row counts of tables and views be checked for consistency, but so should other profile metrics, including null counts, max/min/average of each column, etc., as well as DDL properties like data type and nullability. 

Simplifying this process is the reason the DS tool was created in the first place. 

Once connected to the source and target, the DS tool can collect metadata, run profiling metrics on both the source and target, and create an easy-to-read report with the differences it finds. It can be run for one table, many tables, or an entire database to give you a full report on how your migration is going.

In the sample HTML report the DS tool provides, this example shows that the client and loan tables were only found in the source and that the district and trans tables have a row count difference. 

You can then drill down into individual columns to see the other metrics that profiling created.
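
For context, here is a rough sketch of the kind of manual comparison the DS tool automates. This is plain SQL you would otherwise run by hand against both systems; the table and column names below are hypothetical, not DS tool output:

-- Profile one table in the source system; repeat the same query against
-- the target system and compare the results, for every table and column.
SELECT
    COUNT(*)                 AS row_count,
    COUNT(*) - COUNT(amount) AS amount_null_count,
    MIN(amount)              AS amount_min,
    MAX(amount)              AS amount_max,
    AVG(amount)              AS amount_avg
FROM source_db.public.loan;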

Data Quality Checks

Data quality can also be checked using the DS tool. Similar to how one can use it for data migration validation, you can run profiles on a single source to check the quality of your data. There are many metrics built in that can be useful for quality checks, such as NULL count, max and min, and min length. 

What makes the DS tool stand out is that it allows you to easily create your own metrics that ensure the data follows your business rules. 

For instance, here is a custom metric to check whether all values in a table's numeric columns fall within the range of 0-100:

{
    name = "out_of_range_count"
    description = "The count of out of range values"
    expression = "SUM(CASE WHEN {{ column_name }} NOT BETWEEN 0 AND 100 THEN 1 ELSE 0 END)"
    includeTypes = [numeric]
    enabled = true
}

Since a specific quality check such as this one may only be useful on a subset of tables, it can quickly be enabled or disabled by setting enabled to true or false. 
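
For illustration only: applied to a hypothetical numeric column named amount in a hypothetical loan table, the templated expression above would expand to roughly the following SQL aggregate (the surrounding query is an assumption for clarity, not actual DS tool output):

-- Rendered form of the custom metric for one column
SELECT
    SUM(CASE WHEN amount NOT BETWEEN 0 AND 100 THEN 1 ELSE 0 END) AS out_of_range_count
FROM loan;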

Running the DS tool for data quality checks on a monthly, weekly, or even daily basis can help track data quality over time.

Metadata Collection

Metadata collection is the process of gathering and storing information about the structure, organization, and attributes of the data stored within a database. This includes data types, primary and foreign keys, and other constraints. 

This information helps facilitate efficient data management, querying, and maintenance operations. It provides a comprehensive understanding of the database schema, enabling administrators, developers, and users to navigate, manipulate, and optimize the database effectively.

Many organizations will rely on manually created spreadsheets or the database system itself to display metadata information about objects, but spreadsheets can be inaccurate, and the output from database systems can often be hard to read. 
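
As a point of comparison, this is the kind of raw query teams often run by hand to pull column metadata. It uses the standard INFORMATION_SCHEMA views, not the DS tool, and its flat tabular output is exactly the sort of thing that gets hard to read at scale:

-- Pull data types and nullability for every column in a schema
SELECT
    table_name,
    column_name,
    data_type,
    is_nullable
FROM information_schema.columns
WHERE table_schema = 'PUBLIC'
ORDER BY table_name, ordinal_position;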

The DS tool is perfectly suited to help in this regard. 

The DS tool presents metadata to the user in an easy-to-read HTML output, including data types, keys, and other constraints. In this example report, the DS tool shows that the number column is an integer, does not have a NOT NULL constraint, and is not a primary key of the table.

Promote Database Changes

The proper process for making changes to a database is to develop them in a non-production development (DEV) environment, promote them to a User Acceptance Test (UAT) environment for quality checks, and then promote them again to your production environment. 

However, while developing in DEV, engineers may not track the changes they make to DDL, or they may have created 20 objects they now need to keep track of. Whatever the case, capturing those changes and applying them to the next environment is not a simple task. 

The DS tool solves that problem with its code-gen feature. By performing a DDL scan on both the lower and upper environments, the DS tool can see the differences between an object in one environment and the other, and can even identify objects that are missing from one environment entirely. 

Using these differences, the DS tool creates SQL statements to align the environments and can even execute the statements automatically.

For example, we can run this command against the latest difference file between PROD and DEV to generate the SQL required to bring PROD up to speed with DEV:

>> toolkit ds codegen dev.prod:diff:latest --templates tools/ds/templates/diff/drift/sqlite --output .

This produces an output file containing the changes that need to be made to PROD:

ALTER TABLE account ADD COLUMN frequency VARCHAR NOT NULL DEFAULT 0;

ALTER TABLE account DROP COLUMN number;

CREATE TABLE IF NOT EXISTS client (
    client_id INT NOT NULL,
    gender VARCHAR NOT NULL,
    birth_date DATE NOT NULL,
    district_id INT NOT NULL
);
CREATE TABLE IF NOT EXISTS loan (
    loan_id INT NOT NULL,
    account_id INT NOT NULL,
    date DATE NOT NULL,
    amount INT NOT NULL,
    duration INT NOT NULL,
    payments DECIMAL NOT NULL,
    status VARCHAR NOT NULL
);

This can simply be copied and pasted into our system, or run automatically using the exec command:

>> toolkit ds exec prod --file table-operations.sql

This powerful command makes promoting code to the next environment much simpler and more efficient.

Track Database Changes

Tracking changes to the DDL of your database can be difficult. With many data engineers or database administrators working on your system, some changes may not get committed to version control, or you may not be using version control at all. 

The DS tool has a fantastic feature called SQL-collect, which collects all the DDL statements for objects in your system. This would be ideal to run regularly to track what your database looked like historically. 

If one of your pipelines breaks, this helps diagnose the problem because you can see whether any changes were recently made to the tables involved in that pipeline. 
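
As a rough illustration of why regular snapshots help, here is a manual check you could run in Snowflake to see which tables were recently altered. It only tells you when something changed; regularly collected DDL snapshots from the DS tool let you see what changed. The schema name and seven-day window are hypothetical:

-- Find tables whose definitions changed in the last 7 days (Snowflake)
SELECT
    table_schema,
    table_name,
    last_altered
FROM information_schema.tables
WHERE table_schema = 'PUBLIC'
  AND last_altered >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY last_altered DESC;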

Other Tools in the Toolkit

The Data Source tool is just one of the many powerful tools we've created for the phData Toolkit. We recommend checking them all out to see how they can benefit your organization, but here's a quick overview of our most popular tools:

SQL Translation

Automated translation is indispensable when migrating between data platforms. The SQL Translation application instantly converts queries from one SQL dialect to another. Essentially, it functions like Google Translate — but for SQL dialects. 

SQL Translation helps save time and minimize errors whenever you need to translate SQL from one dialect to another. It is also a helpful tool for learning a new SQL dialect. Users can write queries they are familiar with and use the translation result to learn the equivalent syntax.
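
To make the idea concrete, here is a small, hand-written example of the kind of dialect difference such a translator handles. This is an illustration only, not output from the SQL Translation application:

-- SQL Server dialect
SELECT TOP 10 account_id, GETDATE() AS run_time
FROM dbo.account
ORDER BY account_id;

-- Equivalent Snowflake dialect
SELECT account_id, CURRENT_TIMESTAMP() AS run_time
FROM account
ORDER BY account_id
LIMIT 10;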

Advisor Tool

With the Advisor tool, you can quickly and easily identify opportunities to improve the configuration, security, performance, and efficiency of your Snowflake environment. We have worked closely with our Snowflake data and operations experts to ensure our Advisor tool delivers actionable recommendations based on best practices and phData’s vast real-world experience on the Snowflake Platform.

Provision Tool

A major bottleneck to getting value from a Snowflake Data Cloud environment is creating the databases, schemas, roles, and access grants that make up a data system information architecture. Creating, applying, and evolving this information architecture can be a time-consuming and error-prone process. 

Our template-driven Provision tool automates onboarding users and projects to Snowflake, allowing your data teams to start producing real value immediately.
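
For a sense of what gets automated, here is a hand-written sketch of the kind of Snowflake statements an information architecture involves for a single new project. The object and role names are hypothetical, and this is not the Provision tool's template output:

-- Objects and access for a hypothetical "sales" project in DEV
CREATE DATABASE IF NOT EXISTS sales_dev;
CREATE SCHEMA IF NOT EXISTS sales_dev.raw;

CREATE ROLE IF NOT EXISTS sales_dev_read;
GRANT USAGE ON DATABASE sales_dev TO ROLE sales_dev_read;
GRANT USAGE ON SCHEMA sales_dev.raw TO ROLE sales_dev_read;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_dev.raw TO ROLE sales_dev_read;
GRANT ROLE sales_dev_read TO USER analyst_1;

-- ...then repeat for write roles, UAT and PROD, and every new project.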

Closing

phData is committed to using our vast knowledge of data engineering to improve the Data Source tool and all other tools in the Toolkit to increase the efficiency of our engineers and our clients’ engineers. 

With these applications, anyone can drastically reduce time spent on the most time-consuming tasks, such as data quality checks and pipeline generation. 

Learn more about our growing list of capabilities and automation features by visiting the phData Toolkit homepage today!
