The Snowflake Data Cloud allows data teams to bring value to businesses rapidly. However, if a team isn't careful, it can create a spider web of interconnected processes that is a nightmare to maintain. That's why it is essential for development teams to lay a foundation by following best practices and putting standards in place.
At its most fundamental, standardization in Snowflake is simply having a plan. That plan can include many pieces: a common way to name objects, a process for releasing new code to production, a consistent approach to transforming data, and more.
In this blog, we’ll explore the various approaches to help your business standardize its Snowflake environment.
Why is Standardization Important?
Before going into how to combat the lack of standardization in Snowflake, it is crucial to understand why standardization is important.
Increased Trust From the Business
One of the first issues that a team likely faces without having standards in place is a lack of trust from the business. It only takes a couple of instances where a bug is deployed into production to erode trust and create problems for the team. Having standards in place reduces the risk of bad code, bad data, and bad processes.
Easier Developer Onboarding
When your code and processes are standardized, new developers and engineers can onboard much more quickly. Conversely, without consistent naming conventions for objects in Snowflake, it becomes challenging for newcomers to understand the logic and locate specific items without relying on the tribal knowledge of more experienced team members.
Implementing standards helps to create a clean and organized environment, making it easier for new team members to understand and work with the code and processes.
Easier Business Onboarding
Adhering to naming conventions and patterns can significantly improve the experience of business users when working with the data provided. For example, when fields are clearly defined and easy to locate, users are more likely to trust and use the data.
On the other hand, if the data is hard to find and poorly defined, business users will likely return to their beloved spreadsheet on the shared drive.
Faster Development
When patterns and standards have been established, developers can move much more quickly. Without a pattern to follow, every project starts from scratch, causing the team to solve the same problems repeatedly, and often in multiple, inconsistent ways.
By standardizing, the team can follow the same repeatable patterns and save time and effort. It also is a heck of a lot easier to maintain.
Improved Data Quality
Good testing is an essential part of ensuring the integrity and reliability of data. Without testing, it is difficult to know whether the data is accurate, complete, and free of errors. A sound testing strategy requires a plan, and to be scalable, that plan must be consistent and standardized across all objects.
Below, we will walk through some baseline tests every team could and should run to ensure data quality.
Approaches to Standardize your Snowflake Environments
Create Naming Standards
The key to having a clean, understandable Snowflake environment begins with good naming standards. The best place to start is to create a table outlining various Snowflake objects and a naming standard to be used by your team when creating the objects.
You will also want to think through the flow and movement of the data in your environments. For example, at phData, we recommend breaking your data into three main areas: Raw, Staging, and Transformed (these layers may be called different things, but ultimately exist in some fashion).
You might want three different databases or schemas depending on your use case. Either way, you should have a common way to name objects so one can instantly know which part of the data layer the data resides in.
At phData, we have built a solution called Tram that helps manage account objects in Snowflake in a reusable/repeatable way.
Things to Take into Consideration
- Snowflake identifiers can be up to 255 characters long, so there is effectively no limit to how specific your names can be. Some teams prefer abbreviating where possible; however, abbreviations make onboarding harder for anyone who is not already familiar with the data. Spelling names out fully leaves no question about what a name means.
- Snowflake identifiers are always stored as uppercase (unless quoted), so it is recommended to separate words with underscores instead of camelCase or PascalCase. For example, fnMyCoolFunction will be shown in the Snowflake UI as FNMYCOOLFUNCTION, which is difficult to read.
- While Snowflake allows spaces in column names if you wrap them in double quotes, this can lead to unintended consequences and great annoyance when querying the data later. To keep things consistent, we recommend always separating words with an underscore.
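As a quick illustration of the casing and quoting behavior described above (the table and column names here are hypothetical):

```sql
-- Unquoted identifiers are folded to uppercase and are case-insensitive.
CREATE OR REPLACE TABLE demo_orders (
    order_count   NUMBER,       -- stored as ORDER_COUNT
    "Order Total" NUMBER        -- stored exactly as written, space and all
);

SELECT order_count FROM demo_orders;     -- works: resolves to ORDER_COUNT
SELECT OrDeR_CoUnT FROM demo_orders;     -- also works: same identifier

SELECT "Order Total" FROM demo_orders;   -- must match quotes, case, and space
-- SELECT Order Total FROM demo_orders;  -- fails without the double quotes
```

Every future query against the quoted column has to repeat the exact quoting, which is why unquoted, underscore-separated names are the safer default.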
Here is an example of a very simple naming convention as a starter (the specific conventions below are only illustrative; adapt them to your team):
- Databases: <ENV>_<DOMAIN>, e.g., PROD_SALES
- Schemas: named for the data layer, e.g., RAW, STAGING, TRANSFORMED
- Tables: underscore-separated nouns, e.g., CUSTOMER_ORDER
- Views: VW_<NAME>, e.g., VW_CUSTOMER_ORDER
- Stored procedures: SP_<ACTION>_<NAME>, e.g., SP_LOAD_CUSTOMER_ORDER
- Tasks: TASK_<NAME>, e.g., TASK_LOAD_CUSTOMER_ORDER
Use Pull Requests
The number one way to ensure that good code is being added to any environment is through Pull Requests (PR) or code reviews. This is because other team members can look at the code and make comments, thereby catching code that could cause issues downstream.
PRs assume that your code lives in a version control system like Git. By default, PRs promote asynchronous review, which means reviewers can examine the code on their own time rather than being walked through it.
Major Benefits of PRs
- Reduces the risk of introducing bugs into your code
- Allows the team to move faster
- Allows for automated testing (CI) built into the PR
- Improves code quality
- Historical context of why code is a certain way
Use a Linter
Linting has long been part of software engineering but is more difficult with SQL code and, thus, is not implemented by many data teams. Thankfully there are open-source projects that don’t make you parse SQL into grammars yourself (ain’t nobody got time for that!), such as SQLFluff.
SQLFluff will not only automatically format code, but it can also check for many anti-patterns and flag them before your code is merged into main. Many rules come out of the box, and you can create custom rules if you desire. You could even write rules to enforce the naming conventions defined above! The possibilities are nearly endless.
SQLFluff can be implemented locally via their CLI tool or in a CI environment. This means that you can ensure your code is formatted and passes all basic rules before any team members review the code in the Pull Request.
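As a minimal sketch, a project-level .sqlfluff configuration file might look like the following (the rule names and values are examples only; consult the SQLFluff documentation for the current rule set):

```ini
[sqlfluff]
# Parse using Snowflake's SQL dialect
dialect = snowflake
templater = raw
max_line_length = 120

[sqlfluff:rules:capitalisation.keywords]
# Enforce uppercase SQL keywords (SELECT, FROM, ...)
capitalisation_policy = upper
```

With a file like this in the repository root, `sqlfluff lint my_model.sql` flags violations and `sqlfluff fix my_model.sql` applies the automatic fixes; both commands can run locally or as a CI step.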
SQL Linting saves tons of time and ensures your team is looking for deeper logical issues in the PR instead of basic naming and formatting mistakes.
Use Metadata Driven Pipelines When Possible
Standardizing ETL/ELT processes can be difficult. With the many low-code tools available, it is very easy to clone one pipeline to create a second, and after some time, you have 50 pipelines that are nearly identical but have repeated logic and/or code everywhere.
This is what we should avoid as software engineers specializing in data. We should always try to adhere to the DRY (don’t repeat yourself) principle when writing pipelines.
Metadata-driven pipelines can help cut down on repeated code. By parameterizing one main pipeline, you can often use a configuration file to manage the tables flowing through the pipeline.
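To make the idea concrete, here is a hedged sketch of a config-driven loader in Snowflake Scripting. All table, stage, and procedure names are hypothetical; the point is that one parameterized procedure replaces N cloned pipelines:

```sql
-- Configuration table: one row per source feeding the pipeline.
CREATE TABLE IF NOT EXISTS pipeline_config (
    source_path  VARCHAR,   -- path under the stage where files land
    target_table VARCHAR,   -- fully qualified destination table
    file_format  VARCHAR    -- named file format used to parse the files
);

CREATE OR REPLACE PROCEDURE run_ingestion()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    c1 CURSOR FOR SELECT source_path, target_table, file_format
                  FROM pipeline_config;
BEGIN
    -- Adding a table to the pipeline is now a config INSERT, not new code.
    FOR rec IN c1 DO
        LET stmt VARCHAR := 'COPY INTO ' || rec.target_table ||
            ' FROM @ingest_stage/' || rec.source_path ||
            ' FILE_FORMAT = (FORMAT_NAME = ''' || rec.file_format || ''')';
        EXECUTE IMMEDIATE stmt;
    END FOR;
    RETURN 'Ingestion complete';
END;
$$;
```

The same pattern applies in low-code tools: one parameterized pipeline plus a configuration file or table, instead of dozens of near-identical clones.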
Interested in exploring the most popular native methods for data ingestion in Snowflake? Don’t miss our comprehensive blog!
Top Reasons to Use Metadata-Driven Pipelines
- Reusability: Metadata-driven pipelines allow for the reuse of code and processes, saving time and resources. For example, if you have a process that is used frequently, you can define it once and reuse it throughout the pipeline.
- Maintainability: Metadata-driven pipelines are easier to maintain because changes to the pipeline can be made by modifying the metadata rather than the code. This makes it easier to update and maintain the pipeline over time.
- Collaboration: Metadata-driven pipelines are helpful because they allow different team members to work on different parts of the pipeline without having to understand the entire codebase.
Create Standard Process Patterns
As a team defines the various stages of data as it flows through their Snowflake environment, it is essential to document a few standard patterns. Very likely there will not be one simple pattern, so planning ahead for multiple patterns is appropriate.
What is a Pattern?
A pattern, in this instance, can be understood as a logical combination of objects and data movement that fit together. For example, consider a very small data team of two or three people using Snowflake. They may not have the largest datasets and can get away with moving from Raw data directly to their Data Marts. They might do that with a CREATE OR REPLACE TABLE AS… statement directly against the raw source data.
It wouldn't make sense for one developer to do that half the time while another loads target tables with MERGE statements. Both might be valid approaches, but the team should agree on a single one for most use cases.
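For illustration, here are the two load patterns side by side (table and column names are hypothetical). Either can be valid; the point is to standardize on one:

```sql
-- Pattern A: full rebuild straight from raw.
-- Simple and fine for small datasets, but rewrites the whole table.
CREATE OR REPLACE TABLE mart.dim_customer AS
SELECT customer_id, customer_name, region
FROM   raw.crm_customers;

-- Pattern B: incremental MERGE into the target.
-- More code, but only touches changed rows and scales better.
MERGE INTO mart.dim_customer AS tgt
USING raw.crm_customers AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
    tgt.customer_name = src.customer_name,
    tgt.region        = src.region
WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, region)
    VALUES (src.customer_id, src.customer_name, src.region);
```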
We’ve found that including the entire data team is very helpful and important when making these decisions on what standard patterns and templates to follow.
Using a transformation tool like dbt would eliminate many of the decisions in this section, since dbt manages much of the underlying materialization of the data for you.
Write Re-usable Unit Tests
Unit testing is common and expected on high-functioning software development teams; however, it is sorely lacking in the data landscape. This is because you need to test not only your schemas and structures, but also the actual data.
Many up-and-coming tools like dbt have automated testing built in. And there are even more Data Observability products/companies building testing directly into their platforms (Great Expectations, Soda.io, Datafold, Monte Carlo, and many more).
But if you don’t have any of those tools, there are some very simple tests you can create just within Snowflake.
1. Duplicate Key Tests
- Keys in Snowflake are not enforced (for good reasons), which can be a shock to someone used to a legacy relational database. Having a test that finds these duplicates is important.
- An easy way to accomplish this is to use a metadata approach.
- Create a table with two columns (Table, Primary Keys).
- Write a procedure that loops through each row in the table and runs a query like SELECT <primary key column>, COUNT(*) FROM my_table GROUP BY <primary key column> HAVING COUNT(*) > 1.
- You can schedule the procedure via a task and fail the task when there are errors (assuming you have monitoring on failed tasks).
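A minimal sketch of this metadata-driven test in Snowflake Scripting might look like the following (all names are hypothetical, and you would likely raise an exception rather than return a string so the task actually fails):

```sql
-- Metadata table: which tables to test, and on which key columns.
CREATE TABLE IF NOT EXISTS dq_key_tests (
    table_name   VARCHAR,   -- fully qualified table name
    primary_keys VARCHAR    -- comma-separated key column list
);

CREATE OR REPLACE PROCEDURE check_duplicate_keys()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
    failures VARCHAR DEFAULT '';
    c1 CURSOR FOR SELECT table_name, primary_keys FROM dq_key_tests;
BEGIN
    FOR rec IN c1 DO
        -- Count how many key values appear more than once.
        LET stmt VARCHAR := 'SELECT COUNT(*) AS dup_groups FROM (SELECT '
            || rec.primary_keys || ' FROM ' || rec.table_name
            || ' GROUP BY ' || rec.primary_keys
            || ' HAVING COUNT(*) > 1)';
        LET rs RESULTSET := (EXECUTE IMMEDIATE :stmt);
        FOR r IN rs DO
            IF (r.dup_groups > 0) THEN
                failures := failures || rec.table_name || ' ';
            END IF;
        END FOR;
    END FOR;
    IF (failures <> '') THEN
        RETURN 'FAILED - duplicate keys in: ' || failures;
    END IF;
    RETURN 'All key tests passed';
END;
$$;
```

Adding a new table to the test suite is then just an INSERT into dq_key_tests rather than a new procedure.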
2. Data Growth Checks
- Taking the idea of reusable tests to the next level is tracking data growth and alerting based on anomalies.
- To accomplish this, create a stored procedure capturing a snapshot of the row counts for all tables in your database.
- Then, simply compare yesterday’s count to today’s count for a given table.
- Pick a threshold, such as a >25% day-over-day difference, or use a z-score for simple anomaly detection. A good fail-safe test is flagging any table with 0 rows; that almost always means something very bad happened.
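A hedged sketch of this check, using a hypothetical snapshot table and Snowflake's INFORMATION_SCHEMA row counts:

```sql
-- Daily snapshot of row counts (populate via a scheduled task).
CREATE TABLE IF NOT EXISTS row_count_history (
    snapshot_date DATE,
    table_name    VARCHAR,
    row_count     NUMBER
);

INSERT INTO row_count_history
SELECT CURRENT_DATE,
       table_catalog || '.' || table_schema || '.' || table_name,
       row_count
FROM   my_db.information_schema.tables
WHERE  table_type = 'BASE TABLE';

-- Flag tables that are empty or changed by more than 25% day over day.
SELECT t.table_name,
       y.row_count AS yesterday_count,
       t.row_count AS today_count
FROM   row_count_history t
JOIN   row_count_history y
  ON   y.table_name    = t.table_name
 AND   y.snapshot_date = t.snapshot_date - 1
WHERE  t.snapshot_date = CURRENT_DATE
  AND (t.row_count = 0
       OR ABS(t.row_count - y.row_count) > 0.25 * NULLIF(y.row_count, 0));
```

The final query can be wrapped in a procedure and run from a task, failing the task whenever it returns rows.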
3. Begin Writing Data-Specific Tests
- These tests can be added to a specific column on a specific table.
- By using the framework described above, these tests will run every time a change is made to the table. Therefore, you will know if a change to the table or new data caused your test to fail.
4. Source to Target Tests
- One of the most common exercises a data team will complete is loading data from a source database into a target database.
- Having a framework to test the record counts, data types, etc. is very important to ensure issues are not introduced into the downstream environment.
- There are many options, such as writing your own framework, using a full-fledged SaaS product, or leveraging phData's Data Source tool.
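At its simplest, a homegrown source-to-target check is just a reconciliation query. A minimal sketch (table names are hypothetical) comparing row counts between a landed copy of the source and the target:

```sql
-- Returns a row only when the source and target counts disagree.
SELECT COALESCE(s.tbl, t.tbl) AS table_name,
       s.cnt AS source_rows,
       t.cnt AS target_rows
FROM (SELECT 'CUSTOMERS' AS tbl, COUNT(*) AS cnt FROM raw.customers)  s
FULL OUTER JOIN
     (SELECT 'CUSTOMERS' AS tbl, COUNT(*) AS cnt FROM mart.customers) t
  ON s.tbl = t.tbl
WHERE COALESCE(s.cnt, -1) <> COALESCE(t.cnt, -1);
```

A real framework would generate one such comparison per table from metadata and extend it to data type and checksum comparisons.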
Using some of these tests can go a long way in standardizing data and processes, which inevitably leads to better-quality data and a more trusted environment.
Automate, Automate, Automate
Many of the approaches covered earlier that combat the lack of standardization use automation. This is because, as humans, we tend to deviate from things that could be repeatable. For example, let’s say you had to carry out a set of instructions that had 50 steps, and you had to do it ten times in a row. How many steps would you miss or repeat? Probably more than a computer if you gave it the same instructions!
Automation and Standards go hand-in-hand in that when you automate more, your environment naturally becomes more standardized. If you automate low-level tasks like naming conventions or metadata-driven processes, your team can worry more about the quality of the data and logic needed to transform data into business value.
Standardization in Snowflake is essential for building trust with the business, facilitating developer and business onboarding, speeding up development, and ensuring data quality. Without standards, teams risk creating a confusing and unreliable environment that is difficult to maintain and use.
To standardize a Snowflake environment, teams can implement a range of approaches, including creating naming standards, reviewing code through pull requests and linting, adopting metadata-driven pipelines, establishing standard process patterns, and implementing testing strategies.
By following these best practices, data teams can ensure that their Snowflake environment is organized, efficient, and effective at delivering value to the business.
phData’s team of experts can help you get started on building a plan to combat a lack of standardization. As the 2022 Snowflake Partner of the Year, phData thrives at helping businesses of all sizes succeed with Snowflake. Explore our Snowflake consulting services today!