July 12, 2023

How to use Snowflake’s Features to Build a Scalable Data Vault Solution

By Justin Delisi

Businesses today are grappling with vast amounts of data coming from diverse sources. To effectively manage and harness this data, many organizations are turning to a data vault—a flexible and scalable data modeling approach that supports agile data integration and analytics. 

Among the platforms that can support a data vault solution, the Snowflake Data Cloud stands out as a front-runner. 

From its cloud-native architecture to its elastic scalability and parallel loading features, Snowflake offers a range of features that support the development and maintenance of a highly adaptable and scalable data vault.

At phData, we’ve had the privilege of helping many clients successfully implement data vaults using Snowflake, witnessing some truly impressive results in the process.

In this blog, we’ll take a deep dive into why Snowflake helps make our customers successful while exploring the core principles of the data vault methodology and discussing how Snowflake’s unique capabilities align with and enhance those principles. 

Understanding Data Vault Architecture

Data vault architecture is a data modeling and data integration approach that aims to provide a scalable and flexible foundation for building data warehouses and analytical systems. It is designed to address challenges such as changing business requirements, data integration complexity, and data lineage tracking.

Data vault does involve a learning curve, as it differs from more traditional data modeling architectures.

At phData, we’ve implemented many data vaults for our customers and have leveraged our knowledge to consult clients on optimal implementations of a data vault. Before these optimizations can occur, we’ve found it’s imperative to understand the basics of the methodology.

Data vaults are created with three core structures:

  • Hubs
    • Tables that each represent a core business entity within a company.
    • Contain a defined business key and a few mandated data vault fields.
    • Business keys are required to be unique per row.
    • Do not contain any context or details about the business entity.
  • Links
    • Tables that define a relationship between two or more Hubs.
    • Contain no context or details about the related entities.
    • Each row represents the relationship between entities, identified by their business keys.
  • Satellites
    • Tables that contain all contextual information about entities.
    • Each row contains a unique key, a row entry date, and the business key that links it to the entity’s Hub.

Optionally, a fourth structure can be added, called a Point in Time (PIT) table. PIT tables:

  • Store snapshots of the core business data at different points in time.
  • Designed to simplify querying of business objects.
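
To make these structures concrete, below is a minimal DDL sketch of a hub, a link, and a satellite. The table and column names are illustrative, not prescriptive; only the hash key, load date, and record source columns reflect the mandated data vault fields.

    -- Hub: one row per unique business key
    CREATE TABLE hub_product (
        product_hash_key  VARCHAR(32)   NOT NULL,  -- hash of the business key
        product_id        VARCHAR       NOT NULL,  -- business key, unique per row
        load_date         TIMESTAMP_NTZ NOT NULL,
        record_source     VARCHAR       NOT NULL
    );

    -- Link: relates two (or more) hubs, with no descriptive context
    CREATE TABLE link_order_product (
        order_product_hash_key VARCHAR(32)   NOT NULL,  -- hash of both business keys
        order_hash_key         VARCHAR(32)   NOT NULL,
        product_hash_key       VARCHAR(32)   NOT NULL,
        load_date              TIMESTAMP_NTZ NOT NULL,
        record_source          VARCHAR       NOT NULL
    );

    -- Satellite: all descriptive attributes, one row per change
    CREATE TABLE sat_product (
        product_hash_key  VARCHAR(32)   NOT NULL,  -- links back to the hub
        load_date         TIMESTAMP_NTZ NOT NULL,  -- row entry date
        record_source     VARCHAR       NOT NULL,
        name              VARCHAR,
        description       VARCHAR,
        price             NUMBER(10,2)
    );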

These three structures are then composed into three distinct layers:

  • Raw Vault
    • The raw vault is the foundation of the architecture and is designed to store the raw data from source systems without any business rules or transformations applied to it.
  • Business Vault
    • The business vault extends the raw vault by applying hard business rules, such as data privacy regulations or data access policies, along with functions that most business users will find useful, rather than repeating that work across multiple marts.
    • This vault is an entirely new set of tables built off of the raw vault, akin to a separate layer in a data warehouse with “cleaned” data.
  • Information Mart
    • The information mart is the final stage, where the data is optimized for analysis and reporting.
    • Soft business rules, such as naming conventions and data quality standards, may be applied at this stage to refine the data further and ensure that it meets the specific needs of business users. This layer typically comprises dimensions and facts.

Using data vault can address many challenges, but it also has its drawbacks. It is an insert-only methodology, so tables accumulate large amounts of data and continue to grow over time. This creates a need for a data warehousing solution that can scale with this data in both storage and compute power.

Data vault modeling has many advantages, but the insert-only methodology and the increasing complexity of the model as more fields and concepts are added have traditionally created scaling and query performance challenges for legacy technologies.

Fortunately, as we’ll see, Snowflake has the features and capabilities to allow for successful data vault implementations.

Building a Scalable Data Vault Solution with Snowflake

When designing a scalable data vault solution in any data warehousing system, there are several key considerations to keep in mind.

These design considerations help ensure that the data vault can accommodate growing data volumes, changing business needs, and evolving technological requirements.

Luckily, there is Snowflake. 

Snowflake is a cloud-based data platform designed to address the challenges of modern data management. Its architecture and key features are tailored to deliver a highly scalable, flexible, and performant solution for data storage, processing, and analytics.

Some of the features relevant to data vault include:

Scalable Architecture

Snowflake’s architecture allows users to scale compute resources up or down independently from storage, providing seamless elasticity to handle fluctuating workloads and data volumes.

A data vault needs to be essentially infinitely scalable, as new data is only ever inserted. This is very difficult with a traditional on-premises solution, where more storage must be added manually as existing capacity fills up.

With Snowflake’s pay-as-you-go elastic storage capability, this is not a concern at all.
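
As a small illustration (the warehouse name is hypothetical), compute can be resized with a single statement and without touching the data already stored, for example to absorb a heavy nightly vault load:

    -- Scale the loading warehouse up before a large batch load...
    ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'XLARGE';

    -- ...and back down once the load completes
    ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'XSMALL';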

Parallel Data Loading

Snowflake’s multi-table insert is a powerful feature that allows you to insert data into multiple tables using a single SQL statement. It simplifies and optimizes the process of loading data into multiple destination tables, reducing the need for numerous individual insert statements or complex data manipulation operations. 

By leveraging Snowflake’s multi-table insert (MTI) feature, you can streamline the process of loading data into the various tables of a data vault.

This increases load speed by performing the inserts in parallel and helps keep the vault consistent, since related data lands in all of the tables simultaneously.

The syntax for using MTI is also simple.

For example, if we have two tables to insert into (a hub and a satellite), we can select the necessary columns from our source and insert them directly into both hub and satellite in parallel.
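
A minimal sketch of what that could look like, reusing the illustrative HUB_PRODUCT and SAT_PRODUCT tables from earlier (the staging table and hashing approach are assumptions, not a prescribed pattern):

    INSERT ALL
      INTO hub_product (product_hash_key, product_id, load_date, record_source)
        VALUES (product_hash_key, product_id, load_date, record_source)
      INTO sat_product (product_hash_key, load_date, record_source, name, description, price)
        VALUES (product_hash_key, load_date, record_source, name, description, price)
    SELECT MD5(product_id)      AS product_hash_key,
           product_id,
           CURRENT_TIMESTAMP()  AS load_date,
           'ERP'                AS record_source,
           name,
           description,
           price
    FROM   stg_product;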

Metadata Management

Snowflake’s built-in metadata management capabilities provide significant benefits when implementing a data vault architecture. Snowflake’s query optimizer leverages the metadata about table statistics to optimize query execution plans. This can significantly improve query performance on queries involving multiple joins. 

Snowflake also has a result cache feature, which holds the results of every query for the last 24 hours and is available across all virtual warehouses. If the same query is run more than once (and the results haven’t changed), Snowflake will simply return the same results, bypassing query processing altogether. This can substantially reduce query time because Snowflake bypasses query execution and, instead, retrieves the result directly from the cache.

Going one step further, materialized views can be created for common, repeated query patterns, significantly improving performance for long-running queries in your data vault.
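
For instance (reusing the illustrative SAT_PRODUCT table, and noting that Snowflake materialized views are limited to a single table with no joins and require Enterprise Edition or above), a frequently repeated aggregation could be precomputed like this:

    CREATE MATERIALIZED VIEW mv_product_change_summary AS
    SELECT product_hash_key,
           COUNT(*)       AS version_count,
           MAX(load_date) AS latest_load_date
    FROM   sat_product
    GROUP  BY product_hash_key;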

Creating Cluster Keys

Snowflake’s clustering feature works by grouping data in a table based on the values in one or more columns. When data is clustered, rows with similar values in the specified columns are stored together on disk, creating data blocks that contain related data. This clustering arrangement benefits query performance because it increases the likelihood that the data accessed by a query is physically located in the same or nearby data blocks.

Creating clustering keys on hubs, satellites, and links does not significantly improve performance because of the unique hash key and load timestamp design.

However, in Snowflake, using clustering keys on PIT (Point-In-Time) tables can significantly improve query performance and storage efficiency. PIT tables are used to store historical snapshots of data at different points in time and are similar to a fact table from a star schema.

For particularly large PIT tables, if a column is being used frequently in joins and filters of queries, creating a clustering key on it will improve query performance overall.

As a simple illustration, a PIT table with a cluster key might be created as follows (the table and column names are hypothetical):
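
    CREATE TABLE pit_product (
        product_hash_key             VARCHAR(32)   NOT NULL,
        snapshot_date                DATE          NOT NULL,
        sat_product_load_date        TIMESTAMP_NTZ,
        sat_product_price_load_date  TIMESTAMP_NTZ
    )
    CLUSTER BY (snapshot_date, product_hash_key);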

Note: If clustering by more than one column, the column with the lower cardinality should be listed first, as this is optimal in Snowflake.

Best Practices and Tips for Success

Once you have your data vault solution up and running in Snowflake, there are a few tips to keep it running smoothly and efficiently. 

Monitoring and Performance Tuning in Snowflake

By leveraging Snowflake’s built-in performance monitoring capabilities, you can gain insights into query performance, identify optimization opportunities, and ensure efficient resource utilization within your data vault architecture.

Snowflake automatically maintains views for query history, warehouse utilization, and warehouse concurrency levels that users can monitor. Snowflake also provides extensive query profiling and query execution plans to monitor individual query performance.
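
For instance, the ACCOUNT_USAGE.QUERY_HISTORY view can be used to surface the slowest recent queries; a minimal sketch (the one-day window and row limit are arbitrary choices):

    -- Find the ten slowest queries from the last day
    SELECT query_id,
           query_text,
           warehouse_name,
           total_elapsed_time / 1000 AS elapsed_seconds
    FROM   snowflake.account_usage.query_history
    WHERE  start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
    ORDER  BY total_elapsed_time DESC
    LIMIT  10;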

For a basic example, take a look at the query profile provided by Snowflake for the following query:

[Images: the example query and the query profile Snowflake generated for it]

We can see that the self-join is creating a Cartesian product (or “exploding join”), causing 1,635 results to be returned from a table with only 100 total rows.

Using this query profile feature to optimize queries can not only save time, but greatly reduce compute costs. 

Data Security

Snowflake offers built-in data masking techniques to protect sensitive data while maintaining data quality. You can apply masking rules to hide sensitive information, such as personally identifiable information (PII), without compromising the integrity or validity of the data.

Snowflake’s Secure Data Sharing feature enables controlled data sharing with external parties. With data sharing, you can ensure the security of your data by sharing curated datasets and controlling access to the data through defined privileges and access controls.
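
A minimal sketch of how a curated information mart table might be shared (the database, schema, table, share, and account names are all hypothetical):

    -- Create a share and expose a curated information mart table to it
    CREATE SHARE product_mart_share;
    GRANT USAGE  ON DATABASE vault_db                  TO SHARE product_mart_share;
    GRANT USAGE  ON SCHEMA   vault_db.information_mart TO SHARE product_mart_share;
    GRANT SELECT ON TABLE    vault_db.information_mart.dim_product TO SHARE product_mart_share;

    -- Make the share visible to a consumer account
    ALTER SHARE product_mart_share ADD ACCOUNTS = partner_account;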

For example, a masking policy could be created on the description column of our product satellite table, only allowing users with the ANALYST role to see the actual data.

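A minimal sketch of such a policy, reusing the illustrative SAT_PRODUCT table (the policy name and mask value are arbitrary):

    -- Only the ANALYST role sees the real description; everyone else sees a mask
    CREATE OR REPLACE MASKING POLICY description_mask AS (val STRING)
      RETURNS STRING ->
      CASE
        WHEN CURRENT_ROLE() = 'ANALYST' THEN val
        ELSE '*** MASKED ***'
      END;

    -- Attach the policy to the satellite's description column
    ALTER TABLE sat_product MODIFY COLUMN description
      SET MASKING POLICY description_mask;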

Closing

Building a scalable data vault solution is a critical step for organizations seeking to leverage their data assets effectively. Snowflake’s advanced features and cloud-native architecture provide a solid foundation for developing a robust and adaptable data vault. 

By harnessing these capabilities, organizations can unlock the full potential of their data assets, enabling informed decision-making, driving innovation, and staying ahead in a competitive landscape.

Looking for help leveraging Snowflake’s features to build a scalable data vault solution at your organization? phData can help! 

One of the best features of data vault is its built-in data lineage and traceability: because the architecture is insert-only, a simple query that filters on a key and orders by the insert timestamp gives you data lineage within the vault. Snowflake also offers native data lineage support through its ACCESS_HISTORY view.
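
For example, with the illustrative SAT_PRODUCT table from earlier, the full change history of a single product is just a filter and a sort (the key value shown is hypothetical):

    SELECT *
    FROM   sat_product
    WHERE  product_hash_key = MD5('PROD-001')
    ORDER  BY load_date;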

Snowflake has many more features that can be used to optimize your data vault, including using materialized views and performance monitoring tools to help users enhance the performance of their queries.
