The world will have created and stored 200 Zettabytes of data by 2025, which is the equivalent of every person on the planet carrying around 400 iPhones. Half of that will be stored in cloud environments.
As more and more data is created by systems, devices, and transactions, the complex challenges surrounding that data haven’t magically disappeared. That’s why data governance is more critical than ever.
In this first article in our series on data governance within Snowflake, we focus on building foundations.
Enterprises have historically focused on data performance, whether that's the frequency of updates or query execution time. However, getting data as fast as possible isn't exactly a North Star metric when it comes to enterprise data governance. You likely have many other concerns related to your data's:

- Availability
- Usability
- Integrity
- Security
While this might seem self-explanatory, if your data is performant but not available, usable, validated, or secure, then the data could be adding risk to your enterprise. That risk can take the form of eroded trust in the quality and consistency of your data, or errors in your data being made public, such as misreported revenue. Conversely, if you sacrifice security in order to provide availability to your users or customers, you are also putting your enterprise at risk.
Within this article, we cover each of these elements of data governance in depth — along with the controls that the Snowflake Data Cloud provides.
What Is Data Governance?
In data engineering, the term data governance can take on a lot of meanings and is heavily overloaded.
Data governance describes how enterprises put in place a collection of processes and practices that effectively manage data, data assets, and metadata within the platform. This includes controls for data availability, usability, integrity, and security.
While data governance in its totality is much broader and more complex, this post covers three of its many layers:
For this post, we are focusing on the yellow boxes in this process.
Data governance is iterative, not static.
You must regularly re-evaluate and update your controls as your enterprise changes. Oftentimes, data stewards review existing practices, meet with business stakeholders to determine the controls that are necessary, and audit the controls that are being enforced within the enterprise.
Why Does It Matter?
Data governance controls ensure that data is consistent and dependable throughout its lifecycle. This includes everything from initial creation and ingestion from a source through to downstream use cases such as machine learning model results.
By enforcing specific standards for data governance, you ensure that quality data is being used to drive key business decisions, inform customers, and empower users.
Without data governance processes and practices, you run the risk of data inconsistencies. For example, if you're pulling addresses from multiple sources that don't follow a particular standard, you will have to determine how to resolve those addresses. Are they duplicates? Are they accurate? Without effective data governance in place, answering these questions takes time and costs your enterprise money.
These same concerns apply to all kinds of data, from time zones on dates stored in your database to missing data points between systems.
How Do I Build Data Controls in Snowflake?
While many data governance controls require manual auditing and validation, Snowflake provides functionality to reduce the number of controls required. In some instances, Snowflake can also automate data governance controls.
Let’s take a look at each of our previously mentioned data governance concerns and how they’re managed within Snowflake.
Availability

It’s critical that your data be available when your business needs it.
Snowflake is available on AWS, Azure, and Google Cloud Platform, and it inherits many of the high availability features that these cloud providers offer.
Snowflake builds on top of the availability of these cloud provider offerings with:
- Fault tolerance of data nodes
- Automatic distribution of data across multiple availability zones
- Time Travel for accessing deleted data and performing data backups
- Separation of compute and storage resources
- Data replication and failover (cross-region and cross-cloud)
These additional features are critical for maintaining data availability.
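As a sketch of how two of these features surface in SQL (database, table, and account names here are hypothetical):

```sql
-- Time Travel: query a table as it existed an hour ago, or restore a
-- table that was accidentally dropped.
SELECT * FROM orders AT (OFFSET => -3600);
UNDROP TABLE orders;

-- Replication: enable replication of a database to a second account,
-- then create and refresh the replica from that secondary account.
ALTER DATABASE sales_db ENABLE REPLICATION TO ACCOUNTS my_org.dr_account;

-- Run these in the secondary account:
CREATE DATABASE sales_db AS REPLICA OF my_org.primary_account.sales_db;
ALTER DATABASE sales_db REFRESH;
```

Time Travel retention and replication entitlements depend on your Snowflake edition, so treat this as an outline of the commands rather than a drop-in script.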
For example, suppose an availability zone within AWS goes down temporarily. Since both your data and your compute resources are distributed across multiple availability zones, Snowflake automatically handles the failure and reroutes your requests away from the unavailable zone.
But software and hardware are never perfect. Inevitably, there’s going to be an outage that impacts your enterprise. In the event that Snowflake itself has a service outage, Snowflake will issue account credits (depending on your level of support). While this doesn’t change the impact on your enterprise, it does help offset any costs incurred as a result of the outage.
Another key aspect of availability is having access to data when you need it. In many traditional systems, your data is spread across multiple databases or multiple clusters. In order to aggregate that data together, you need to create a data pipeline to move data from one system to another.
This creates opportunities for data to become unavailable if that pipeline goes down. Snowflake, on the other hand, is built on a centralized data model, which removes the need for pipelines that move data from one location to another and further increases the availability of your data throughout your enterprise.
Usability

It’s important to define processes and practices that ensure your data is usable, documented, labeled, and easy for consumers to find. Data ingestion, in particular, must implement controls that produce consistent, usable data: services provide data in many different formats, and your tooling needs to be able to enforce the controls your enterprise has put in place.
What Snowflake Tooling Does
Snowflake provides several data ingestion methods (such as stages, Snowpipe, and the Kafka connector), and each can ensure that incoming data conforms to particular standards. When defining an ingestion process, you define a schema that the data is expected to adhere to; this schema enforces your data types and their precision.
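As a minimal sketch of that schema enforcement (table, stage, and pipe names are hypothetical), a landing table declares types and precision, and a Snowpipe definition continuously loads staged files into it:

```sql
-- The target schema enforces types and precision on ingestion.
CREATE TABLE raw_orders (
    order_id   NUMBER(38, 0) NOT NULL,
    amount     NUMBER(10, 2),
    ordered_at TIMESTAMP_NTZ
);

-- Snowpipe continuously copies newly staged files into the table;
-- rows that don't conform to the schema are rejected per ON_ERROR.
CREATE PIPE orders_pipe
  AUTO_INGEST = TRUE
  AS COPY INTO raw_orders
     FROM @orders_stage
     FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
     ON_ERROR = 'SKIP_FILE';
```

The `ON_ERROR` behavior shown here is one of several options; choose the one that matches how strictly your governance controls should reject non-conforming data.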
What Snowflake’s Account Access Does
Snowflake’s centralized data model also promotes usability. While many enterprises may only have one Snowflake account, Snowflake allows for multiple accounts to segregate out business costs and organize data by business unit. With Snowflake’s data sharing functionality, a single data point can be referenced within multiple Snowflake accounts.
Snowflake also gives you the ability to share data outside of your Snowflake instance with reader accounts. This allows consumers to have read-only access to specific data without needing to unload or copy your data elsewhere.
The sharing approach accelerates the value that your data provides the enterprise and increases the usability of your data without increasing risk. This reduces the need to apply further data governance practices for your data sharing and unload processes, ultimately increasing the velocity of your data.
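A minimal sketch of the sharing flow described above (share, database, and account names are hypothetical):

```sql
-- Create a share and grant read access to the objects it exposes.
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;

-- Make the share visible to an existing consumer account. Consumers
-- without their own Snowflake account can be given a reader account
-- instead, which provides read-only access on your infrastructure.
ALTER SHARE sales_share ADD ACCOUNTS = partner_org.partner_account;
```

Because the consumer queries the shared objects directly, no copy or unload process (and no extra governance controls around one) is needed.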
Integrity

Once data has been ingested and stored, it’s critical that your data be resilient, accurate, complete, and consistent.

You should put controls in place to ensure that your enterprise can rely on your data. There are two main types of data integrity that an enterprise needs to be concerned with in Snowflake: physical and logical.
Let’s take a look at each of them.
Physical Integrity

It’s critical that data can be read and stored consistently, without data loss.
In the event of the loss of an availability zone or data center, the data should retain its integrity both in a replicated environment and when that availability zone or data center recovers. As previously mentioned, Snowflake automatically replicates your data across multiple availability zones to prevent data loss and maintain physical integrity.
Logical Integrity

Logical integrity refers to ensuring that your data is logically accurate and complete with regard to your enterprise and its individual domains.
This requires building and evolving controls within your data ingestion and storage to ensure that you have up-to-date data and that the data continues to conform to the expected types, values, and constraints.
In order to validate data completeness or accuracy, you will need to create controls and processes separate from Snowflake. These controls will need to compare your ingested data against the source system that it was ingested from, and will require resources outside of Snowflake to perform this comparison. However, since Snowflake separates out your read and write compute resources, reading data out for validation doesn’t impact data ingestion performance.
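A minimal sketch of such a control, assuming the source extract has already been loaded into an audit table (`raw_orders` and `source_orders_audit` are hypothetical names):

```sql
-- Compare row counts and a simple checksum between the ingested table
-- and the audit extract of the source system.
SELECT
    (SELECT COUNT(*) FROM raw_orders)          AS target_rows,
    (SELECT COUNT(*) FROM source_orders_audit) AS source_rows,
    (SELECT SUM(HASH(order_id, amount)) FROM raw_orders)
  = (SELECT SUM(HASH(order_id, amount)) FROM source_orders_audit)
      AS checksums_match;
```

Running this on a read-only warehouse keeps the validation workload isolated from the warehouses doing ingestion.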
Security

Security should always be front of mind in any digital system. Users should be assigned the minimum set of permissions that allows them to perform their job or task. Your data should also be encrypted in transit and at rest, so that intercepted traffic or data leaks do not hand attackers your data.
Data security takes shape within Snowflake in a few different ways.
When a query is submitted to Snowflake, it will validate that the current session’s role is allowed to perform the action in the issued query. Snowflake provides many layers of access control, and each of these has a particular use case. Let’s take a look at each of them.
Role-Based Access Control (RBAC)
Snowflake allows for you to create a role hierarchy, where roles can inherit from each other. Each role contains a set of permission grants that allow an assigned user to perform actions against your Snowflake resources. These actions might include reading from a database, creating integrations, or assigning permissions to other roles. As a basic example, your role hierarchy might look like this:
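The hierarchy described above can be sketched in SQL (role, user, and object names are hypothetical):

```sql
-- A custom role with the grants it needs to do its job.
CREATE ROLE reporting_analyst;
GRANT USAGE ON DATABASE sales_db TO ROLE reporting_analyst;
GRANT USAGE ON SCHEMA sales_db.public TO ROLE reporting_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE reporting_analyst;

-- Roll the custom role up to SYSADMIN, then grant it to a user.
GRANT ROLE reporting_analyst TO ROLE sysadmin;
GRANT ROLE reporting_analyst TO USER jane_doe;
```

Because SYSADMIN inherits the role, an administrator can manage the objects the analyst creates without needing direct grants.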
Discretionary Access Control (DAC)
Within Snowflake, each resource has a defined owner. The owner of a resource is a role defined at the time of creation. The ownership of a resource can also be updated by either the owning role, a parent of that role, or the account administrator. It is generally recommended that all custom roles (that don’t involve user management) ultimately roll up to the SYSADMIN role within Snowflake, as this role is responsible for owning all custom resources.
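As a sketch of transferring ownership in line with that recommendation (table and role names are hypothetical):

```sql
-- Move ownership of a table to SYSADMIN while preserving the
-- privileges that have already been granted on it.
GRANT OWNERSHIP ON TABLE sales_db.public.orders
  TO ROLE sysadmin
  COPY CURRENT GRANTS;
```

Without `COPY CURRENT GRANTS`, outstanding privileges on the object are revoked as part of the ownership transfer, which can silently break consumers.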
RBAC and DAC are combined into one security model within Snowflake, allowing flexibility depending on your use case: every resource has an owner (a role), and roles are granted to users, allowing them to perform actions against resources.
Encryption

Snowflake supports end-to-end encryption for all ingestion methods and encrypts all data it ingests by default.
For data outside of Snowflake that you wish to ingest, it’s recommended that you encrypt the data client-side before starting the ingestion process. If the data isn’t encrypted prior to ingestion, Snowflake encrypts it before storing it. Client-side encryption is the best practice because it ensures your data is secure end to end: it’s only readable by those who have the correct permissions granted to them via roles.
In the case of client-side encryption, the process works as follows:
- The Snowflake customer creates a secret master key, which remains with the customer.
- The client generates a random encryption key and encrypts the file before it is ingested into Snowflake. The random encryption key is then encrypted with the customer’s master key.
- The encrypted file and the encrypted random key are uploaded to the cloud storage service. The encrypted random key is stored with the file’s metadata.
Authorization and Authentication
In order to control access to your data, it’s recommended to integrate Snowflake with your enterprise’s authorization and authentication strategy.
We also recommend that you use federated authentication and the System for Cross-domain Identity Management (SCIM) specification in order to sync your users and their roles between your identity provider and Snowflake. This allows you to automate the provisioning and de-provisioning of users and creates a single source of truth for managing user access.
For example, if you are using Okta as an identity provider, you assign groups to users. When setting up the SCIM integration between Okta and Snowflake, you map Okta groups to Snowflake roles. This way, you manage access inside your existing identity provider, simplifying the data access controls needed for your data governance strategy.
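A sketch of the Snowflake side of that Okta SCIM setup (the `okta_provisioner` role name follows Snowflake's documented convention; the integration name is hypothetical):

```sql
-- A role the SCIM integration runs as, with the privileges it needs
-- to provision users and roles.
CREATE ROLE IF NOT EXISTS okta_provisioner;
GRANT CREATE USER ON ACCOUNT TO ROLE okta_provisioner;
GRANT CREATE ROLE ON ACCOUNT TO ROLE okta_provisioner;

-- The SCIM security integration that Okta pushes users and groups to.
CREATE SECURITY INTEGRATION okta_scim
  TYPE = SCIM
  SCIM_CLIENT = 'OKTA'
  RUN_AS_ROLE = 'OKTA_PROVISIONER';
```

The matching configuration on the Okta side (the SCIM endpoint URL and bearer token) is generated from this integration.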
Once this setup is complete, it’s important to assign the correct grants or permissions to your roles within Snowflake. You will want to restrict the permissions in these roles to be the minimum permissions necessary to perform the role’s designated tasks.
Row Access Policies
These policies allow you to control which rows in a table are returned based on the querying user’s current role. A single policy can be reused across multiple tables or views, making row-level filtering a reusable building block of your data governance strategy.
A good example of this would be a table that contains all employees and their salaries. You wouldn’t want everybody to be able to see all the records in that table, but you do want individuals to be able to see their own salary and managers to be able to see their direct reports’ salaries. You would apply a row-level filter that returns the additional rows only when the user has the manager role.
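A minimal sketch of that salary example (table, column, and role names are hypothetical, and assume the table stores each row's employee and manager as Snowflake user names):

```sql
-- Everyone sees their own row; users in the MANAGER role also see
-- rows for their direct reports.
CREATE ROW ACCESS POLICY salary_rows
  AS (employee_user STRING, manager_user STRING) RETURNS BOOLEAN ->
    employee_user = CURRENT_USER()
    OR (CURRENT_ROLE() = 'MANAGER' AND manager_user = CURRENT_USER());

ALTER TABLE employees
  ADD ROW ACCESS POLICY salary_rows ON (employee_user, manager_user);
```

The policy body is evaluated per row at query time, so ordinary `SELECT`s against the table are filtered automatically with no change to the queries themselves.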
Columnar Access Policies
Also known as dynamic data masking, columnar access policies apply when you want to conditionally include a column’s data in query results. If the user’s role doesn’t have access to the column, the result has a mask applied that hides the sensitive information. A single policy can be created once and applied to multiple columns across your Snowflake schema.
Reusing our row access policy example of salaries, perhaps you want people to be able to see the full list of people within the company, but that table has sensitive information such as salary, address, phone number, etc. Instead of filtering at the row level, we could simply mask all the personally identifiable information (PII) and salary data using dynamic masking. If the current role doesn’t have access to see that information, you can replace the data value with a mask like “********” or whatever fits your business preferences and requirements.
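As a sketch, masking a PII column such as a phone number (table, column, and role names are hypothetical; a masking policy's argument type must match the column it is applied to):

```sql
-- HR administrators see the real value; everyone else sees a mask.
CREATE MASKING POLICY mask_phone AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('HR_ADMIN') THEN val
    ELSE '********'
  END;

ALTER TABLE employees
  MODIFY COLUMN phone_number SET MASKING POLICY mask_phone;
```

The same policy can be attached to any other `STRING` PII column, so one definition covers addresses, phone numbers, and similar fields consistently.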
Auditing Access History
When applying various methods for protecting sensitive data, it’s also important to have an audit process that verifies your data is being accessed according to the governance practices you’ve defined. This is where Snowflake’s access history functionality is critical: it records the read operations of each query and identifies which columns the query accessed, directly or indirectly. Data stewards should use these records to demonstrate regulatory and data governance compliance.
Access history also gives you visibility into what data is being accessed and how frequently, and lets you validate how your data is used. It’s common for data stewards to need to identify who is using which data, and how often, to drive decisions such as archiving or deleting unused datasets. The same records can inform decisions about high-volume datasets that may require more attention or sensitivity.
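A sketch of such an audit query against the `SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY` view (the table name being audited is hypothetical):

```sql
-- Who read the EMPLOYEES table in the last 30 days, and how often?
SELECT ah.user_name, COUNT(*) AS queries
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) AS obj
WHERE obj.value:"objectName"::STRING = 'SALES_DB.PUBLIC.EMPLOYEES'
  AND ah.query_start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY ah.user_name
ORDER BY queries DESC;
```

Note that `ACCOUNT_USAGE` views have some ingestion latency, so very recent queries may not appear immediately.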
Putting It All Together
Your data should always have controls in place to make sure that it’s providing value to your business, only accessible by the allowed entities, secure from data leaks and hackers, and highly available.
If sensitive data gets into the wrong hands, or your data becomes inaccessible, there’s an immediate impact on your business.
Snowflake has many built-in controls to alleviate the custom processes and engineering costs typically associated with data warehouses. When implementing a data warehouse or pipeline, be sure to evaluate the necessary controls for data governance, and regularly re-evaluate and update these controls.
Next up in the Series
Common Data Governance Questions
Snowflake provides multiple levels of integration for user management and user access, from native users and roles to federated authentication and SCIM provisioning.
Snowflake also gives you the ability to replicate your data automatically between availability zones. If you want an extra layer of availability and integrity with your data, you have the ability to create a data share across multiple Snowflake accounts on different cloud providers. Snowflake also keeps your data secure automatically by encrypting all data at rest and providing mechanisms to encrypt data in transit.
If you’re using a separate identity provider, you will need to set up federated authentication so your users can authenticate with Snowflake. To automate provisioning and de-provisioning of users, or the roles a user is granted in Snowflake, use SCIM and map your identity provider’s groups to Snowflake roles. This gives you a single source of truth for user management and automates syncing user groups to Snowflake permissions.