In our previous post in this series, we covered how the Snowflake Data Cloud allows for data availability, usability, integrity, and security. These controls and practices are necessary to establish trust and manage risk with your data.
However, data governance encompasses much more than those foundational elements. While these controls validate that data is stable and reliable, you also need a defined information architecture, a process for providing access, and a group responsible for managing these rules and patterns.
These controls are complex and evolve regularly within an organization. In this follow-up article, we cover those three more complex elements of data governance within Snowflake:
- Information Architecture
- Provisioning and Rights Management
- Data Stewardship
Information Architecture with Snowflake
Let’s start with the elephant in the room: ‘information architecture’ is a heavily overloaded term, with a variety of meanings depending on the context and industry. Within the context of Snowflake, we will be focusing on how data is:
- Organized and structured
- Labeled and discovered
The goal of each of these elements is data availability and reliability for the end consumers of the data.
With a strong information architecture practice and design, you can put standards in place that ensure commonality and consistency between different areas of an organization. This is critical because it helps users find not just data, but the correct data.
Let’s take a look at how this data governance aspect takes shape within Snowflake.
How Does Snowflake Impact Your Information Architecture?
As previously discussed, your data needs to be set up in a way that ensures your users can find data intuitively — meaning related data should be grouped together. Within Snowflake, there are a few different processes that enable predictable and repeatable data storage patterns.
Data Organization and Structure
Imagine that you’re in school. You likely have a few different classes with different topics, and you want to take notes on the material you’re learning. You want your notes to be grouped together by class and section, and you need these notes to be easily searchable and accessible for future studying.
This is the same with your data in Snowflake. Within your information architecture, you need to define how you’re going to group data together.
Snowflake gives us three tiers for organizing data: databases, schemas, and tables.
Each tier lives within the previous tier. When setting up these tiers within your organization, it could be as simple as defining all financial data to go in your financial database, setting up schemas for different areas within your financials, and finally creating tables to hold the specific data.
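As a rough illustration of the three tiers, the sketch below generates the Snowflake DDL for a small layout. The database, schema, and table names (and the single placeholder column) are assumptions made up for this example, not a recommended design.

```python
# Hypothetical layout: one database whose schemas each hold tables.
# Every object name here is illustrative only.
LAYOUT = {
    "FINANCE": {
        "ACCOUNTS_PAYABLE": ["INVOICES", "VENDORS"],
        "PAYROLL": ["PAYCHECKS"],
    }
}


def hierarchy_ddl(layout):
    """Return CREATE statements ordered database -> schema -> table."""
    statements = []
    for database, schemas in layout.items():
        statements.append(f"CREATE DATABASE IF NOT EXISTS {database};")
        for schema, tables in schemas.items():
            statements.append(f"CREATE SCHEMA IF NOT EXISTS {database}.{schema};")
            for table in tables:
                # Placeholder column; real tables would define their own columns.
                statements.append(
                    f"CREATE TABLE IF NOT EXISTS {database}.{schema}.{table} (ID NUMBER);"
                )
    return statements


if __name__ == "__main__":
    for statement in hierarchy_ddl(LAYOUT):
        print(statement)
```

Keeping the layout in one data structure makes the grouping easy to review before anything is created.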
The way your data is grouped together and organized needs to be regularly evaluated by data stewards (more on this later), and updated accordingly. This ensures that you’re setting your users or consumers up for success and that the correct data is being accessed.
Data Labeling and Discovery
With large amounts of data, finding the right data or structure can be challenging. Imagine you’re trying to find a database table, but you’re unsure which database or schema this table is owned by. For example, you might have financial data in another system that you need to join to.
While Snowflake doesn’t natively provide a tagging system (coming soon), it does provide an Information Schema in each database. This schema automatically holds metadata about tables, schemas, functions, stages, and much more. This allows users to quickly see all the resources that they have access to within a particular database.
This means that it’s critical to name databases, schemas, and tables in a context that a consumer would understand.
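To make the discovery step concrete, here is a minimal sketch that builds a query against a database’s Information Schema to find tables by a fragment of their name. The function name and the example database are assumptions for illustration; the `INFORMATION_SCHEMA.TABLES` view and its `TABLE_SCHEMA` and `TABLE_NAME` columns are standard in Snowflake.

```python
def find_tables_query(database, name_fragment):
    """Build a query listing tables in `database` whose name contains a fragment."""
    # Keep only letters, digits, and underscores so the inputs cannot break
    # out of the identifier or the quoted pattern. Note that "_" also acts as
    # a single-character wildcard in LIKE/ILIKE, so matching is slightly loose.
    safe_db = "".join(ch for ch in database if ch.isalnum() or ch == "_")
    safe_fragment = "".join(ch for ch in name_fragment if ch.isalnum() or ch == "_")
    return (
        "SELECT TABLE_SCHEMA, TABLE_NAME\n"
        f"FROM {safe_db}.INFORMATION_SCHEMA.TABLES\n"
        f"WHERE TABLE_NAME ILIKE '%{safe_fragment}%'\n"
        "ORDER BY TABLE_SCHEMA, TABLE_NAME;"
    )


if __name__ == "__main__":
    # Hypothetical database name, used only for this example.
    print(find_tables_query("FINANCE", "invoice"))
```

A search like this only helps if object names are meaningful, which is why consistent naming matters so much.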
Provisioning and Rights Management
Security should always be the primary concern in any system. We frequently hear about data breaches in the news where customer data wasn’t properly secured and was accessed by unauthorized parties.
In order to protect your users, their data, and your enterprise’s data, it’s critical to have standards in place that provide the minimum level of access for consumers of your data to perform their function. The orchestration of access should also be standardized and easily repeatable to reduce human error.
How Do I Manage Access to Snowflake?
There are three main ways to orchestrate user creation and rights management within Snowflake: System for Cross-domain Identity Management (SCIM), Tram from phData, and manual administration. Customers often choose to use SCIM for user provisioning and then Tram for provisioning of access rights.
While manual administration of your Snowflake instance is viable in non-production installations with fake data, we strongly recommend against administering a production instance manually.
If using SCIM, you will have to integrate with an existing identity provider such as Okta or Azure Active Directory. This requires configuration on both the identity provider and your Snowflake instance. Once configured, this will sync user creation and permissions in your identity provider with Snowflake based on the configuration.
You will need to map groups or roles that exist in your identity provider to their equivalent in Snowflake.
Tram was built by phData to facilitate a pattern familiar to engineers. Resources are organized into templates, and these templates have members. For example, you might have a user_workspaces template that defines a warehouse, database, schema, table, and stage for users to be able to query and store data. This template can be applied uniformly to all the members of the template, ensuring that the same access control pattern is used for each user.
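The template-and-members idea can be sketched in a few lines of Python. This is not Tram’s actual template format or API. The `WorkspaceTemplate` class, the naming scheme, and the choice of objects are assumptions that only show how one pattern expands uniformly across members.

```python
from dataclasses import dataclass, field


@dataclass
class WorkspaceTemplate:
    """Hypothetical stand-in for a template; not Tram's real file format."""
    name: str
    members: list = field(default_factory=list)

    def statements_for(self, member):
        """SQL giving one member the same workspace pattern as every other member."""
        workspace = f"{member.upper()}_WORKSPACE"
        role = f"{workspace}_ROLE"
        return [
            f"CREATE ROLE IF NOT EXISTS {role};",
            f"CREATE WAREHOUSE IF NOT EXISTS {workspace}_WH;",
            f"CREATE DATABASE IF NOT EXISTS {workspace};",
            f"GRANT USAGE ON WAREHOUSE {workspace}_WH TO ROLE {role};",
            f"GRANT USAGE ON DATABASE {workspace} TO ROLE {role};",
            f"GRANT ROLE {role} TO USER {member.upper()};",
        ]

    def render(self):
        """Expand the template for every member, in member order."""
        statements = []
        for member in self.members:
            statements.extend(self.statements_for(member))
        return statements


if __name__ == "__main__":
    template = WorkspaceTemplate("user_workspaces", ["alice", "bob"])
    print("\n".join(template.render()))
```

Because every member goes through the same `statements_for` method, two users can only differ in their names, never in the shape of their access.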
Tram has two deployment modes. The first connects your ITSM tool (such as ServiceNow, Remedy, or JIRA) to your directory system, such as Active Directory. In this mode, you can build a workflow with approvals that automatically provisions objects in the directory, which Tram then uses to automatically provision objects in Snowflake.
Tram also has a GitOps deployment mode, which gives you the ability to preview changes to your environment before they’re applied to your Snowflake instance. When a user or admin wishes to make a change to Snowflake, they update the templates directory, add any appropriate members to the template, and submit a pull request to the customer’s main repository. Administrators then have the ability to perform a “dry run” and see what SQL would be run against their Snowflake instance once merged. Tram also automates the application of the changes to your environment based on the new SQL statements.
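The “dry run” concept is easy to illustrate in isolation. The sketch below is an assumption about the general pattern, not a description of how Tram computes its plan: it compares the statements a change would require with those already applied, and only executes the difference when explicitly told to.

```python
def plan_changes(desired, applied):
    """Return the desired statements that have not been applied yet, in order."""
    already = set(applied)
    return [statement for statement in desired if statement not in already]


def apply_changes(desired, applied, execute, dry_run=True):
    """Preview pending statements; run them only when dry_run is False."""
    pending = plan_changes(desired, applied)
    if not dry_run:
        for statement in pending:
            # `execute` is any callable that runs one SQL statement.
            execute(statement)
    return pending


if __name__ == "__main__":
    desired = [
        "CREATE ROLE IF NOT EXISTS ANALYST;",
        "CREATE DATABASE IF NOT EXISTS SALES;",
    ]
    applied = ["CREATE ROLE IF NOT EXISTS ANALYST;"]
    # Dry run: print what would be executed without touching anything.
    for statement in apply_changes(desired, applied, execute=print):
        print("would run:", statement)
```

Defaulting `dry_run` to `True` means a reviewer always sees the pending SQL before anything reaches the instance.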
Since the changes to your environment are either stored in your ITSM tool or version control, it’s easy to manage who has access to apply changes to your Snowflake instance, audit what changes were made when, and audit who applied the changes.
Data Stewardship
The data steward is one of the most critical roles within an organization.
This role is responsible for oversight of controls, quality assurance, and definitions around data management and assets. The primary concern of a data steward is that data is functional, available, accurate, and accessible within an organization.
In any data warehouse solution, the amount of data (and the complexity of its relationships) grows quickly. It’s likely that your organization is aggregating data from multiple vendors, departments, and internal systems. The data steward’s role and responsibility is to ensure that this data is formatted, stored, and processed in a way that allows the data to immediately provide value to the organization.
However, validating these controls against your data is very complex. How do you ensure that access is controlled according to your data steward’s rules? How do you ensure that your schemas are defined and grouped appropriately?
We see our customers using two approaches: enforcing these rules through Tram, and auditing access with purpose-built tooling.
In addition to access rights, Tram can enforce your information architecture as well. As discussed above, information architecture is critical to keeping your data organized and to succeeding on your data platform.
But how do you make sure the information architecture is followed?
Delegating the provisioning of databases, schemas, and warehouses to Tram makes this easier by using the customizable templates to provision users.
Instead of having users request databases, schemas, and warehouses as one-off requests, they will submit requests for new workspaces which include these objects. Once approved, the objects will be automatically provisioned by Tram using standard Snowflake SQL.
Within a data warehouse, administrators regularly find themselves asking questions like:
- Who has access to this database?
- What does this user have access to?
- What database tables are within this schema?
- What permissions are granted for this user on this resource?
phData has built a tool to answer these questions and much more. We quickly identified the need for a data governance tool and built it with data stewardship in mind.
This empowers data stewards and administrators to quickly identify new data controls that are required and validate existing ones. This greatly simplifies and reduces the work necessary for data stewards to be effective.
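For a sense of what answering these questions involves, the administrator questions listed above map onto Snowflake’s standard `SHOW` commands. The sketch below is not the phData tool. The function and the example object names are assumptions for illustration only.

```python
def audit_queries(database, schema, user, table):
    """Map each administrator question to a Snowflake SHOW command."""
    return {
        "Who has access to this database?":
            f"SHOW GRANTS ON DATABASE {database};",
        # Lists the roles granted to the user; follow up with
        # SHOW GRANTS TO ROLE <role> to see each role's privileges.
        "What does this user have access to?":
            f"SHOW GRANTS TO USER {user};",
        "What database tables are within this schema?":
            f"SHOW TABLES IN SCHEMA {database}.{schema};",
        # Privileges are granted to roles, so this shows which roles
        # hold which privileges on the table.
        "What permissions are granted on this resource?":
            f"SHOW GRANTS ON TABLE {database}.{schema}.{table};",
    }


if __name__ == "__main__":
    # Hypothetical object names, used only for this example.
    for question, sql in audit_queries("FINANCE", "PAYROLL", "ALICE", "PAYCHECKS").items():
        print(f"{question}\n  {sql}")
```

Running these by hand works for a single object, but it scales poorly across many databases and users, which is the gap dedicated tooling fills.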
Putting It All Together
Data governance is a complex and growing concern in organizations today.
As we continue to create unimaginable amounts of data worldwide, it’s becoming increasingly difficult to categorize, visualize, and define controls and access to that data. By creating an information architecture around your data, and having data stewards responsible for maintaining and enforcing rules to govern your data, you ensure that data is working for you rather than against you.