The Snowflake Data Cloud has tools that allow you to define a hierarchical security strategy for warehouses, databases, tables, and other internal operations.
But when it comes to more granular security, such as row- and column-level requirements, you’ll need to do some extra work to build those controls out in Snowflake.
Side note: this topic is becoming increasingly important for Snowflake, as the platform announced granular security controls as a roadmap item at its November 2020 virtual data event.
In the meantime, we can explore what has worked for us across our customer base.
Defining Security Access in Snowflake
Snowflake role hierarchies already give us a high-level definition of which users can access the system and which users can access specific data. Using this information, you can define some simple restrictions that work well in targeted situations or in proofs of concept (POCs).
Simple Column Level Access & Masking
Column level access and masking requires the system to know what level of access a user has and then ensure the correct level of visibility is applied when the user attempts to access data through a query. The following graphic shows how two users may see two different levels of detail when querying the same dataset.
The simplest form of looking at access would be assessing the user’s role and comparing it to a list of values in a CASE statement.
SELECT
    id,
    CASE
        WHEN current_role() IN ('ALL') THEN ssn
        WHEN current_role() IN ('OBFUSCATED') THEN '***-**-' || substr(ssn, 8, 4)
        WHEN current_role() IN ('NONE') THEN '***-**-****'
        ELSE '***-**-****'
    END AS ssn,
    region_id
FROM customer_details;
This example is a simple version of column level security, but it clearly doesn’t scale, since the logic must be repeated on every column that needs protection. The CASE expression also runs at query time, which can add significant overhead for queries that are executed frequently.
Simple Row Level Access
Row level security is the process of restricting which rows of a table a user can see, based on that user’s required level of access. As an example of the end goal here, this picture demonstrates what we are trying to accomplish.
You can restrict rows fairly simply by defining a VIEW and then granting access via roles.
CREATE VIEW v_customer_details AS
SELECT id, first_name, last_name, ssn, address, region_id
FROM customer_details
WHERE region_id = 1;

GRANT SELECT ON VIEW v_customer_details TO ROLE REGION_ID_1_RO;
This allows any user in the Read Only role for region ID equal to 1 to have access to the underlying data.
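For the grant above to take effect, the role itself has to exist and be granted to users. A minimal sketch of that setup follows; the database name (db) and user name (jane_doe) are illustrative, not part of the original example:

```sql
-- Hypothetical role setup; db and jane_doe are placeholder names.
CREATE ROLE REGION_ID_1_RO;
GRANT USAGE ON DATABASE db TO ROLE REGION_ID_1_RO;
GRANT USAGE ON SCHEMA db.public TO ROLE REGION_ID_1_RO;
GRANT ROLE REGION_ID_1_RO TO USER jane_doe;
```

In practice these grants would be generated by your provisioning tooling rather than written by hand, which previews the scaling discussion below.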
The previous examples show that it is possible to define which roles have access to data, but the methods above do not scale well because of:
- the manual nature of the definitions, and
- the number of objects that would be created as a result.
Instead of maintaining a manual process, phData prefers to automate this process by introducing tools that allow us to dynamically define access to the underlying data.
Having the ability to scale how data is accessed (and what data is accessed) is critical to having a successful data strategy.
The majority of our customers handle authentication and authorization via Active Directory (AD) by defining AD groups that map to applications or data access. This may not fit your specific environment, and we have seen other variations work. In general, though, scaling access requires a centralized location for managing users and groups, which allows us to build the following elements:
- User Information
- Authorization Information
Build Access Definition Lookup Tables for Snowflake Security at Scale
Being able to plug into a system like AD to pull user and group mappings allows us to build lookup tables that define access. As a preview, our automation would look something like the following:
CREATE TABLE row_filters (
    role_name  text,
    table_name text,
    row_filter integer
);

INSERT INTO row_filters (role_name, table_name, row_filter)
VALUES ('REGION_ID_1_RO', 'customer_details', 1);

CREATE VIEW v_customer_details AS
SELECT id, first_name, last_name, ssn, address, region_id
FROM customer_details
WHERE region_id = (
    SELECT row_filter
    FROM row_filters
    WHERE table_name = 'customer_details'
      AND role_name = current_role()
);
At this point we have an access lookup table that we can reference in a view and dynamically add records into the filtering table. Behind the filtering table, we will also have the AD groups to Snowflake roles mapped, as defined by our security automation tool (phData Tram).
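One benefit of this pattern is that extending access becomes a data change rather than a DDL change. For example, onboarding a hypothetical REGION_ID_2_RO role requires only a new lookup row; the view definition stays untouched:

```sql
-- Grant region 2 visibility to a new role with a single insert.
-- The role name is illustrative.
INSERT INTO row_filters (role_name, table_name, row_filter)
VALUES ('REGION_ID_2_RO', 'customer_details', 2);
```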
Snowflake Secure Views
Now that we have a way to scale our access via lookup tables, we still have everything out in the open, which could potentially leak PII through the available metadata or cause confusion about which tables users have access to.
Using a similar approach, we can make use of a Snowflake secure view to ensure data is only accessed by authorized individuals. The purpose of this method is to expose only views to end users, while the secure data is stored in a “private” database location.
Following this guide provided by Snowflake, we move what we built above into a “private” database while the view is pushed to the public schema. We also switch to using CURRENT_ACCOUNT instead of the role or user to define access:
CREATE TABLE db.private.row_filters (
    snowflake_account text,
    table_name        text,
    row_filter        integer
);

INSERT INTO db.private.row_filters (snowflake_account, table_name, row_filter)
VALUES ('ACCT1', 'customer_details', 1);

CREATE OR REPLACE SECURE VIEW db.public.sensitive_customer_details AS
SELECT id, first_name, last_name, ssn, address, region_id
FROM db.private.customer_details
WHERE region_id = (
    SELECT row_filter
    FROM db.private.row_filters
    WHERE table_name = 'customer_details'
      AND snowflake_account = current_account()
);

GRANT SELECT ON VIEW db.public.sensitive_customer_details TO ROLE PUBLIC;
This introduces a bit more complexity, since the lookup table requires account-level information to be stored. But the approach exposes only the right amount of metadata to the end user.
Dynamic Column & Data Masking
With the simple column restrictions demonstrated above, we can define what different users see based on views and roles. In practice, this is very difficult to scale since it hides business logic and is not easily auditable.
Fortunately, Snowflake has addressed this with Column Data Masking. Once the permissions have been set to create masks, we can take the case statement that we have defined above and apply it to a global masking policy, which then can be applied to a table definition. This ensures that however the column is accessed, the proper security policies will be applied:
CREATE OR REPLACE MASKING POLICY ssn_mask AS (val string) RETURNS string ->
    CASE
        WHEN current_role() IN ('ALL') THEN val
        WHEN current_role() IN ('OBFUSCATED') THEN '***-**-' || substr(val, 8, 4)
        WHEN current_role() IN ('NONE') THEN '***-**-****'
        ELSE '***-**-****'
    END;

ALTER TABLE IF EXISTS customer_details MODIFY COLUMN ssn SET MASKING POLICY ssn_mask;
It is also possible to tokenize data for stronger security, so that sensitive values are never stored on disk in a readable format. However, this increases complexity, because the data must be transformed before it is loaded into Snowflake, and extra processing is required to recover the original values on read.
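One common way to implement this is a token vault: the raw value lives only in a restricted mapping table, and the main table stores an opaque token. The sketch below is illustrative only; the table and column names are hypothetical, and SHA2 with a salt stands in for whatever tokenization function your load pipeline actually uses:

```sql
-- Hypothetical token vault: raw SSNs live only in a restricted table.
CREATE TABLE db.private.ssn_vault (
    token text,
    ssn   text
);

-- During load, store a token in place of the raw value.
-- sha2() plus a salt is a stand-in for a real tokenization step.
INSERT INTO db.private.ssn_vault (token, ssn)
SELECT sha2(ssn || 'per-tenant-salt'), ssn
FROM staged_customer_details;

-- Authorized readers recover the original value by joining back
-- through the vault, to which access is tightly restricted.
SELECT c.id, v.ssn
FROM customer_details c
JOIN db.private.ssn_vault v
  ON c.ssn_token = v.token;
```

The security of this pattern rests entirely on who can read the vault table, so its grants deserve the same scrutiny as the masking policies above.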
When a masking policy requires an update or the underlying table requires a change, there is a process required to “unapply” the masking definition on the column. A high level process example is as follows:
- Revoke access to the table
- Unapply column masking
- Perform necessary table, view, or masking DDL/DML
- Reapply column masking
- Restore the access revoked in step 1
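The steps above can be sketched in SQL as follows. This is an illustrative sequence, not a prescribed script: the ANALYST role and the particular policy change in step 3 are assumptions for the example.

```sql
-- 1. Revoke access while the change is in flight (role name is illustrative).
REVOKE SELECT ON TABLE customer_details FROM ROLE ANALYST;

-- 2. Unapply the masking policy from the column.
ALTER TABLE customer_details MODIFY COLUMN ssn UNSET MASKING POLICY;

-- 3. Perform the required DDL/DML, e.g. redefine the policy itself.
CREATE OR REPLACE MASKING POLICY ssn_mask AS (val string) RETURNS string ->
    CASE WHEN current_role() IN ('ALL') THEN val ELSE '***-**-****' END;

-- 4. Reapply the masking policy.
ALTER TABLE customer_details MODIFY COLUMN ssn SET MASKING POLICY ssn_mask;

-- 5. Restore the access revoked in step 1.
GRANT SELECT ON TABLE customer_details TO ROLE ANALYST;
```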
A Caveat on Snowflake Security
In practice, we have seen secure views impact performance, because some query optimizations are disabled for complex views. By design, a secure view gives up certain performance shortcuts so that a malicious party cannot infer the presence of data, and that protection can be very costly. We believe it is worthwhile to explore each of the options presented above to determine whether the secure view fits your needs, or whether building the access controls yourself achieves the scale you need while keeping grants correct.