December 14, 2020

How to Identify PII in Text Fields and Redact It

By Travis Hegner

As our customers move data into the cloud, they commonly face the challenge of keeping that data protected. Most applications are designed to store sensitive data in designated fields. Despite interface designers’ best efforts, however, sensitive data inevitably ends up in unintended fields, often as free-form text in a generalized “notes” field. So, we need a solution to help us identify PII in text fields.

Today, we are going to explore how to redact sensitive data from free-form text stored in the Snowflake Data Cloud using Dataiku DSS and AWS Comprehend Medical. Understandably, it’s quite difficult to find public datasets that contain real-life PII. Fortunately and unfortunately, we discovered the Enron email dataset, which contains plenty of PII to give us a realistic glimpse into the performance of such a process.

Connecting DSS to Snowflake

Dataiku DSS comes with a very large array of built-in connectors to external data platforms. The Snowflake connector is nice in that it allows you to pull data into DSS, do your work, and write data back out to Snowflake. Some operations can even be pushed down directly to Snowflake and don’t require data copies at all. With that said, the Snowflake connector could use some improvements around user isolation.

Current Connection Methods

A connection in DSS is a global concept. The Snowflake connection requires parameters for “Database” and “Role”, which are oftentimes user-specific. Further, since only “User” and “Password” can be parameterized per user, it is not possible for users to authenticate with key pairs, since those parameters would be shared across the connection. This poses a problem for Snowflake users with single sign-on configured. As of this writing, there are two possible workarounds for this issue:

  1. Use a shared workspace and service account to connect to Snowflake from DSS, so that no one user can authenticate as another user to Snowflake. You can further expand isolation by creating a service account per DSS project.
  2. Create a separate connection for each user who needs access to Snowflake, and limit that connection to “Selected groups” in the security settings, where the selected group only contains the authenticating user.

The latter option adds a larger administrative burden, so we recommend the former until the Snowflake connector allows arbitrary per-user parameters. Even with the ability to specify the private key on a per-user basis, that private key must be stored on the filesystem in a location accessible to the user that DSS runs as. It is possible to encrypt the private key and store its passphrase as a secret in DSS (once that parameter is also made per-user), so there is a reasonable degree of security there. Be warned, however, that a “secret” in DSS is stored as plain text in a JSON file and is readable by the Dataiku user.

Create the Snowflake Connection

To create the Snowflake connection, we need to complete the following steps:

  1. Navigate to the “Administration” page
  2. Click the “Connections” tab
  3. Click the “New Connection” drop down
  4. Select the “Snowflake” option under “SQL databases”

Creating the Snowflake connection

Next, fill out the parameters for your Snowflake account. If you are using private key authentication, you can provide the path and passphrase of your private key as “Advanced JDBC properties”. Don’t forget to provide a Schema for DSS to write to under the “Naming Rules For New Datasets” section. This Schema must exist in Snowflake, so be sure to create it before attempting to write a managed dataset.
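
As an illustration, and assuming a reasonably recent Snowflake JDBC driver, key-pair authentication is typically configured with properties along these lines (verify the exact property names against the driver documentation for your version):

private_key_file     = /path/to/rsa_key.p8
private_key_file_pwd = <passphrase for the encrypted key>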

Importing a Dataset

With a successful connection established, we can create a new project and import a dataset into DSS. Thankfully, the data is not actually copied and stored within DSS (despite the process being called “import”); it is simply read directly from Snowflake at the time it’s needed. From our new project’s main page, click on the “Import Your First Dataset” button.

Import your first data set to start identifying PII in text

This will bring you to a page where you can select the source of the dataset; choose “Snowflake” in the SQL section.

Set Snowflake as the source of your dataset

Finally, you’ll land on a page where you can select from the different Snowflake connections to which you have access. If necessary, select the connection we created earlier. Here, you have a few options: if you know the Schema and table name, you can populate them directly; you can select “Get Table List”, which will query Snowflake for all available tables; or you can choose “SQL query” mode and write a SQL statement whose results will represent the imported dataset. After you’ve tested the table, previewed the data, and validated the derived column set, give the dataset a name in the upper right corner and click the “Create” button.

Identifying PII Data in Text Fields

With the Amazon Comprehend Medical plugin installed, a recipe is available out of the box that lets you select a raw-text column from an existing dataset and creates a new dataset, appending several columns which identify the PII found in that raw text.

It isn’t immediately obvious within the recipe, but before it will work, you must define a “Preset” which contains your AWS authentication information, as well as some parallelization limits. A preset can be created at several different levels of the hierarchy: the plugin level, the project level, or the recipe level. At the time of writing (plugin v1.0.0), we discovered a bug where project-level presets were not honored by the recipe itself, so we opted to specify the preset within the recipe. We reported this issue to Dataiku, and they were quick to confirm it.

With the Enron Email Dataset loaded into Snowflake, our authentication preset created, and our recipe configured, we can execute the recipe and inspect the output. After execution, the column set for our output table looks like this:

The column set for the output table

The original dataset contained “ROWID” and “CONTENT”, and the rest of the fields were created by the Comprehend recipe. The “entity” columns each contain an array of the detected sensitive data. The “response” column contains a JSON blob with some API response information, including an array of entity objects from which you can get the confidence score and the beginning and ending offsets of the sensitive data within its original text blob. Lastly, the error columns are populated when an individual record fails within the AWS API call.
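
To give a sense of what the redaction code below has to parse, here is a trimmed, illustrative example of the “response” blob for a single record. The field names follow the Comprehend Medical DetectPHI response as we understand it; the values are invented:

{
    "Entities": [
        {
            "BeginOffset": 25,
            "EndOffset": 37,
            "Score": 0.99,
            "Text": "555-867-5309",
            "Category": "PROTECTED_HEALTH_INFORMATION",
            "Type": "PHONE_OR_FAX"
        }
    ]
}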

Now that we’ve identified PII in text fields, we can write a custom recipe to replace the original content column with a redacted version of itself.

Redacting PII Data

Now that the PII/PHI has been identified in our sample of Enron emails, we need to write some custom code which will parse the “response” column and generate new raw text without the sensitive data. To accomplish this, we added a “Spark Scala” recipe to our flow, set the input to the output of the Comprehend recipe, and the output to a new managed table in Snowflake.

The core of the code to accomplish this is pretty straightforward:

// The DSS-generated boilerplate (SparkContext, sqlContext, dkuContext, and the
// sqlContext.implicits._ import needed for .as and .toDF) is assumed to precede this snippet.
import scala.util.parsing.json.JSON

// Recipe inputs
val enron_DETECTED_PHI = dkuContext.getDataFrame(sqlContext, "ENRON_DETECTED_PHI")

val enron_REDACTED_CONTENT = enron_DETECTED_PHI.select("ROWID", "CONTENT", "medical_phi_api_response").as[(Long, String, String)].map( r => {
    val rowid = r._1
    val content = r._2
    val json = r._3
    val newContent = new StringBuilder(content)

    // Parse the API response and extract the list of detected entities
    val obj = JSON.parseFull(json).getOrElse(Map()).asInstanceOf[Map[String, Any]]
    val ents = obj.getOrElse("Entities", List()).asInstanceOf[List[Map[String, Any]]]

    ents.foreach(e => {
        val start = e.getOrElse("BeginOffset", 0.0).asInstanceOf[Double].toInt
        val end = e.getOrElse("EndOffset", 0.0).asInstanceOf[Double].toInt

        // Overwrite each detected span with "X"s of equal length, so the
        // offsets of later entities remain valid after replacement.
        newContent.replace(start, end, ("X" * (end - start)))
    })

    (rowid, newContent.toString)
}).toDF("ROWID", "REDACTED_CONTENT")

// Recipe outputs
dkuContext.save("ENRON_REDACTED_CONTENT", enron_REDACTED_CONTENT)

Dataiku DSS provides us with some boilerplate to establish the DataFrames that represent the input and output datasets in the project’s flow. With those established, we simply transform each row by parsing the API-response JSON, then iterating over the entities identified in that JSON and replacing each detected piece of PII with a string of repeating “X” characters matching the length of the redacted span.

DSS Flow

After configuring and completing all of the recipes and datasets, we are left with a simple DSS flow that looks like this:

The DSS Flow to help identify PII data in text fields.

After executing this entire flow, we have a new dataset stored in Snowflake that matches the original one, with all of the sensitive data removed from each document. You could easily expand the redaction recipe to analyze the confidence score and selectively redact only information detected with high confidence (a minimal sketch of this follows the sample below). Here is a sample of what the redacted text looks like after execution:

This is a sample of what the redacted PII data looks like after we've executed the DSS Flow.
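
As a minimal sketch of the confidence-score idea mentioned above, the entity loop from the Scala recipe could filter on each entity’s score before redacting. This assumes the confidence is exposed under a “Score” field in the response and that a 0.9 threshold suits your data; treat both as assumptions to tune:

// Hypothetical threshold; tune for your own precision/recall trade-off.
val minScore = 0.9

ents.filter(e => e.getOrElse("Score", 0.0).asInstanceOf[Double] >= minScore)
    .foreach(e => {
        val start = e.getOrElse("BeginOffset", 0.0).asInstanceOf[Double].toInt
        val end = e.getOrElse("EndOffset", 0.0).asInstanceOf[Double].toInt

        // Redact only the spans we are sufficiently confident about.
        newContent.replace(start, end, ("X" * (end - start)))
    })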

Comprehend Medical Performance

Outside of a few oddities, the AWS Comprehend Medical API did a fine job with the small handful of records we selected to test. Bear in mind, this service is quite expensive: we calculated the overall cost to process the entire Enron email dataset at somewhere in the $12,000 neighborhood. So if you are just experimenting with a personal account, be very careful.

Here are some of the oddities we noticed:

  • SSNs were detected as phone numbers
  • An arbitrary number was detected as a date
  • An arbitrary number was detected as an ID
  • A department name was detected as an address
  • A labeled password was missed
  • Telephone extensions were missed

Even with those few oddities, we were impressed with these detections:

  • All dates
  • All email addresses
  • All mailing addresses
  • All people’s names
  • All phone numbers (including international, excluding extensions)
  • SSN and Birthday detected without much context (though the SSN was detected as a phone number)

Practical Application

With Snowflake quickly becoming a very popular enterprise data storage platform, and Dataiku DSS doing the same among data science platforms, it’s easy to see how combining the two makes for a very powerful business intelligence architecture. Dataiku DSS has a vast array of predefined recipes and plugins that enable point-and-click data analysis and data science.

At the same time, it allows for building custom recipes in a variety of languages, so your data team can build things that have never been done before. It’s easy to use where it can be, and it stays out of the way when the work requires it. Though the current version of the Snowflake connector has some rough edges, overall the combination allows you to do some amazing things with data from a multitude of sources.

Protecting sensitive data (like identifying PII in text fields so you can redact it) is just one of the many ways that this combination of tools can be beneficial to your organization. Dataiku DSS offers the ability to do everything from basic data transformations to advanced machine learning for video classification. If you’d like to know how phData can help you create an architecture like this for your data team, please contact us.

If you’re interested in learning more about machine learning, be sure to download our Ultimate MLOps Guide.
