Implementing Metadata as Part of Data Management

Data centralization without careful metadata implementation is like stocking a warehouse without sorting and labeling all the boxes. Yes, you may have everything you need in there; but your end users will be wandering around lost.

For example, how would someone looking for manufacturing material know, without access to metadata, that they needed to use the MARA table from your SAP offload? And without metadata, how would they know when the table was last refreshed in the centralized data platform, or who to reach out to with questions?

In this post, we’ll be discussing how to implement metadata governance as a key pillar of data management, including:

How to choose your metadata and iterate as needs change
How to manage your metadata
How to integrate with Navigator and/or Atlas

Determining Metadata Taxonomy and Categories

The first step, of course, is to decide which metadata categories to add, and to agree on a common data vocabulary and taxonomy. However, reaching consensus is usually easier said than done. As you work through departmental variations and custom usages, it’s all too easy for this step to spiral into hundreds of hours of discussion and debate among various parts of the organization.

At phData, we recommend starting simple. Define the most fundamental metadata requirements, implement them, and iterate from there. Build from the bottom up rather than from the top down; end users will educate the data product teams on what’s valuable and what isn’t.

Here are the very basics we recommend starting with:

Clear, Useful Description (e.g. Plant Material Table)
Data Owner (e.g. Maple Grove Manufacturing)
Last Updated Date (e.g. 2019-09-01 00:00:00)
Source System (e.g. SAP, Salesforce)
Personal Data (i.e. is it personal data such as PII, PHI, or PCI?)

Remember: zeroing in on the right level of metadata for your organization is an iterative process. Don’t hold up progress trying to figure everything out on day one. Instead, start with the essentials. Then, using the process outlined below, you can continuously tune metadata to the organization’s needs.
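As a rough illustration, the five starter fields above could be captured in a small record type with a completeness check. This is only a sketch of one way to model them; the names `MetadataRecord` and `missing_fields` are hypothetical and not part of any phData or Cloudera tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class MetadataRecord:
    """Starter metadata for one governed object (hypothetical field names)."""
    description: str    # e.g. "Plant Material Table"
    data_owner: str     # e.g. "Maple Grove Manufacturing"
    last_updated: str   # e.g. "2019-09-01 00:00:00"
    source_system: str  # e.g. "SAP"
    personal_data: str  # e.g. "yes" or "no"


def missing_fields(record: MetadataRecord) -> List[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(record) if not str(getattr(record, f.name)).strip()]


record = MetadataRecord(
    description="Plant Material Table",
    data_owner="Maple Grove Manufacturing",
    last_updated="2019-09-01 00:00:00",
    source_system="SAP",
    personal_data="",  # left blank on purpose to show the check
)
print(missing_fields(record))
```

A check like this makes it easy to add a new category later: add a field, and any records that have not been filled in yet show up in the report.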

Defining Your Metadata Management System

Ultimately, end users’ data journeys should start with a standard metadata repository where they can search for and find data. On the Cloudera platform, this would be either Navigator (CDH) or Atlas (CDP). Both of these tools allow end users to search for data and retrieve its metadata.

For example, imagine an end user searching for the MARA table to retrieve a description (i.e. General Material Data) and the date it was last updated. From there, the user would know where to find material data that was refreshed in early September. There are a number of ways to define this metadata and keep it updated.

At phData, we’d recommend defining a simple table and using that as the source of truth for metadata. This table would be Kudu, ideally, so you could efficiently do inserts, updates, and deletes — though it could be any normal database table as well. It would look something like this:

Database	Table	Column	Description	Owner	Last Updated Date	Source System	Personal Data
MyDB	Table1	fname	First Name	Quality Systems	2019-09-01 00:00:00	Global Complaints	None
MyDB	Table1	lname	Last Name	Quality Systems	2019-09-01 00:00:00	Global Complaints	None
MyDB	Table1	ssn	Social Security Number	Quality Systems	2019-09-01 00:00:00	Global Complaints	Yes
YourDB	Table2	id	Unique ID	Manufacturing	2019-09-07 00:00:00	MES	None
YourDB	Table2	part_num	Part Number	Manufacturing	2019-09-07 00:00:00	MES	None

This table is owned by the data platform team that gathers the information from the data owners. The data owners are responsible for defining the data. The data platform team then owns the process of updating it in the system.
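To make the idea concrete, here is a minimal sketch of such a table, using an in-memory SQLite database as a stand-in for Kudu so that it runs anywhere. The table name `column_metadata`, the column names, and the `upsert` helper are our own illustrative choices, and the second refresh timestamp is made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for Kudu (or any database) purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE column_metadata (
        database_name TEXT NOT NULL,
        table_name    TEXT NOT NULL,
        column_name   TEXT NOT NULL,
        description   TEXT,
        owner         TEXT,
        last_updated  TEXT,
        source_system TEXT,
        personal_data TEXT,
        PRIMARY KEY (database_name, table_name, column_name)
    )
    """
)


def upsert(row: tuple) -> None:
    """Insert a row, or replace the existing row with the same primary key."""
    conn.execute(
        "INSERT OR REPLACE INTO column_metadata VALUES (?, ?, ?, ?, ?, ?, ?, ?)", row
    )


upsert(("MyDB", "Table1", "ssn", "Social Security Number",
        "Quality Systems", "2019-09-01 00:00:00", "Global Complaints", "Yes"))
# A later refresh of the same column updates the row instead of duplicating it.
upsert(("MyDB", "Table1", "ssn", "Social Security Number",
        "Quality Systems", "2019-09-08 00:00:00", "Global Complaints", "Yes"))

rows = conn.execute("SELECT column_name, last_updated FROM column_metadata").fetchall()
print(rows)
```

Keying the table on database, table, and column is what keeps it a single source of truth: each column has exactly one row, no matter how many times its metadata is refreshed.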

Integrating with Navigator and Atlas

Once the metadata is in a tabular format, we recommend integrating it with Navigator and/or Atlas so end users can search and refine it. Both tools offer APIs that make this straightforward. In the example below, we’ll be using the Navigator API to upload the data.

The first step is to get the entity identity for the column you want to add metadata to. This is done by querying on the column name (e.g. col1) and its parent path, which combines the database name (e.g. default) and the table name (e.g. table1).

				
curl 'http://yournavigatorserver:7187/api/v13/entities?query=(originalName:col1)AND(parentPath:"/default/table1")' -u username:password -X GET

This returns a JSON array with a lot of information in it. Below is an abbreviated example. Look for the “identity” field, which is the unique identifier for the column you want to add metadata to.

				
[ {
  "originalName" : "col1",
  "parentPath" : "/default/table1",
  "identity" : "104372948"
} ]
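If you are scripting this step rather than running curl by hand, the lookup can be sketched in Python as follows. This is a minimal sketch that only builds the URL and parses a response body; the helper names `entity_lookup_url` and `extract_identity` are hypothetical, and the host is the same placeholder used in the curl example.

```python
import json
from urllib.parse import quote

NAVIGATOR_BASE = "http://yournavigatorserver:7187/api/v13"  # placeholder host


def entity_lookup_url(database: str, table: str, column: str) -> str:
    """Build the entities search URL for one column."""
    query = f'(originalName:{column})AND(parentPath:"/{database}/{table}")'
    # Percent-encode the query so quotes, colons, and slashes are URL-safe.
    return f"{NAVIGATOR_BASE}/entities?query={quote(query, safe='')}"


def extract_identity(response_body: str) -> str:
    """Return the identity of the single matching entity in a JSON response."""
    entities = json.loads(response_body)
    if len(entities) != 1:
        raise ValueError(f"expected exactly one entity, got {len(entities)}")
    return entities[0]["identity"]


# Abbreviated response body, shaped like the example above.
sample = '[ { "originalName": "col1", "parentPath": "/default/table1", "identity": "104372948" } ]'
print(entity_lookup_url("default", "table1", "col1"))
print(extract_identity(sample))
```

Raising an error when the query matches zero or several entities is a deliberate choice here: attaching metadata to the wrong column is worse than stopping.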

Now that you have the identity, you will use it to upload your table metadata for the column. Here is the command for using the Navigator API to upload this data.

				
curl 'http://yournavigatorserver:7187/api/v13/entities/104372948' \
-u username:password \
-X PUT \
-H "Content-Type: application/json" \
-d '{
  "description": "First Name",
  "properties": {
    "data_owner": "Quality Systems",
    "last_update_date": "2019-09-01 00:00:00",
    "source_system": "Global Complaints",
    "personal_data": "yes"
  }
}'
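The same upload could be prepared from a row of the metadata table. The Python sketch below builds the PUT request with the standard library but does not send it; `build_update_request` and the row keys are hypothetical names, and the credentials are the placeholders from the curl command.

```python
import base64
import json
import urllib.request

NAVIGATOR_BASE = "http://yournavigatorserver:7187/api/v13"  # placeholder host


def build_update_request(identity, row, username, password):
    """Build (but do not send) a PUT request that sets metadata on one entity."""
    payload = {
        "description": row["description"],
        "properties": {
            "data_owner": row["owner"],
            "last_update_date": row["last_updated"],
            "source_system": row["source_system"],
            "personal_data": row["personal_data"],
        },
    }
    # HTTP basic auth, equivalent to curl's -u username:password.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{NAVIGATOR_BASE}/entities/{identity}",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
    )


row = {
    "description": "First Name",
    "owner": "Quality Systems",
    "last_updated": "2019-09-01 00:00:00",
    "source_system": "Global Complaints",
    "personal_data": "yes",
}
request = build_update_request("104372948", row, "username", "password")
print(request.get_method(), request.full_url)
print(request.data.decode())
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```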

You can then query the column and see the newly added metadata.

				
curl 'http://yournavigatorserver:7187/api/v13/entities?query=(originalName:col1)AND(parentPath:"/default/table1")' -u username:password -X GET

[ {
  "originalName" : "col1",
  "parentPath" : "/default/table1",
  "description" : "First Name",
  "properties" : {
    "personal_data" : "yes",
    "source_system" : "Global Complaints",
    "data_owner" : "Quality Systems",
    "last_update_date" : "2019-09-01 00:00:00"
  },
  "identity" : "104372948",
  "internalType" : "hv_column"
} ]

Finally, you can use Navigator search to find the governed data you added, for example to pull back all the columns in your data platform that contain personal data.

NOTE: The “up_” prefix on a search expression indicates that you want “user-provided” data (i.e. the metadata you just uploaded).
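As a sketch of what such a search might look like through the API, the snippet below builds a search expression for a user-provided property. We are assuming here that the same entities endpoint used earlier accepts “up_” expressions; the helper names are hypothetical, and the property name and value match the ones uploaded above.

```python
from urllib.parse import quote

NAVIGATOR_BASE = "http://yournavigatorserver:7187/api/v13"  # placeholder host


def user_property_query(name: str, value: str) -> str:
    """Build a search expression for a user-provided ("up_") property."""
    return f"up_{name}:{value}"


def search_url(expression: str) -> str:
    """Build an entities search URL (assumed endpoint) for an expression."""
    return f"{NAVIGATOR_BASE}/entities?query={quote(expression, safe='')}"


expression = user_property_query("personal_data", "yes")
print(expression)
print(search_url(expression))
```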

If you’d like to try this yourself, phData provides a base application that reads from a Kudu table and then uses the Navigator APIs to upload this data into Navigator. This allows you to simply define metadata in a table and then automate the process of getting it into a searchable dashboard in Navigator.
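To give a feel for the general shape of this kind of automation, here is a minimal sketch of a sync loop. It is not phData’s application: `sync_metadata` and its callables are hypothetical, and the catalog calls are faked with a dictionary so the sketch runs without a Navigator server.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, str]


def sync_metadata(
    rows: Iterable[Row],
    lookup_identity: Callable[[str, str, str], Optional[str]],
    push_metadata: Callable[[str, Row], None],
) -> List[Row]:
    """Push each row to the catalog; return rows whose entity was not found."""
    unmatched = []
    for row in rows:
        identity = lookup_identity(row["database"], row["table"], row["column"])
        if identity is None:
            unmatched.append(row)  # report instead of failing the whole run
            continue
        push_metadata(identity, row)
    return unmatched


# Fake catalog calls so the sketch runs without a Navigator server.
known = {("default", "table1", "col1"): "104372948"}
pushed = {}

rows = [
    {"database": "default", "table": "table1", "column": "col1", "description": "First Name"},
    {"database": "default", "table": "table1", "column": "col9", "description": "Not in catalog"},
]
unmatched = sync_metadata(
    rows,
    lookup_identity=lambda db, tbl, col: known.get((db, tbl, col)),
    push_metadata=lambda identity, row: pushed.update({identity: row["description"]}),
)
print(pushed)
print([r["column"] for r in unmatched])
```

Passing the lookup and upload steps in as functions keeps the loop independent of the catalog, so the same structure could in principle be pointed at Navigator or Atlas.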

Conclusion

Building a data governance and metadata management system is critical to ensuring that your centralized data platform doesn’t turn into a data swamp, and that your end users don’t find themselves floundering. End users need clear, descriptive names for tables and columns; compliance officers need simple, efficient ways to search for personal data.

Here are three simple rules for ensuring success with your governance:

Start simple, and build from the basics. If you start off trying to add hundreds of metadata definitions, you risk getting bogged down right from the beginning.
Use a centralized table, with definitions supplied by the data owners and maintained by the data platform team, as a single source of truth.
Automate the process of pulling data from the central source of truth to update Navigator or Atlas. Do NOT add metadata manually.

Following these three simple rules will ensure quick wins in metadata governance — and a solid foundation for your data platform to build on as it grows and evolves.

This blog post was written by Mac Noland and Raghavendra Shyambhat.
