This blog post was written by Mac Noland and Raghavendra Shyambhat.
Data centralization without careful metadata implementation is like stocking a warehouse without sorting and labeling all the boxes. Yes, you may have everything you need in there, but your end users will be wandering around lost.
For example, how would someone looking for manufacturing material know, without access to metadata, that they needed to use the MARA table from your SAP offload? And without metadata, how would they know when the table was last refreshed in the centralized data platform, or who to reach out to with questions?
In this post, we’ll be discussing how to implement metadata governance as a key pillar of data management, including:
- How to choose your metadata and iterate as needs change
- How to manage your metadata
- How to integrate with Navigator and/or Atlas
Determining What Metadata to Include
The first step, of course, is to decide which metadata categories to add, and to agree on a common data vocabulary and taxonomy. However, reaching consensus is usually easier said than done. As you sort out department-specific variations and custom usages, this step can all too easily spiral into hundreds of hours of discussion and debate among various parts of the organization.
At phData, we recommend starting simple. Define the most fundamental metadata requirements, implement them, and iterate from there. Build from the bottom up rather than from the top down; end users will educate the data product teams on what’s valuable versus what isn’t.
Here are the very basics we recommend starting with:
- Clear, Useful Description (e.g. Plant Material Table)
- Data Owner (e.g. Maple Grove Manufacturing)
- Last Updated Date (e.g. 2019-09-01 00:00:00)
- Source System (e.g. SAP, Salesforce)
- Personal Data (i.e. is it personal data such as PII, PHI, or PCI?)
Remember: zeroing in on the right level of metadata for your organization is an iterative process. Don’t hold up progress trying to figure everything out on day one. Instead, start with the essentials. Then, using the process outlined below, you can continuously tune metadata to the organization’s needs.
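As a minimal sketch, the five starter fields could be captured in a small record type like the one below. The class name, field names, and the `missing_fields` helper are our own illustrative choices, not part of any Cloudera tool; the sample values come from the examples above.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import List


@dataclass
class DatasetMetadata:
    """The five starter metadata fields (hypothetical names)."""
    description: str
    data_owner: str
    last_updated: datetime
    source_system: str
    personal_data: str  # e.g. "None", "PII", "PHI", or "PCI"

    def missing_fields(self) -> List[str]:
        """Return the names of fields that were left empty."""
        return [name for name, value in asdict(self).items() if value in ("", None)]


record = DatasetMetadata(
    description="Plant Material Table",
    data_owner="Maple Grove Manufacturing",
    last_updated=datetime(2019, 9, 1),
    source_system="SAP",
    personal_data="None",
)
print(record.missing_fields())
```

Starting with a fixed, small record like this makes it easy to add a field later, once end users tell you what they actually need.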
Managing Your Metadata
Ultimately, an end user’s data journey should start with a search of a standard metadata repository. When using the Cloudera platform, this would be either Navigator (CDH) or Atlas (CDP). Both of these tools allow end users to search for data and retrieve its metadata.
For example, imagine an end user searching for the MARA table to retrieve a description (i.e. General Material Data) and the date it was last updated. From there, the user would know where to find material data that was refreshed in early September. There are a number of ways to define this metadata and keep it updated.
At phData, we’d recommend defining a simple table and using that as the source of truth for metadata. Ideally this would be a Kudu table, so you could efficiently do inserts, updates, and deletes, though it could be any normal database table as well. It would look something like this:
| Database | Table | Column | Description | Owner | Last Updated Date | Source System | Personal Data |
|---|---|---|---|---|---|---|---|
| MyDB | Table1 | fname | First Name | Quality Systems | 2019-09-01 00:00:00 | Global Complaints | None |
| MyDB | Table1 | lname | Last Name | Quality Systems | 2019-09-01 00:00:00 | Global Complaints | None |
| MyDB | Table1 | ssn | Social Security Number | Quality Systems | 2019-09-01 00:00:00 | Global Complaints | Yes |
| YourDB | Table2 | id | Unique ID | Manufacturing | 2019-09-07 00:00:00 | MES | None |
| YourDB | Table2 | part_num | Part Number | Manufacturing | 2019-09-07 00:00:00 | MES | None |
This table is owned by the data platform team that gathers the information from the data owners. The data owners are responsible for defining the data. The data platform team then owns the process of updating it in the system.
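To make the idea concrete, here is a rough sketch of such a source-of-truth table. It uses an in-memory SQLite database purely as a stand-in for Kudu (or any normal database table); the table name, column names, and the `upsert_metadata` helper are assumptions made for illustration.

```python
import sqlite3

# Stand-in schema: one row per (database, table, column).
DDL = """
CREATE TABLE IF NOT EXISTS metadata (
    db_name       TEXT NOT NULL,
    table_name    TEXT NOT NULL,
    column_name   TEXT NOT NULL,
    description   TEXT,
    owner         TEXT,
    last_updated  TEXT,
    source_system TEXT,
    personal_data TEXT,
    PRIMARY KEY (db_name, table_name, column_name)
)
"""


def upsert_metadata(conn: sqlite3.Connection, row: tuple) -> None:
    """Insert a metadata row, replacing any existing row with the same key."""
    conn.execute(
        "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?, ?, ?, ?, ?, ?)", row
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
upsert_metadata(
    conn,
    ("MyDB", "Table1", "ssn", "Social Security Number", "Quality Systems",
     "2019-09-01 00:00:00", "Global Complaints", "Yes"),
)
```

Keying the table on database, table, and column means the platform team can re-run the same load whenever a data owner changes a definition, without creating duplicates.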
Integrating with Navigator and Atlas
Once the metadata is in a tabular format, we recommend integrating it with Navigator and/or Atlas for end users to search and refine. Both tools offer APIs making this easy to accomplish. In the example below, we’ll be using the Navigator API to upload the data.
The first step is to get the entity identifier for the column you want to add metadata to. This is done using a query on the table name (e.g. table1) and database name (e.g. default).
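A minimal sketch of such a lookup follows. It assumes the Navigator API exposes an `/entities` search endpoint that takes a Solr-style `query` parameter, and that columns can be matched with the `type` and `parentPath` fields; the host, port, API version, and function names are placeholders, so check them against your own deployment.

```python
import base64
import json
import urllib.request
from urllib.parse import urlencode

# Assumption: host, port, and API version are placeholders.
BASE_URL = "https://navigator.example.com:7187/api/v13"


def build_lookup_url(base_url: str, database: str, table: str) -> str:
    """Build an entity search URL for every column (field) of one table."""
    # Assumed Solr-style fields: `type` and `parentPath`.
    query = f'((type:field)AND(parentPath:"/{database}/{table}"))'
    return f"{base_url}/entities?{urlencode({'query': query})}"


def fetch_json(url: str, user: str, password: str):
    """Send a GET with HTTP basic auth and decode the JSON response."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


print(build_lookup_url(BASE_URL, "default", "table1"))
```

Only `build_lookup_url` runs here; `fetch_json` needs a reachable Navigator instance and valid credentials.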
This will return a JSON object with a lot of information in it. Below is an abbreviated example. Look for the “identity” field, which is the unique identifier for the column you want to add metadata to.
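For illustration, here is a sketch of pulling the identity out of a response. The sample payload is made up: the identity value is a placeholder, and the assumption that the response is a JSON array of entities with `identity` and `originalName` fields should be verified against a real response from your cluster.

```python
import json
from typing import Optional

# Hypothetical, heavily abbreviated response; real responses carry many more fields.
SAMPLE_RESPONSE = """
[
  {
    "identity": "hypothetical-id-123",
    "originalName": "ssn",
    "type": "FIELD",
    "parentPath": "/default/table1"
  }
]
"""


def find_identity(response_text: str, column: str) -> Optional[str]:
    """Return the identity of the entity whose originalName matches the column."""
    for entity in json.loads(response_text):
        if entity.get("originalName") == column:
            return entity["identity"]
    return None


print(find_identity(SAMPLE_RESPONSE, "ssn"))
```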
Now that you have the identity, you will use it to upload your table metadata for the column. Here is the command for using the Navigator API to upload this data.
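A minimal sketch of such an upload, assuming the Navigator API accepts a PUT to `/entities/{identity}` with a JSON body carrying a `description` and a `properties` map of user-defined key/value pairs. The base URL, identity value, and property keys are hypothetical, and authentication is left out for brevity.

```python
import json
import urllib.request
from typing import Dict


def build_update_request(
    base_url: str, identity: str, description: str, properties: Dict[str, str]
) -> urllib.request.Request:
    """Prepare (but do not send) a PUT that attaches metadata to one entity."""
    body = json.dumps({"description": description, "properties": properties})
    return urllib.request.Request(
        f"{base_url}/entities/{identity}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_update_request(
    "https://navigator.example.com:7187/api/v13",  # placeholder base URL
    "hypothetical-id-123",                         # placeholder identity
    "Social Security Number",
    {"owner": "Quality Systems", "source_system": "Global Complaints",
     "personal_data": "Yes"},
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it; building it separately keeps the payload easy to inspect and test.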
You can then query the column and see the newly added metadata.
Finally, you can use Navigator search to search the governed data you added. In this example, we are pulling back all the columns that have personal data in your data platform.
NOTE: The “up_” prefix on the search expression is to indicate you want “user-provided” data (i.e. the metadata you just uploaded).
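As a small sketch, a search expression for a user-provided property could be assembled as follows. The property key `personal_data` is a hypothetical name for the Personal Data column, and quoting values that contain spaces is our assumption about the Solr-style syntax.

```python
def user_property_query(key: str, value: str) -> str:
    """Build a search expression that matches a user-provided property."""
    # Quote values containing whitespace so they are treated as a single phrase.
    if any(ch.isspace() for ch in value):
        value = f'"{value}"'
    return f"up_{key}:{value}"


print(user_property_query("personal_data", "Yes"))
```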
If you’d like to try this yourself, phData provides a base application that reads from a Kudu table and then uses the Navigator APIs to upload this data into Navigator. This allows you to simply define metadata in a table and then automate the process of getting it into a searchable dashboard in Navigator. You can find it here: https://github.com/phdata/kudu-navigator-utility.
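The overall flow of such an automation can be sketched roughly as below. This is not the phData utility's actual code; the row keys and the `lookup_identity` and `upload` callables are hypothetical stand-ins for the table read and the Navigator API calls described earlier.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, str]


def sync_metadata(
    rows: Iterable[Row],
    lookup_identity: Callable[[str, str, str], Optional[str]],
    upload: Callable[[str, str, Dict[str, str]], None],
) -> Tuple[int, List[str]]:
    """Push every metadata row to the catalog; report rows with no matching entity."""
    uploaded = 0
    skipped: List[str] = []
    for row in rows:
        identity = lookup_identity(row["database"], row["table"], row["column"])
        if identity is None:
            # The column is not (yet) known to the catalog, so record and move on.
            skipped.append(f'{row["database"]}.{row["table"]}.{row["column"]}')
            continue
        properties = {
            "owner": row["owner"],
            "last_updated": row["last_updated"],
            "source_system": row["source_system"],
            "personal_data": row["personal_data"],
        }
        upload(identity, row["description"], properties)
        uploaded += 1
    return uploaded, skipped
```

Injecting the lookup and upload functions keeps the loop independent of any one catalog, so the same sketch could be pointed at Navigator or Atlas and exercised with fakes in a test.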
Adding metadata governance is critical to ensuring that your centralized data platform doesn’t turn into a data swamp, and that your end users don’t find themselves floundering. End users need clear, descriptive names for tables and columns; compliance officers need simple, efficient ways to search for personal data.
Here are three simple rules for ensuring success with your governance:
- Start simple, and build from the basics. If you start off trying to add hundreds of metadata definitions, you risk getting bogged down right from the beginning.
- Use a centralized table as a single source of truth, with definitions supplied by the data owners and the table itself owned by the data platform team.
- Automate the process of pulling data from the central source of truth to update Navigator or Atlas. Do NOT add metadata manually.
Following these three simple rules will ensure quick wins in metadata governance — and a solid foundation for your data platform to build on as it grows and evolves.