December 16, 2021

What’s a Data Catalog and How to Choose the Right One

By Christine Carroll

Whether your business is moving to the cloud, has just completed its migration, or has been established there for a little while, you are likely wondering, “What data catalog tool is best for me?”

The short answer is…it depends. 

There are a lot of options available, and choosing the right data catalog for your business depends heavily on:

  • What drives your business
  • Your data needs
  • Your unique data culture
  • How you can support your data

To provide you with the best possible chance of success on your data catalog journey, this blog series takes a closer look at several questions. This first blog covers:

  • What is a data catalog and why is it so important?
  • How to set goals for your data catalog
  • How to establish business drivers for your data catalog
  • What value does a data catalog provide your business?

In blog 2, we’ll explore: 

  • Who are the users of a data catalog?
  • What team will support your data catalog the best?

In blog 3, we’ll look into:

  • What data catalog options are available?
  • How to continue your data catalog journey after launch?

What is a Data Catalog and Why is it Important?

A data catalog is a tool to store and access metadata about your data. To better understand data catalogs, let’s use a relatable example.

Have you ever walked into a brand new store with a very particular shopping list in mind? Have you had the frustration of walking down aisle after aisle (backtracking even) only to end up with a percentage of your list and a major headache to boot?

Now compare that to shopping at a well-established online store. You have the luxury of a search bar to automatically find everything on your list. Each product has a description and details like product dimensions, weight, color, and even reviews to help ensure you’re getting what you expect. You can even search on those attributes, allowing you to find a more niche item like gluten-free, dairy-free bread.

What was once a time-consuming, frustrating task that often ended with products not being found has been simplified because the business catalogs its products and their details internally. That catalog is then served up through an easy-to-navigate UI with the ability to search. As a result, you will usually complete your shopping list and even find a couple of extra items you didn’t know you needed.

A data catalog is very similar to the example above: it brings ease of use and discoverability to a business’s data.

Looking for data in a major organization can often mirror going to an unknown store. When you want to find a particular subject, where do you start? Who do you ask? What happens when all the columns are in acronyms? How do you begin to make sense of it? Is the data you found any good? Or has it not been updated in 10 years? The primary task of a data catalog is to provide metadata (information about your data) throughout your organization. It provides a place to gather, explore, enhance, and maintain your metadata, and it offers one centralized place to find answers quickly and accurately.

A data catalog’s main purpose is to alleviate most of this frustration by working alongside your data. 

What is Metadata?

Metadata is information about your data. Its value is usually easiest to see when it is missing.

A table of data that's missing a few key data points.

Metadata has been around for decades and is crucial to nearly every industry, even to those outside of data engineering. Examples of metadata include a librarian cataloging their books or an art collector detailing provenance on a particular piece of art. 

What are the Different Types of Metadata?

Over its long history in many different fields, metadata has settled into three major categories: descriptive, structural, and administrative.

Descriptive 

Descriptive metadata is information that allows the data to be easily found, explored, and understood. For data, this could be:

  • A description of what the dataset contains
  • Definitions for vocabulary terms found in the dataset
  • Annotations or notes about the dataset
  • Ownership of the data
  • Whether the dataset contains Personally Identifiable Information (PII)

Structural 

Structural metadata is information that assists in the navigation of the data and helps determine how everything is set up. Structural metadata can help answer questions like:

  • What are the data types?
  • What is the primary key?
  • What restrictions are placed on the data values?
  • How is the dataset connected to other datasets?

Administrative

Administrative metadata is information about maintaining the data. Administrative metadata can help answer questions like:

  • How big is the dataset?
  • Where is it located?
  • What is the data lineage?
  • Who is allowed to access it?
  • How often is it updated?

Metadata Automation

Looking at the different types of metadata, some things might stand out as information that can be automatically gathered. For example, a table schema can give you information about the data types and restrictions on a table. Materialized views and ETL jobs can give you insight into data lineage.
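As a rough illustration of automatic gathering, the sketch below reads structural metadata straight from a table schema. It assumes a SQLite database purely for convenience; the `customers` table and the `harvest_structural_metadata` helper are hypothetical names invented for this example, and a real data catalog would use its own connectors.

```python
import sqlite3


def harvest_structural_metadata(conn, table):
    """Collect column-level structural metadata for one SQLite table.

    The table name is interpolated directly, so it must come from a
    trusted source (this is a sketch, not production code).
    """
    # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {
            "column": name,
            "data_type": col_type,
            "not_null": bool(notnull),
            "primary_key": pk > 0,
        }
        for _cid, name, col_type, notnull, _default, pk in rows
    ]


# Hypothetical table, created in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    " customer_id INTEGER PRIMARY KEY,"
    " email TEXT NOT NULL,"
    " signup_date TEXT)"
)

schema_metadata = harvest_structural_metadata(conn, "customers")
for column in schema_metadata:
    print(column)
```

Data types, the primary key, and NOT NULL restrictions all come from the schema itself, with no human input required.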

At the same time, there is some metadata management that will need to be driven by humans. What does certain data mean? What team supports it?

A data catalog provides both the tools to automatically gather available information and an environment where your team can give the metadata additional meaning.

When getting started with a data catalog, it is vital to understand that it is not just a tool you install, switch on, and leave to run. It’s a journey to understand what this tool will mean to your company and how it fits into your company’s data culture. To do this, it is important to understand the value you are after and set realistic goals.

What Are The Goals for Your Data Catalog?

Now that you have a pretty good idea about what a data catalog is, it’s a good time to understand what you hope to achieve with one. Goals should be determined by what will bring you the most value.  

The value of a data catalog can change from company to company. To determine the value, you have to discover what is important to your company and the problem you are trying to solve with a data catalog.  

Data catalogs are often considered a cost center, meaning a program or department that costs money but doesn’t produce it. These types of programs can be lost or forgotten in the daily grind and are easily cut when times are tough. Because of this, it is important to establish the value a data catalog can bring, measure it, communicate its value, and continue to sell it throughout your company. This can be accomplished by aligning your goals for the data catalog with key business drivers.

Your company has important objectives that are essential to its success. These drivers will sometimes be well circulated, articulated, and understood throughout the company. Other times, it can feel like they vary widely from department to department.

Alignment with Business Drivers

So what are the business drivers at your company? Every company is different, and this can take time to figure out. But take the time! This is information that will be valuable even beyond your data catalog. If you can prove a value that aligns with your company, you will have people’s attention. For a place to start, there are usually three basic types of company drivers:

  • Increase revenue 
  • Manage cost and complexity
  • Reduce risk

Increase Revenue

Money drives everything. If your company is not profitable, you won’t have a company for long. What drives revenue will differ from business to business, but aligning your data strategy with revenue makes a powerful argument. Examples include using data to retain current customers, find new customers, or justify prices. Another revenue driver could be the ability to track customers across the different lines of your business.

Manage Cost and Complexity

A major way a data catalog can help to reduce cost and complexity is by just knowing what data you have. For example, you might have 14 customer tables in your company. How do you tell which one is the most up-to-date? What is the purpose of each? Are they maintained? Can you centralize them into a single source?

A data catalog can help establish what standards are used in your data.

How are your dates stored in every table? Are some tables storing in local time while others are storing in UTC? How does your data represent null values? Establishing what is out there can help form standards and policies to create a baseline for all your data to work together.

Data is often moved and transformed. How far removed is your data from the original source? What transformations are being done before it gets to you (including transformations not triggered by you)? Are other teams doing similar transformations? A data catalog can help determine what is happening to your data, ultimately helping you streamline it into one simple process.

Lastly, maybe your data is just really hard to find. Every time you need something, it eats away at work hours. Employees are frustrated, and critical systems might be down for longer than they should. These frustrations are all related to costs. It is more cost-effective to have a Data Scientist analyzing data than searching for it.

Reduce Risk

Industries everywhere have regulations and audit standards that they have to follow to remain in compliance and stay in business — data is no exception. For instance, in 2018, the California Consumer Privacy Act was signed into law. This act gives residents of California the right to:

  • Know the personal information a company collects on them and how it is used and shared with other companies
  • Delete personal information collected about them
  • Opt-out of the sale of their personal information
  • Non-discrimination for exercising these rights

In 2020, the requirements were expanded to include:

  • Reasonably minimize data collection
  • Limit data retention
  • Increase data security by requiring companies to conduct risk assessments and audits that can be submitted to regulators

Protecting a customer’s data is not, on its own, sufficient to meet California regulations. The company also needs to know where the customer’s data is located, how it is being used, and how it is being shared, and it must be able to remove that data. A data catalog can help you track datasets with PII data, where it is being consumed, and any manipulation that is occurring. It is better to be proactive than reactive when it comes to a known risk for your company.

How to Determine Important Business Drivers for Your Data Catalog

There isn’t necessarily a business driver that is more important than another, and you should not be restricted to just one. The more business drivers, the better! The key thing is to understand what is important to your business and create goals around how a data catalog can help. 

For instance, do you need your data catalog to track PII data? Do you have audits that require an understanding of how data is changed from one area to another? Are you a data company that needs to be able to find data fast in order to provide new offerings for your customers? Is your company bogged down by miscommunication and duplication of data? Align your goals with what can provide value. 

As you determine your business drivers, meet with other departments and the potential data users of your data catalog tool. Talk to them about the problems they face when trying to work with data for their job. All metrics should be gathered and documented. For instance, how long does it take for your Data Scientists to find what they need? Do they always find it on the first try? What things have completely wasted their time? 

These metrics do not have to be tied to a dollar amount. Even a simple stat, such as a Data Scientist spending X amount of time every day trying to find the information they need, will be helpful. These meetings can also help you gather support and allies for your data catalog tool. Having more people excited about your tool can be invaluable as you onboard others.

As you gather your metrics, establish your value, and get to setting up your goals, do not try to boil the ocean. This is most likely a very large project and something that will not be completed overnight. Instead, make sure you come up with short-term goals as well as long-term goals. Ensure that you are able to show value while you are working towards your larger data culture goals.  

Now that we have a better idea of the goals of a data catalog, let’s explore the value it can provide.

What Value Can a Data Catalog Provide? 

In this section, we’re going to uncover a few of the key (value-driving) advantages of a data catalog and explore some best practices to keep in mind. It’s important to note that these advantages will depend largely on your business drivers. 

Data Lineage

Data lineage is the understanding of how data gets to its destination and the transformations that occurred to get it there. Tracking this process is important for many reasons, including:

  • Auditing and government regulations – There may be requirements to show where data is transferred and what manipulations are being done to it.
  • Creating trust in data – Knowing more about where your data is coming from and the transformations that occur can increase the level of trust in that dataset.
  • Simplifying processes and removing duplication – Reviewing the data lineage might reveal that data has been changed one way just to be changed back later down the road. Transformations that are no longer needed, or whose results are ignored, might still be happening to the data. Having a clear view of the process can help someone clean up and simplify if needed.
  • Downstream Impact – When making changes to data, decommissioning a dataset, or figuring out its value, it is important to know who is downstream and who will be impacted by your choices. If you decided to change your data from a daily load to a monthly load, who might be impacted? If you change the type of data in a certain column, what might you break?
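The downstream impact question can be sketched as a simple graph walk. The example below assumes lineage is available as a mapping from each dataset to the datasets built from it; the dataset names and the `downstream_of` helper are hypothetical and only meant to show the idea, not how any particular data catalog stores lineage.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets built from it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.daily_sales", "marts.customer_ltv"],
    "marts.daily_sales": ["dashboards.revenue"],
    "marts.customer_ltv": [],
    "dashboards.revenue": [],
}


def downstream_of(dataset, lineage):
    """Return every dataset reachable downstream of `dataset`, sorted by name."""
    seen = set()
    queue = deque([dataset])
    # Breadth-first walk over the lineage edges.
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Who is affected if staging.orders moves from a daily to a monthly load?
impacted = downstream_of("staging.orders", LINEAGE)
print(impacted)
```

Running the same walk before decommissioning a dataset gives a list of consumers to notify first.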

Owning the Narrative

Speaking of knowing what’s downstream, imagine making a choice that did break someone consuming the data. Many of us have been there: we broke a pipeline. When data hasn’t been updated or if it has been corrupted, who do you talk to? Who do you inform? How do you fix it and get anyone affected back on track?

If you don’t contact these people first, you could add to a culture of distrust in the data. Establishing ownership of your data can help your business come up with a plan of attack when something unsavory happens. Ultimately, this allows you to address the problem, discover how it occurred, lay out steps to fix it, and ensure it doesn’t happen again.

Scanning Different Data Sources

Most data catalogs are designed to work with many different data sources. Your data catalog can connect to these sources and help you determine what is in each. This process can also help if you are moving to a single data source, like the Snowflake Data Cloud. Your data catalog can help determine if a dataset is worth moving and what data is most valuable to your company.

Creating and Enforcing Policies

Before you start your data catalog journey (or even along the way), you will find that certain data points are necessary and valuable for your company. These requirements can be made into policies and built into your data catalog as well as your company’s data culture. One example of such a policy is a tagging strategy: before a dataset is considered production-ready, it is required to have these tags:

  • Data Owner
  • Data Team Email
  • Data Source
  • Contains PII

Establishing these policies can create a culture of trustworthy data.
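A tagging policy like the one above can be checked mechanically. The sketch below assumes each dataset’s tags are available as a simple dictionary; the helper names and the sample datasets are hypothetical, and only the four tag names come from the list above.

```python
# The four tags required before a dataset counts as production-ready.
REQUIRED_TAGS = ("Data Owner", "Data Team Email", "Data Source", "Contains PII")


def missing_tags(tags):
    """Return the required tags that are absent or blank for a dataset."""
    return [
        tag
        for tag in REQUIRED_TAGS
        if tag not in tags or str(tags[tag]).strip() == ""
    ]


def is_production_ready(tags):
    """A dataset passes the policy only when no required tag is missing."""
    return not missing_tags(tags)


# Hypothetical datasets, for illustration only.
orders_tags = {
    "Data Owner": "Sales Ops",
    "Data Team Email": "sales-data@example.com",
    "Data Source": "ERP export",
    "Contains PII": False,
}
draft_tags = {"Data Owner": "Marketing", "Data Source": ""}

print(is_production_ready(orders_tags))
print(missing_tags(draft_tags))
```

Note that `False` is a valid value for "Contains PII"; the check only rejects tags that are absent or blank.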

Data Quality

It is not rare to think you have found the perfect dataset, only to discover that just half of the data is populated, or that values are duplicated or even unknown. We were once asked to investigate a dataset that provided the number of days a judge would take to reach a verdict. The problem our business partners were seeing was that the median number of days was often coming out as 0. It turned out that the number of days defaulted to 0 when no other value could be determined, so many data points were coming in with that default. It was actually rare for the median not to be 0.

Data catalogs can often analyze your data for different metrics. What is the distribution of a certain column? How often is it populated? How many duplicates can be found?
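To make these profiling questions concrete, here is a minimal sketch of a column profile. The `profile_column` helper and the sample values are hypothetical (they are not the data from the verdict example); the point is only to show how a default value of 0 can pull the median down to 0.

```python
from collections import Counter
from statistics import median


def profile_column(values, default=None):
    """Summarize how populated, duplicated, and default-heavy a column is."""
    populated = [v for v in values if v is not None]
    non_default = [v for v in populated if v != default]
    counts = Counter(populated)
    return {
        "rows": len(values),
        # Share of rows that hold any value at all.
        "fill_rate": len(populated) / len(values) if values else 0.0,
        # Number of values that repeat an earlier value.
        "duplicate_values": sum(c - 1 for c in counts.values() if c > 1),
        # Share of populated rows that only hold the default.
        "default_rate": counts[default] / len(populated) if populated else 0.0,
        "median": median(populated) if populated else None,
        "median_excluding_default": median(non_default) if non_default else None,
    }


# Hypothetical "days to verdict" values where 0 means "could not be determined".
days_to_verdict = [0, 0, 0, 0, 12, 30, 45, None]

profile = profile_column(days_to_verdict, default=0)
for metric, value in profile.items():
    print(metric, value)
```

Comparing the median with and without the default value is one quick way to spot that a placeholder is distorting a statistic.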

Establish a Gold Standard

It is not uncommon to find duplication of datasets or subject matters across a company. This can happen organically: different lines of business keep data that is relevant to them, and it later turns out to overlap with data in other areas of the business. With a data catalog, finding these duplicate datasets becomes easier. The datasets can then be analyzed to determine which is the most complete, and the data catalog can point users to the correct one.

Capture Implicit Knowledge Built Into the Company

Too often, all the knowledge for a dataset boils down to one individual. They have been at the company for a while and can work the dataset in their sleep. Their knowledge might have been shared in emails, in SharePoint, or on whiteboards. Data catalogs can often offer a single place for this knowledge to live alongside other metadata for the dataset. It can be as intricate as a wiki or as simple as a textbox, depending on the data catalog.

Create and Centralize a Business Glossary

A business glossary is a group of terms defined by your company or through a department. This is highly useful for industries that have a lot of specialized jargon or acronyms. 

Self Service

Perhaps one of the biggest values a data catalog can offer is a self-service option. Imagine having just one place for you and your business partners to easily find metadata about data.  Self-service can cut down on time looking for data and dealing with misdirections when requesting access to that data.

Metrics

A data catalog might be able to provide you with insights about your data, but what about insights into how the catalog itself is used? Which datasets are most requested? What terms are searched for most often? These metrics could help you determine what data is valuable and should be invested in.

You can also identify potential members for your team or future stewards. Who does the most searching? Who contributes the most to wikis or adds the most meaningful metadata? The people who contribute the most to your data catalog might be good candidates to focus on it as data stewards.
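As a small sketch of these usage metrics, the example below counts search terms and searchers. It assumes the catalog can export a search log as (user, term) pairs; the log entries and user names are hypothetical.

```python
from collections import Counter

# Hypothetical search log exported from a data catalog: (user, search term).
search_log = [
    ("ana", "customer churn"),
    ("ben", "Customer Churn"),
    ("ana", "invoice totals"),
    ("cy", "customer churn"),
    ("ana", "store locations"),
]

# Normalize case and whitespace so the same term is counted once.
term_counts = Counter(term.strip().lower() for _user, term in search_log)
user_counts = Counter(user for user, _term in search_log)

top_term = term_counts.most_common(1)
top_searcher = user_counts.most_common(1)

print(top_term)
print(top_searcher)
```

The most-searched terms hint at which data deserves investment, and the most active searchers are natural people to talk to about stewardship.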

In Closing

We just went through a lot of background content about what a data catalog is and what value it can bring. Now that you have a better understanding of data catalogs, don’t miss the next blog in the series that explores the users of a data catalog and what team will be best to support it. 

Need Expert Help Making Your Data Catalog a Success?

phData has years of experience helping businesses of all shapes and sizes unlock more value from their data. Whether you need help building an actionable data strategy or advice on how to make your data catalog a smashing success, our data experts would love to help!

FAQs

A data dictionary is the start of a data catalog. A data dictionary is information describing your data, such as its content, structure, relationships, origin, and format. This information is usually stored in tables, schemas, Excel worksheets, and various other, sometimes random, places. A data catalog is the next step. It gathers all of this information, builds on it with the help of stewards and AI-driven processes, and makes it searchable in one convenient place. It creates an environment of collaboration to help your metadata grow and strengthen.
