December 16, 2021

What Should I Look For in a Data Catalog Tool?

By Christine Carroll

In our previous blog in this series, we spent a lot of time exploring why a data catalog is valuable and who you might need to support it. 

With that background in mind, we’re ready to look at some actual tools and work out which data catalog is best for your business.

What Things Should I Consider Before Choosing a Data Catalog Tool?

To place one tool above the rest, you’ll have to look beyond the feature list. One of the most crucial factors in choosing a data catalog is how the tool is designed from an infrastructure standpoint, because that determines how it will fit into your current systems and what you will need to support it.

Here are a couple of common offerings you might find:

SaaS

With a SaaS data catalog, upgrades, outages, scaling, and other support operations fall to the provider. This frees your company to work with the tool and breeze through onboarding, but it leaves a number of things out of your control. As a result, it is important to know what is happening to your data. Is your data just being scanned, or is it being moved? Where is your metadata stored? How easy is it to offboard your data if you choose another tool?

With a SaaS tool, it is important to figure out if and how the tool will meet the security and regulatory requirements for your data and your business.

Cloud-Native

If the setup and support of the tool will remain with your team and you are working primarily in the cloud, it is important to consider if the tool has a cloud-native design. Tools built specifically for the cloud will often take advantage of autoscaling, failover, cheap storage, and other noteworthy features the cloud can easily offer.  

Non-cloud-native tools can be stood up in the cloud, but they often follow a lift-and-shift model: the tool was built for on-prem and does not take advantage of some of the benefits the cloud offers. A cloud-native architecture can also make infrastructure as code (IaC) possible. With IaC, the setup, installation, and configuration of the data catalog tool can be done automatically using templates and scripts. This gives you a repeatable way to stand up your data catalog across environments or redeploy the tool to its initial state.

What Are the Common Payment Options for Data Catalog Tools?

Besides supporting the tool, another important aspect is how and what you will be paying for it. Let’s explore a few common models.

Subscription Model

How does the tool charge for its product? This matters because it will ultimately determine how widely you open your data catalog to your company. Is the subscription based on the number of users? Is it priced per individual user, or does it have tiers, such as one price for up to 30 users? A user-based subscription can make companies hesitant to embrace things like crowdsourcing, and it often leads to opening the tool only to a select few highly qualified individuals.

Other subscription models are based on the number of data sources, which often shifts the focus to the most valuable data sources while leaving out others.
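To make the trade-off concrete, here is a minimal Python sketch comparing per-user, tiered, and per-source pricing. Every price, tier size, and function name below is a hypothetical placeholder for illustration, not a figure from any vendor.

```python
import math


def per_user_cost(users: int, price_per_user: float) -> float:
    """Cost when every user needs a paid seat."""
    return users * price_per_user


def tiered_cost(users: int, tier_size: int, price_per_tier: float) -> float:
    """Cost when seats are sold in blocks, e.g. one price for up to 30 users."""
    tiers = math.ceil(users / tier_size)
    return tiers * price_per_tier


def per_source_cost(sources: int, price_per_source: float) -> float:
    """Cost when the subscription is based on connected data sources."""
    return sources * price_per_source


# Hypothetical prices: 50 per user, or 1200 per block of 30 users.
for users in (10, 30, 31, 200):
    print(users, per_user_cost(users, 50.0), tiered_cost(users, 30, 1200.0))
```

Notice how the tiered price jumps when the 31st user is added. That kind of step is what can make a team think twice before opening the catalog to one more person.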

It is also important to understand what kind of technical assistance comes with your subscription and what you will need to get your data catalog running and supported. Does a pricier subscription come with more technical support, or is technical support something you can pay for by the hour?

A data catalog website will often show off its flashiest features, but not everything. The website might not dive into what type of infrastructure is supported, or whether the tool is a SaaS or cloud-native solution. The website will also almost never go into pricing. Pricing details are usually hidden behind an NDA.

What Are the Best Data Catalog Tools?

The right tool for you will be determined by the business needs and goals you gathered earlier. To evaluate the best possible data catalog tool for your business, we recommend that you come up with a list of criteria you would like to evaluate and then contact a list of vendors all at once. When you meet with the vendors, try to include the people who will be supporting the data catalog and any relevant business partners.

After you and your team have met with a vendor, regroup as soon as possible. Go over the things you liked and disliked about the tool while they are still fresh in your minds. Do not make any decisions until you have heard from all your vendors, and make sure they know you are shopping around.
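One simple way to keep those post-demo discussions comparable is a weighted scorecard. The Python sketch below is only an illustration: the criteria, weights, vendor names, and ratings are all made up, and you would replace them with your own list.

```python
from typing import Dict


def weighted_score(weights: Dict[str, float], ratings: Dict[str, float]) -> float:
    """Weighted average of a vendor's ratings, on the same scale as the ratings."""
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


# Hypothetical criteria and weights (higher weight = more important to you).
weights = {"infrastructure_fit": 3, "pricing_model": 2, "lineage": 2, "collaboration": 1}

# Hypothetical ratings (1 to 5) collected right after each vendor meeting.
ratings = {
    "Vendor A": {"infrastructure_fit": 4, "pricing_model": 3, "lineage": 5, "collaboration": 4},
    "Vendor B": {"infrastructure_fit": 5, "pricing_model": 2, "lineage": 3, "collaboration": 5},
}

# Rank vendors from highest to lowest weighted score.
ranked = sorted(ratings, key=lambda v: weighted_score(weights, ratings[v]), reverse=True)
for vendor in ranked:
    print(vendor, round(weighted_score(weights, ratings[vendor]), 2))
```

A scorecard like this does not make the decision for you, but it does force the team to agree on what matters before the vendor with the flashiest demo wins by default.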

With that being said, here is a list of popular data catalog tools and a brief overview of each:

Alation

Alation launched its first data catalog solution in 2015 and followed it with a SaaS offering in April 2021. Alation gives your business the option to either manage the data catalog yourself or let Alation manage it for you. Listed below are some of the interesting features we found:

Query Log Ingestion and the Behavioral Analysis Engine (BAE)

Alation has a tool that will ingest and parse queries. This feature will help build lineage and will let you see any current usage patterns. You can determine which datasets are popular and might be good candidates to bring into the data catalog right away. With these usage patterns, you can also find potential stewards for your data catalog team by seeing who works with the data most often.

Once Alation is being used to curate metadata, machine learning pattern recognition is used to determine usage patterns inside the catalog itself. This will show popularity rankings and help in recommending other data sets to users based on their individual usage patterns.
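To give a feel for the general idea behind query log ingestion, here is a rough Python sketch that counts which tables appear in a query log and who queries them most. This is not Alation's implementation. The log format, the regex-based parsing, and the function name are our own simplifying assumptions, and a real tool would use a proper SQL parser.

```python
import re
from collections import Counter

# Matches the table name following FROM or JOIN; a deliberate simplification
# that ignores subqueries, CTEs, and quoted identifiers.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def summarize_query_log(entries):
    """Take (user, sql) pairs and return (table_counts, users_by_table)."""
    table_counts = Counter()
    users_by_table = {}
    for user, sql in entries:
        for table in TABLE_PATTERN.findall(sql):
            name = table.lower()
            table_counts[name] += 1
            users_by_table.setdefault(name, Counter())[user] += 1
    return table_counts, users_by_table


# Hypothetical log entries for illustration.
log = [
    ("ana", "SELECT * FROM sales.orders o JOIN sales.customers c ON o.cid = c.id"),
    ("ben", "SELECT count(*) FROM sales.orders"),
    ("ana", "SELECT region FROM sales.orders WHERE total > 100"),
]

counts, users = summarize_query_log(log)
print(counts.most_common(1))                 # the most-queried table
print(users["sales.orders"].most_common(1))  # its heaviest user, a possible steward
```

Even a crude count like this hints at which datasets to catalog first and who might be a natural steward for them.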

API and Open Connector SDKs

Alation comes configured to work with a wide range of data sources, such as Snowflake, Redshift, and Google BigQuery, to name just a few. Beyond the pre-built connectors, Alation allows users to create their own connectors to other data sources.

Automated Business Glossary

Alation will automatically suggest popular business terms to incorporate into your Business Glossary. This will help accelerate the building of your own Business Glossary and allow you to focus on areas unique to your company.

Crowdsourcing and Wikis

Alation offers a variety of ways people can contribute their knowledge to the data catalog. One way is by providing tools for users to create wiki articles that share their knowledge with others. Collaboration tools such as a conversation inbox, trust flags, and dashboards for Stewards can also help further define your business’s metadata.

Alation TrustCheck

When a user is working with the data catalog, TrustCheck can deliver information about the data in use. It can let the user know about the quality and age of the data as well as any policies attached to it. This can help users avoid wasting time on bad data and make sure they are working with the correct data according to company policies, all in real time.

Informatica

Informatica offers a wide range of tools for your data needs. Besides a data catalog, they offer tools for master data management, data governance, and data quality to name just a few. Here are a few interesting features from their data catalog option, Enterprise Data Catalog (EDC):

Metadata Knowledge Graph

EDC maintains a knowledge graph alongside your metadata. This graph will capture relationships between your data assets and help to determine non-obvious relationships. These possible new connections could help clean up duplicate data or build a more comprehensive picture of data valuable to you.

Automatic Classification with Intelligent Domain and Entity Recognition

EDC will automatically identify entities and domains, like customers or products. This automation will allow users to easily search and filter through metadata that may not have reached a Data Steward yet. The automation tools will also recommend connections to business glossary terms. Outside of the established 60+ domains already included in EDC, custom domains and rules can be set up to accelerate the metadata journey for your business.

Crowdsourcing

EDC offers many collaboration options, including the ability to rate and review datasets inside the data catalog. It also offers the option to ask and answer questions in a way that all users can see and participate in.

Lineage

Informatica offers advanced scanners to gain a comprehensive view of the lineage of your data.  These scanners are designed to work with different ETL tools, extract data from stored procedures, and even scan custom code. EDC also offers visualizations of this lineage to clearly see how data flows through your system to get to its final destination.

In these visualizations, you can easily pinpoint a data point that is important to you, and figure out what can be done to build and expand on it. The lineage tool can also offer a downstream impact summary, showing who or what might be impacted by changes or outages for this dataset.
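Conceptually, a downstream impact summary is a walk through the lineage graph. The Python sketch below shows that idea with a breadth-first traversal. It is not how EDC works internally, and the asset names and the dictionary-based graph are hypothetical.

```python
from collections import deque


def downstream_impact(lineage, start):
    """Return every asset reachable downstream of `start`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            # Skip assets already visited, and the start node if a cycle leads back to it.
            if child not in seen and child != start:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage: each key feeds the assets in its list.
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.daily_sales", "mart.customer_ltv"],
    "mart.daily_sales": ["dashboard.revenue"],
}

print(downstream_impact(lineage, "staging.orders"))
```

If staging.orders has an outage, everything in the returned list is potentially affected, which is exactly the question a lineage tool helps you answer before making a change.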

Collibra

Collibra has a Data Intelligence Cloud platform that incorporates its many products, including tools for quality, lineage, governance, privacy, and data catalog.  Here are a few of the interesting features we found from its data catalog offering:

Data Intelligence Cloud

Collibra’s product offerings are SaaS tools. This means all updates and maintenance of the tools will fall to Collibra.

Edge

Collibra has something called an Edge component. This component ensures that your data and its analysis stay inside your company’s firewall. Only metadata or anonymized data is sent to the data catalog in Collibra’s Data Intelligence Cloud.

Privacy Stewards

Collibra has tools and dashboards specifically built for a Privacy Steward on your team. A Privacy Steward’s main responsibility is to ensure that your company’s and your customers’ data is protected. One of the ways Collibra does this is through Privacy Classification Dashboards, which categorize sensitive or private data. When categorizing data in a data catalog, these terms can be tagged automatically during normal processing or tagged when processed by a Data Steward.
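As a rough illustration of what automatic tagging can look like, the Python sketch below flags columns whose names suggest sensitive data. It is not Collibra's classification logic. The rule names and patterns are hypothetical, and real classifiers typically inspect the data values as well as the column names.

```python
import re

# Hypothetical name-based rules; a real rule set would be far more thorough.
RULES = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "government_id": re.compile(r"ssn|passport", re.IGNORECASE),
}


def classify_columns(columns):
    """Map each column name to the privacy tags whose pattern it matches."""
    tags = {}
    for column in columns:
        matched = [label for label, pattern in RULES.items() if pattern.search(column)]
        if matched:
            tags[column] = matched
    return tags


print(classify_columns(["customer_email", "order_total", "home_phone", "ssn"]))
```

Columns the rules miss, like order_total here, stay untagged until a Data Steward reviews them, which mirrors the two tagging paths described above.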

In Closing

In this blog, we covered just a few of the most popular data catalog tools and offered our take on the features that stood out to us. Your data catalog journey may be different, but if you invest the time and resources, your data catalog can be an invaluable part of your data culture.

Your best way forward is to know what your company needs from a data catalog and contact the vendors for demos and presentations. If you need additional information or help, our data experts are happy to lend a hand!

Need Expert Help Making Your Data Catalog a Success?

phData has years of experience helping businesses of all shapes and sizes unlock more value from their data. Whether you need help building an actionable data strategy or advice on how to make your data catalog a smashing success, our data experts would love to help!
