Onebridge

With organizations dealing with larger and larger amounts of data, there has been increasing momentum toward creating a “modern data stack” of tools to manage it. The good news is that this new stack is fast, scalable, and does not require as much overhead – Snowflake for cloud data warehousing is a good example. But when it comes to governance, trust, and context, metadata is needed. That is where data cataloging can serve as a powerful vehicle for data democratization and governance within the modern data stack.

What is a data catalog?

A data catalog’s primary function is to inventory the data assets of an organization. This inventory is often augmented with metadata, domain knowledge, and search tools. Data catalogs allow users to find the data they need, understand its context, and evaluate whether it suits their intended use. They can be a critical component of an organization’s data governance. Data cataloging also goes by other names, such as metadata management, knowledge sharing, and data discovery.
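
To make the idea of an inventory augmented with metadata and search concrete, here is a minimal Python sketch. It is not how any particular catalog product works; the `CatalogEntry` fields, the `search` helper, and the sample entries are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One inventoried data asset (here, a single column). All fields are illustrative."""
    database: str
    table: str
    column: str
    data_type: str
    description: str = ""          # domain knowledge written by a person
    owner: str = ""                # who to ask about this asset
    tags: list[str] = field(default_factory=list)


def search(entries, keyword):
    """Return entries whose table, column, description, or tags mention the keyword."""
    needle = keyword.lower()
    hits = []
    for e in entries:
        haystack = " ".join([e.table, e.column, e.description, *e.tags]).lower()
        if needle in haystack:
            hits.append(e)
    return hits


# Hypothetical sample inventory.
catalog = [
    CatalogEntry("sales", "orders", "order_total", "decimal",
                 description="Order amount in USD, including tax.",
                 owner="finance-team", tags=["revenue"]),
    CatalogEntry("sales", "customers", "email", "varchar",
                 description="Primary contact email.", tags=["pii"]),
]

for entry in search(catalog, "revenue"):
    print(f"{entry.database}.{entry.table}.{entry.column}: {entry.description}")
```

A real catalog adds lineage, access control, and automated harvesting on top of this, but the core remains the same: a list of assets, human-written context, and a way to search both.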

What can it do?

Today’s users are more diverse than ever, including everyone from data scientists to product managers and business analysts. Each of these workers has a different way of approaching data, which can make collaboration difficult. Businesses need a cataloging solution that keeps up with innovation and advances in today’s modern data stack to serve all different types of users.

Plus, data management, searching, inventory, and evaluation all depend on metadata collection. This is more difficult today because metadata is so varied. Modern data catalogs focus on the available datasets and automatically connect them with rich information for the user, making analysis easier for everyone.

Key benefits of data cataloging include:

Improve analyst efficiency – less time spent on trial and error.
More detailed context – understand data’s relationships.
Reduce risk of error – communicate nuances and exceptions.
Better data analysis – give users confidence in their analysis.

Why do we need it?

Year by year, your data assets grow in size. Proper documentation of these assets, and insight into them, would seem to be a critical requirement. Yet I have consulted for many companies and have yet to encounter a functioning data catalog. Here are two examples where a data catalog could have helped me.

I was building an executive dashboard for a client and the product owner had a question about a chart that showed incurred cost over time. Calculating cost in healthcare is complex and they wanted to know what exactly those numbers represented. This was a bit of a problem as all I could see in the database was a column called Total Cost with a data type of decimal. It took two days of emails, Teams chats, and meetings before I got an answer. Ideally, I would go to a data catalog, search for Total Cost, and get the information I needed in seconds.

Another example came from a manufacturing client. They wanted to add the supplier phone number to a report. Easy enough, right? The problem was that no one knew exactly where the data was. They had dozens of databases, with hundreds of tables and thousands of columns. It was a needle in the data haystack. We eventually found multiple possible candidates. The next questions were:

Which is the right one?
Is the data reliable?
How do I merge this back into my main data set?
Does this scenario sound familiar?
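
The first step of that search, finding candidate columns across several databases, is the kind of thing a catalog automates. Below is a rough Python sketch using in-memory SQLite databases as stand-ins for the client's systems. The database, table, and column names are invented for the example, and a real catalog would read each platform's own system tables rather than SQLite's.

```python
import sqlite3


def harvest_columns(name, conn):
    """Yield (database, table, column, type) for every column in one SQLite database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Table names come straight from sqlite_master, so this sketch
        # interpolates them directly. PRAGMA table_info returns rows of
        # (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
            yield (name, table, column, col_type)


def find_candidates(inventory, keyword):
    """Return inventory rows whose column name contains the keyword."""
    return [row for row in inventory if keyword.lower() in row[2].lower()]


# Two hypothetical databases standing in for the client's dozens.
purchasing = sqlite3.connect(":memory:")
purchasing.execute("CREATE TABLE supplier (id INTEGER, name TEXT, phone TEXT)")
erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE vendor_contact (vendor_id INTEGER, phone_number TEXT, fax TEXT)")
erp.execute("CREATE TABLE plant (id INTEGER, city TEXT)")

inventory = []
for db_name, conn in {"purchasing": purchasing, "erp": erp}.items():
    inventory.extend(harvest_columns(db_name, conn))

for database, table, column, col_type in find_candidates(inventory, "phone"):
    print(f"{database}.{table}.{column} ({col_type})")
```

Note that this only narrows the haystack. Deciding which candidate is correct and reliable still depends on the descriptions and ownership information that people add to the catalog.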

The problem is not that the software does not exist. There are at least six open-source projects and a dozen commercial offerings. The problem is usually some combination of the following:

Management does not see the importance of a data catalog to their data governance strategy.
Leadership is reluctant to invest the necessary time and money.
A catalog was built, but people stopped updating it, thus it grew stale and untrusted.
Catalogs are siloed across multiple systems and departments.
Users don't know that one exists or they do not have the proper training.
Subject matter experts are reluctant to share knowledge because holding it gives them perceived power.
The catalog provides only basic metadata (table name, column name, data type) rather than useful, clearly understandable descriptions.
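
Two of these failure modes, stale entries and entries with no real description, can at least be detected automatically. The following Python sketch shows one possible audit. The `ColumnDoc` structure, the 180-day review threshold, and the sample column names and dates are all assumptions made for illustration, not a recommendation.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ColumnDoc:
    name: str            # e.g. "claims.total_cost" (hypothetical)
    description: str     # human-written explanation, may be empty
    last_reviewed: date  # when someone last confirmed the entry


def audit(docs, today, max_age_days=180):
    """Return {name: [problems]} for entries that are undocumented or stale."""
    problems = {}
    for doc in docs:
        issues = []
        if not doc.description.strip():
            issues.append("missing description")
        if today - doc.last_reviewed > timedelta(days=max_age_days):
            issues.append("stale")
        if issues:
            problems[doc.name] = issues
    return problems


# Hypothetical catalog entries.
docs = [
    ColumnDoc("claims.total_cost", "", date(2024, 1, 15)),
    ColumnDoc("claims.member_id", "Unique member identifier.", date(2023, 1, 10)),
    ColumnDoc("claims.service_date", "Date the service was rendered.", date(2024, 5, 1)),
]

for name, issues in audit(docs, today=date(2024, 6, 1)).items():
    print(f"{name}: {', '.join(issues)}")
```

A report like this tells you where the gaps are. Closing them still requires someone with domain knowledge to write and review the descriptions.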

Note that most of these are problems that technology alone can't solve. I’m not saying technology doesn’t matter. Good software can make the process much easier, but a catalog is at risk of failure if it is not paired with domain expertise, a supportive company culture, and management buy-in. No silver bullets. Just hard work and coordination.

This is a problem that plagues companies of every size. For example, many big tech companies have made large investments in the metadata and knowledge sharing problem:

LinkedIn DataHub
Lyft Amundsen
WeWork Marquez
Airbnb Dataportal
Spotify Lexikon
Netflix Metacat
Uber Databook
Apache Atlas

Most of these tech companies have open sourced their data catalogs and made them freely available. There is also a growing number of commercial offerings.

A good data catalog can provide outsized benefits for your company, but it can also be a significant undertaking. If you need help navigating the metadata/data cataloging world, give us a call. We can help you design the best solution, tailored closely to your unique needs and current data stack. We also have a variety of flexible engagement options, so you can get the help you need exactly when and how you need it.

About the Author:

Bradley Nielsen

Senior Tech Specialist

Bradley is a well-rounded developer in the field of data science and analytics. He has been a developer and architect on a wide range of data initiatives in multiple industries. Bradley's primary specialty is in data engineering: developing, deploying, and supporting data pipelines for big data and data science. He is proficient in Python, C#, SQL Server, Apache Spark, Snowflake, Docker, and Azure.

Data Catalogs for Metadata Management