With organizations dealing with larger and larger amounts of data, there has been increasing momentum toward creating a “modern data stack” of tools to manage it. The good news is that this new stack is fast, scalable, and does not require as much overhead – Snowflake for cloud data warehousing is a good example. But when it comes to governance, trust, and context, metadata is needed. That is where data cataloging can serve as a powerful vehicle for data democratization and governance within the modern data stack.
What is a data catalog?
A data catalog’s primary function is to inventory the data assets of an organization. This inventory is often augmented with metadata, domain knowledge, and search tools. Data catalogs allow users to find the data they need, understand its context, and evaluate whether it suits their intended use. They can be a critical component of an organization’s data governance. Data cataloging also goes by other names, such as metadata management, knowledge sharing, and data discovery.
What can it do?
Today’s users are more diverse than ever, including everyone from data scientists to product managers and business analysts. Each of these workers has a different way of approaching data, which can make collaboration difficult. Businesses need a cataloging solution that keeps up with innovation and advances in today’s modern data stack to serve all different types of users.
Plus, data management, searching, inventory, and evaluation all depend on metadata collection. This is more difficult today because our metadata is so varied. Modern data catalogs focus on the available datasets and automatically connect them with rich information for the user, making analysis easier for everyone.
Key benefits of data cataloging include:
- Improve analyst efficiency – less time spent on trial and error.
- More detailed context – understand data’s relationships.
- Reduce risk of error – communicate nuances and exceptions.
- Better data analysis – give users confidence in their analysis.
Why do we need it?
Year by year, your data assets grow exponentially in size. Proper documentation of these assets, and insight into them, would seem to be a critical requirement. Even so, I have consulted for many companies and have yet to encounter a functioning data catalog. Here are two examples where a data catalog could have helped me.
I was building an executive dashboard for a client and the product owner had a question about a chart that showed incurred cost over time. Calculating cost in healthcare is complex and they wanted to know what exactly those numbers represented. This was a bit of a problem as all I could see in the database was a column called Total Cost with a data type of decimal. It took two days of emails, Teams chats, and meetings before I got an answer. Ideally, I would go to a data catalog, search for Total Cost, and get the information I needed in seconds.
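To make that "search and get an answer in seconds" idea concrete, here is a minimal sketch of what a catalog lookup might look like. Everything in it is an assumption for illustration: the `CatalogEntry` fields, the `search` helper, and the database, table, and owner names are hypothetical placeholders, not any real product's API or the client's actual schema. The point is only that a catalog entry carries a description and an owner alongside the column name and data type.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """Catalog metadata for one column (all fields are illustrative)."""
    database: str
    table: str
    column: str
    data_type: str
    description: str = ""  # the business definition a bare schema lacks
    owner: str = ""        # who to ask when the description is not enough
    tags: List[str] = field(default_factory=list)


def search(entries, term):
    """Return entries whose column name, description, or tags mention the term."""
    needle = term.lower().replace("_", " ")
    hits = []
    for entry in entries:
        # Treat underscores as spaces so "Total Cost" matches "total_cost".
        haystack = " ".join(
            [entry.column.replace("_", " "), entry.description, *entry.tags]
        ).lower()
        if needle in haystack:
            hits.append(entry)
    return hits


# Hypothetical entries; names and descriptions are placeholders.
CATALOG = [
    CatalogEntry(
        database="claims_dw",
        table="claim_summary",
        column="total_cost",
        data_type="decimal",
        description="Placeholder: the agreed business definition goes here.",
        owner="finance-data-team",
        tags=["cost", "executive dashboard"],
    ),
    CatalogEntry(
        database="claims_dw",
        table="claim_summary",
        column="member_id",
        data_type="integer",
        description="Placeholder: identifier for the covered member.",
        owner="enrollment-team",
    ),
]

if __name__ == "__main__":
    for hit in search(CATALOG, "Total Cost"):
        print(f"{hit.database}.{hit.table}.{hit.column} ({hit.data_type})")
        print(f"  {hit.description}")
        print(f"  owner: {hit.owner}")
```

A real catalog would sit on top of a search index and pull the basic metadata automatically, but the description and owner fields are the part someone has to write down and keep current.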
Another example came from a manufacturing client. They wanted to add a supplier phone number to a report. Easy enough, right? The problem was that no one knew exactly where the data was. They had dozens of databases, with hundreds of tables and thousands of columns. A needle in the data haystack. We eventually found multiple possible candidates. The next questions were:
- Which is the right one?
- Is the data reliable?
- How do I merge this back into my main data set?
Does this scenario sound familiar?
The problem is not that the software does not exist. There are at least six open-source projects and a dozen commercial offerings. The problem is usually some combination of the following:
- Management does not see the importance of a data catalog to their data governance strategy.
- Leadership is reluctant to invest the necessary time and money.
- A catalog was built, but people stopped updating it, so it grew stale and untrusted.
- Catalogs are siloed across multiple systems and departments.
- Users don't know that one exists or they do not have the proper training.
- Subject matter experts are reluctant to share knowledge because holding on to it gives them perceived power.
- The catalog provides only basic metadata (table name, column name, data type) rather than useful, clearly understandable descriptions.
Note that most of these are problems that technology alone can't solve. I’m not saying technology doesn’t matter. Good software can make the process much easier, but catalogs are at risk of failure if not paired with domain expertise, a supportive company culture, and management buy-in. No silver bullets. Just hard work and coordination.
This is a problem that plagues companies of every size. For example, many big tech companies have invested heavily in solving the metadata and knowledge sharing problem:
- LinkedIn DataHub
- Lyft Amundsen
- WeWork Marquez
- Airbnb Dataportal
- Spotify Lexikon
- Netflix Metacat
- Uber Databook
- Apache Atlas
Most of these tech companies have open sourced their data catalogs and made them freely available. Plus, there are a growing number of commercial offerings:
- Alteryx Connect
- Azure Purview
- GCP Data Catalog
- IBM Watson Knowledge Catalog
- Informatica Enterprise Data Catalog
- Qlik Catalog
- Tableau Catalog
A good data catalog can provide outsized benefits for your company, but it can also be a significant undertaking. If you need help navigating the metadata/data cataloging world, give us a call. We can help you design the best solution, tailored closely to your unique needs and current data stack. We also have a variety of flexible engagement options, so you can get the help you need exactly when and how you need it.