With organizations dealing with larger and larger amounts of data, there has been increasing momentum toward creating a “modern data stack” of tools to manage it. The good news is that this new stack is fast, scalable, and does not require as much overhead – Snowflake for cloud data warehousing is a good example. But when it comes to governance, trust, and context, metadata is needed. That is where data cataloging can serve as a powerful vehicle for data democratization and governance within the modern data stack.
A data catalog’s primary function is to inventory the data assets of an organization. This inventory is often augmented with metadata, domain knowledge, and search tools. Data catalogs allow users to find the data they need, understand its context, and evaluate whether it fits their intended use. They can be a critical component of an organization’s data governance. Data cataloging also goes by other names, such as metadata management, knowledge sharing, and data discovery.
Today’s users are more diverse than ever, including everyone from data scientists to product managers and business analysts. Each of these workers has a different way of approaching data, which can make collaboration difficult. Businesses need a cataloging solution that keeps up with advances in the modern data stack and serves all of these types of users.
Plus, data management, searching, inventory, and evaluation all depend on metadata collection. This is more difficult today because our metadata is so varied. Modern data catalogs focus on the available datasets and automatically connect them with rich information for the user, making analysis easier for everyone.
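To make the idea of automatic metadata collection concrete, here is a minimal sketch in Python. It assumes a SQLite database purely for convenience; the `harvest_columns` function, the `demo_db` source name, and the `orders` and `customers` tables are all hypothetical and not taken from any particular catalog product. Real catalogs do far more (lineage, usage statistics, ownership), but the core step of reading schema information into an inventory looks roughly like this:

```python
import sqlite3


def harvest_columns(conn, source_name):
    """Return one catalog entry (a dict) per column found in a SQLite database."""
    entries = []
    # sqlite_master lists every table defined in the database.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            entries.append(
                {
                    "source": source_name,
                    "table": table,
                    "column": column,
                    "type": col_type,
                }
            )
    return entries


# Hypothetical example database, created in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount DECIMAL)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")

catalog = harvest_columns(conn, "demo_db")
for entry in catalog:
    print(entry)
```

A harvest like this only captures technical metadata (names and types). The business context still has to come from the people who own the data.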
Key benefits of data cataloging include:
Year by year, your data assets grow exponentially in size. Having proper documentation and insight into these assets seems like it would be a critical requirement. I have consulted for many companies and have yet to encounter a functioning data catalog. Here are two examples where a data catalog could have helped me.
I was building an executive dashboard for a client, and the product owner had a question about a chart that showed incurred cost over time. Calculating cost in healthcare is complex, and they wanted to know exactly what those numbers represented. This was a bit of a problem, as all I could see in the database was a column called Total Cost with a data type of decimal. It took two days of emails, Teams chats, and meetings before I got an answer. Ideally, I would have gone to a data catalog, searched for Total Cost, and gotten the information I needed in seconds.
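As a rough illustration of what that lookup could look like, here is a small Python sketch. Everything in it is an assumption made for the example: the `CatalogEntry` fields, the `search` helper, the source and table names, and the description text are hypothetical placeholders, not the client's actual data or any specific product's API.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    source: str
    table: str
    column: str
    data_type: str
    description: str = ""
    owner: str = ""


def normalize(text):
    """Lowercase and treat underscores as spaces so 'Total Cost' matches 'total_cost'."""
    return text.lower().replace("_", " ")


def search(catalog, term):
    """Return entries whose column name or description contains the search term."""
    needle = normalize(term)
    return [
        entry
        for entry in catalog
        if needle in normalize(entry.column) or needle in normalize(entry.description)
    ]


# Hypothetical catalog contents, for illustration only.
catalog = [
    CatalogEntry(
        source="claims_db",
        table="claims",
        column="total_cost",
        data_type="decimal",
        description="Hypothetical definition text written by the data owner.",
        owner="finance-team",
    ),
    CatalogEntry(
        source="claims_db",
        table="claims",
        column="claim_id",
        data_type="integer",
        description="Unique identifier for a claim.",
        owner="finance-team",
    ),
]

results = search(catalog, "Total Cost")
for entry in results:
    print(f"{entry.source}.{entry.table}.{entry.column}: {entry.description} (owner: {entry.owner})")
```

The search itself is trivial. The value comes from the description and owner fields already being filled in, which is the part that takes two days when no catalog exists.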
Another example came from a manufacturing client. They wanted to add a supplier phone number to a report. Easy enough, right? The problem was that no one knew exactly where the data was. They had dozens of databases, with hundreds of tables and thousands of columns. Needle in the data haystack. We eventually found multiple possible candidates. The next questions were:
The problem is not that the software does not exist. There are at least six open-source projects and a dozen commercial offerings. The problem is usually some combination of the following:
Note that most of these are problems that technology alone can’t solve. I’m not saying technology doesn’t matter. Good software can make the process much easier, but catalogs are at risk of failure if they are not paired with domain expertise, a supportive company culture, and management buy-in. No silver bullets. Just hard work and coordination.
This is a problem that plagues companies of every size. For example, many big tech companies have made large investments in the metadata and knowledge sharing problem:
Most of these tech companies have open-sourced their data catalogs and made them freely available. Plus, there is a growing number of commercial offerings:
A good data catalog can provide outsized benefits for your company, but it can also be a significant undertaking. If you need help navigating the metadata/data cataloging world, give us a call. We can help you design the best solution, tailored closely to your unique needs and current data stack. We also have a variety of flexible engagement options, so you can get the help you need exactly when and how you need it.