The Data Planet - Vol. 12

November 18, 2021

Whether it’s recent news or just new to you, The Data Planet serves up fascinating insights and resources about the data analytics and BI world every month.

Our snack-size summaries skip straight to the point.

Happy Thanksgiving! This month’s Data Planet is a bit different. We’ll get more technical than usual, taking a look at the unassuming yet valuable hash function. On a separate subject, we’ll also point you to a great podcast about distributed databases and containerization. There are lots of resources on these topics, so check them out.

This edition of the Data Planet includes:

  • A Look at the Hash Function: Used and Misused in Data Systems Everywhere
    Examples: The Hash Function in Cryptography & Data Comparison
  • CockroachDB: Distributed Databases and Containerization

Used And Misused In Data Systems: Focus On The Hash Function

It’s very important to understand the hash function because using it incorrectly will yield disastrous results. You’re probably already familiar with hash functions, as you see them all over the place in data systems. They’re humble, but incredibly useful. Simply put, a hash function takes an arbitrary value and maps it to a number drawn from a finite set. Let’s look at two common examples of where this would be useful.
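To make the definition concrete, here is a minimal Python sketch that maps strings of any length to one of a fixed number of integers. The function name bucket_for and the choice of 16 buckets are assumptions made for illustration only.

```python
import hashlib


def bucket_for(value: str, num_buckets: int = 16) -> int:
    """Map an arbitrary string to an integer in the range 0 .. num_buckets - 1."""
    # SHA-256 always produces 32 bytes, no matter how long the input is.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    # Interpret the digest as one big integer, then fold it into the bucket range.
    return int.from_bytes(digest, "big") % num_buckets


if __name__ == "__main__":
    for name in ["alice", "bob", "a much longer value than the other two"]:
        print(name, "->", bucket_for(name))
```

The same input always lands in the same bucket, which is what makes hash functions useful for lookups, partitioning, and comparisons.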

Cryptography

Let’s say you have a table full of user names and passwords. Storing this in plain text is obviously problematic because a hacker could run off with the entire table. You could encrypt the table, but if the key gets compromised, you’re back to the problem of a plain-text table. Not only that, it still allows admins and developers full access to everyone’s passwords. The solution is to use a one-way hash function.

A one-way function quickly converts a value like ‘password1’ to a number like -8365479907532038635, but it’s very difficult to go from the number back to the value. One-way functions aren’t truly one way. With enough computing power, an attacker could reverse the hash by hashing guess after guess until one matches. Longer and more complex passwords take longer to crack than shorter ones. That’s why websites demand minimum password length and complexity (not just to annoy you).

A 12-character password hashed with a modern cryptographic hash function would take a supercomputer thousands of years to crack. So, for practical purposes, a good password is basically uncrackable.

If you do find yourself in the situation of needing to hash passwords (also useful for places that use social security numbers as a primary key in the database 🤮), there are other things to consider. A naive implementation can still end up leaking information. Here are two in-depth articles that walk you through the details.
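One common way a naive implementation leaks information is that identical passwords produce identical hashes, so anyone reading the table can see which users share a password. A minimal sketch of the usual countermeasure, a random per-user salt combined with a deliberately slow key-derivation function, might look like the following. It uses PBKDF2 from Python's standard library; the iteration count and the function names are assumptions for illustration, not a vetted production configuration.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 200_000  # assumed work factor; tune it for your own hardware


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest). Store both; the salt is not a secret."""
    salt = os.urandom(16)  # a fresh random salt for every password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)
```

Because each call draws a new salt, hashing ‘password1’ twice gives two different digests, yet verify_password still accepts the right password for either one.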

Deep dive into storing passwords in a database

Learn how Secure Hash Algorithm 2 (the most popular hashing algorithm) works

Data Comparison

You’ve downloaded a large amount of source code from the internet. How do you ensure that it hasn’t been corrupted or tampered with in transit? Use a hash function! A 1-gigabyte document can be condensed to a 32-byte number. Changing a single character in it will result in a different value. The website will post a hash value, and you run the hash on your end. If they match, you know you’re good to go.
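As a rough sketch of that check in Python, the helper below reads a file in chunks (so a multi-gigabyte download never has to fit in memory) and compares its SHA-256 digest with the value the website published. The function names are hypothetical, and SHA-256 is assumed here because it yields the 32-byte value mentioned above; use whichever algorithm the site actually publishes.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 digest of a file as a 64-character hex string."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns empty bytes at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_hash(path: str, published_hex: str) -> bool:
    """Compare the local digest with the hex value posted by the website."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

A call such as matches_published_hash("download.zip", "<hash copied from the site>") returns True only if every byte of the file is intact.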

Data Vault 2.0 uses hash functions in this way. In DV 2.0, hash functions are used to combine numerous columns into one for comparison purposes. Primary keys and Type II dimensions are common use cases. (Data Vault 2.0 and Hash Keys).
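To illustrate the idea of collapsing several columns into one comparable value, here is a hedged Python sketch. The delimiter, the trim-and-uppercase normalization, and the use of SHA-256 are assumptions chosen for the example; they are not taken from the Data Vault 2.0 standard, so check your own modeling conventions before copying them.

```python
import hashlib

DELIMITER = "||"  # assumed separator; it keeps ("ab", "c") distinct from ("a", "bc")


def hash_key(*columns) -> str:
    """Combine several column values into a single hex hash key."""
    # Normalize each value so cosmetic differences (case, padding) do not change the key.
    normalized = ["" if col is None else str(col).strip().upper() for col in columns]
    joined = DELIMITER.join(normalized)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Two rows are treated as the same record if their hash keys match.
    print(hash_key("Acme Corp", "US", 42))
    print(hash_key(" acme corp ", "us", "42"))
```

Comparing one hash column instead of many source columns keeps joins and change detection simple, at the cost of the collision risk discussed next.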

However, there is a drawback to this method. Theoretically, two different values can compute to the same hash key. This is known as a collision, and collisions are more likely than you would think due to the birthday problem.

For example, if you have a 32-bit hash function (about 4.3 billion hash values), you only need about 77,000 rows to get a 50% chance of a collision.

If you have a 256-bit hash function (about 10^77 hash values), you’d need on the order of 10^38 rows to get a 50% chance of a collision. (You’ll need a lot of hard drives.)
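These row counts can be estimated with the standard birthday-problem approximation, n ≈ sqrt(2 · N · ln(1 / (1 − p))), where N is the number of possible hash values and p is the desired collision probability. The short Python sketch below applies that approximation; the function name is made up for this example.

```python
import math


def rows_for_collision_chance(bits: int, probability: float = 0.5) -> float:
    """Approximate how many rows give the stated chance of at least one collision."""
    space = 2.0 ** bits  # number of distinct hash values, N
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - probability)))


if __name__ == "__main__":
    print(f"32-bit:  {rows_for_collision_chance(32):,.0f} rows")
    print(f"256-bit: {rows_for_collision_chance(256):.1e} rows")
```

For a 32-bit hash this works out to roughly 77,000 rows, and for a 256-bit hash to a figure on the order of 10^38.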

There is a tradeoff. Bigger keys have lower collision chances, but are slower to calculate and take up more space. Look at these valuable resources to learn more.

Hash functions in SQL Server

Hash functions in Python

Hash function in R

CockroachDB: Distributed Databases and Containerization

CockroachDB’s motto is “scale fast, survive anything, thrive everywhere.” What is CockroachDB? It’s a partially open-source, distributed, Docker/Kubernetes-native relational database. It is more of a competitor to PostgreSQL and SQL Server than to Snowflake and Azure Synapse. Cockroach Labs recently raised $150 million in funding, so expect an advertising/marketing push from them.

This podcast is an interview with Cockroach Labs CEO Spencer Kimball, who will explain more about distributed databases and containerization.

Listen to the podcast: Distributed databases and containerization

Read the podcast transcript

Knowledge is power, and you want that power at your fingertips.
Stay tuned for the next edition of the Data Planet.

Stay in Touch with Onebridge

Hey there! We hope you've noticed that none of our content is "gated," meaning we don't force you to provide your information in order to read our content. We work hard to provide valuable information to serve our audience and our clients, and we're proud of it.

If you'd like to be notified of new content, events, and resources from Onebridge, sign up for our newsletter here. After signing up, you'll get a profile link where you can tell us what topics you want to hear about. With Onebridge, you control your data.

Please follow us on social media to see upcoming events and other resources, like blogs, eBooks, and more!