Happy Thanksgiving! This month’s Data Planet is a bit different. We’ll get more technical than usual, taking a look at the unassuming yet valuable hash function. On a separate subject, we’ll also point you to a great podcast about distributed databases and containerization. There are lots of resources on these topics, so check them out.
It’s very important to understand the hash function because using it incorrectly will yield disastrous results. You’re probably already familiar with hash functions, as you see them all over the place in data systems. They’re humble, but incredibly useful. Simply put, a hash function takes an arbitrary value and maps it to one number out of a finite set of numbers. Let’s look at two common examples of where this would be useful.
Let’s say you have a table full of user names and passwords. Storing this in plain text is obviously problematic because a hacker could run off with the entire table. You could encrypt the table, but if the key gets compromised, you’re back to the problem of a plain-text table. Not only that, it still allows admins and developers full access to everyone’s passwords. The solution is to use a one-way hash function.
A one-way function quickly converts a value like ‘password1’ to a number like -8365479907532038635, but it’s very difficult to go from the number back to the value. One-way functions aren’t truly one way. With enough computing power, an attacker can keep guessing values until one produces the same hash. Longer and more complex passwords take longer to crack than shorter ones. That’s why websites demand minimum password length and complexity (not just to annoy you).
A strong 12-character password hashed with a modern cryptographic hash function could take a supercomputer thousands of years to crack. So, for practical purposes, a good password is basically uncrackable.
If you do find yourself in the situation of needing to hash passwords (also useful for places that use social security numbers as a primary key in the database 🤮), there are other things to consider. A naive implementation can still end up leaking information. Here are two in-depth articles that walk you through the details.
Deep dive into storing passwords in a database
Learn how Secure Hash Algorithm 2 (the most popular hashing algorithm) works
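To make the "other things to consider" a little more concrete, here is a minimal Python sketch of salted password hashing using only the standard library. It is an illustration, not a recommendation from the articles above: the function names `hash_password` and `verify_password`, the choice of PBKDF2 with SHA-256, and the iteration count are all assumptions made for this example. The random salt is one way to avoid the leak where two users with the same password end up with the same stored hash.

```python
import hashlib
import hmac
import os

# Assumed work factor for this sketch; tune it for your own hardware.
ITERATIONS = 200_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) to store; the password itself is never stored."""
    salt = os.urandom(16)  # fresh random salt for every user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the login attempt with the stored salt and compare digests."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(digest, expected)
```

In this sketch the table would hold the salt and the digest per user. Hashing ‘password1’ twice gives two different digests because each call draws a new salt.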
You’ve downloaded a large amount of source code from the internet. How do you ensure that it hasn’t been corrupted or tampered with in transit? Use a hash function! A 1-gigabyte document can be condensed to a 32-byte number. Changing a single character in it will result in a different value. The website will post a hash value, and you run the hash on your end. If they match, you know you’re good to go.
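Here is a rough sketch of that check in Python, assuming the website publishes a SHA-256 value as a hex string (SHA-256 produces the 32-byte result mentioned above). The helper names `sha256_of_file` and `matches_published` are made up for this example.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a large download never sits in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published(path: str, published_hex: str) -> bool:
    """Compare the local hash with the value posted by the website."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

You would call `matches_published` with the path of the downloaded file and the hash string copied from the download page; a `False` result means the file differs from what the site published.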
Data Vault 2.0 uses hash functions in this way. In DV 2.0, hash functions are used to combine numerous columns into one for comparison purposes. Primary keys and Type II dimensions are common use cases. (Data Vault 2.0 and Hash Keys).
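As a hedged illustration of combining several columns into one comparable value, the Python sketch below joins the columns with a delimiter and hashes the result. The `hash_key` name, the `||` delimiter, the trim-and-uppercase normalization, and the use of SHA-256 are assumptions for this example, not rules taken from the Data Vault 2.0 standard.

```python
import hashlib

# Assumed separator; it keeps ("a", "bc") and ("ab", "c") from colliding.
DELIMITER = "||"


def hash_key(*columns) -> str:
    """Combine any number of column values into one hex hash key."""
    parts = [
        "" if value is None else str(value).strip().upper()
        for value in columns
    ]
    joined = DELIMITER.join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Comparing two rows then becomes a comparison of two fixed-length strings, for example `hash_key("Smith", "Seattle", 42)` against the key stored for the existing row.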
However, there is a drawback to this method. Theoretically, two different values can compute to the same hash key. This is known as a collision, and collisions are more likely than you would think due to the birthday problem.
For example, if you have a 32-bit hash function (about 4.3 billion hash values), you only need about 77,000 rows to get a 50% chance of a collision.
If you have a 256-bit hash function (about 10^77 hash values), you’d need roughly 10^38 rows to get a 50% chance of a collision. (You’ll need a lot of hard drives.)
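These row counts can be checked with the usual birthday-problem approximation, n ≈ sqrt(2 · N · ln(1 / (1 − p))), where N is the number of possible hash values and p is the target collision probability. The short Python sketch below applies it; the function name `rows_for_collision_probability` is made up for this example, and the formula is an approximation rather than an exact count.

```python
import math


def rows_for_collision_probability(bits: int, p: float = 0.5) -> float:
    """Approximate rows needed for probability p of at least one collision."""
    space = 2.0 ** bits  # number of possible hash values
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


print(round(rows_for_collision_probability(32)))     # roughly 77,000
print(f"{rows_for_collision_probability(256):.1e}")  # on the order of 10^38
```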
There is a tradeoff. Bigger keys have lower collision chances, but are slower to calculate and take up more space. Look at these valuable resources to learn more.
CockroachDB’s motto is scale fast, survive anything, thrive everywhere. What is CockroachDB? It’s a partially open-source, distributed, Docker/Kubernetes-native relational database. It is more of a competitor to PostgreSQL and SQL Server than to Snowflake and Azure Synapse. Cockroach Labs recently raised $150 million in funding, so expect an advertising/marketing push from them.
This podcast is an interview with Cockroach Labs CEO Spencer Kimball, who will explain more about distributed databases and containerization.
Listen to the podcast: Distributed databases and containerization