In Progress
Unit 1, Lesson 1
In Progress

Hash Table

Video transcript & code

We've been working our way through a series on building classes to represent physical quantities. In the last episode, we talked about three different kinds of object equality.

There is a fourth type of equality that we need to understand. However, in preparing for that episode, I realized that there was some background knowledge that we needed to cover first. So today we're going to take a little side trip, and talk about hash tables.

A Ruby Hash object is part of a family of data structures variously known as "associative arrays", "maps", or "dictionaries". All of these names refer to a data structure which maps keys to values. There are various ways to implement such a structure. Ruby's built-in implementation is called Hash because of how it is implemented: it uses what is known as a "hash table".

Consider if we were to construct a very naive implementation of an associative array, using an array of pairs. Every time we tried to look up an entry, we would need to walk through the entire structure, comparing keys, until we found one whose key matched. It is easy to see that the more keys we added to this structure, the longer it would take to look up a given key.

assoc = [[:foo, 42], [:bar, 23]]
value = assoc.each do |entry|
  if entry.first == :bar
    break entry.last

Hash tables address this performance concern. The idea behind them goes like this:

What if there were a way to take every object used as a key, and somehow reduce it to a single integer? The integer wouldn't be guaranteed to be unique. But it would be calculated in such a way that any two differing objects would be highly unlikely to have the same integer equivalent.

This integer value is known as a hash value. Every object in Ruby has a hash value, discovered by sending it the hash message. Numbers, strings, symbols; they all have hash values.

123.hash                        # => -4326452433101730515
"hello".hash                    # => 1575747404428851756
"goodbye".hash                  # => 2103260368034954340
:foo.hash                       # => -924210406470135610

The way these values are derived are beyond the scope of what we'll be talking about today. For right now, there are just two important facts to be aware of about the algorithm Ruby uses for hashing. First, it is pseudo-random, which means that even closely-related values will likely have widely different hash values. And second, Ruby uses a fresh seed for the algorithm every time the Ruby process is started. So the hash value of a given value will be different every time the Ruby process is started. That's why all of the values here keep changing whenever the code is reevaluated.

We can easily see that the hash value for a given value is the same within a single run, however:

Let's pretend we are a Ruby hash object. Our make-believe Hash will be highly oversimplified. So simple in fact, it won't even have values. Just keys.

We know that any potential key object has a hash value. Given that information, we can construct a small number of "buckets". Each "bucket" is a list of hash entries. For this demonstration we'll make three of them, and call them b0, b1, and b2.

In order to decide which bucket a given entry will go into, we take the hash value of the key, divide it by three, and take the remainder, using the modulus operator. For any possible integer value, this will return either 0, 1, or 2. This gives us the number of the bucket to put the entry in.

As we mentioned previously, the specific hash values of objects will vary with every run of Ruby. In order to simplify this demonstration, let's construct three special String objects that have fixed hash values. We'll call them x, y, and z.

Object x's hash value modulo 3 is 0. Object y's hash value modulo 3 is 1. Object z's hash value modulo 3 is 0 again.

According to our rules, this means we need to put x and z in bucket 0, and y in bucket 1.

require "./feet"

b0 = []
b1 = []
b2 = [] # !> assigned but unused variable - b2

one =
two =

123.hash % 3                        # => 1
"hello".hash % 3                    # => 2
"goodbye".hash % 3                  # => 2
:foo.hash % 3                       # => 1
one.hash % 3                        # => 2
two.hash % 3                        # => 2

x = "x"
def x.hash; 333; end
y = "y"
def y.hash; 100; end
z = "z"
def z.hash; 999; end

x.hash % 3                      # => 0
y.hash % 3                      # => 1
z.hash % 3                      # => 0

b0 << x                         # => ["x"]
b1 << y                         # => ["y"]
b0 << z                         # => ["x", "z"]

Now let's say someone asks us to look up the entry filed under object x. We can take the hash value of x, modulo 3, and know that it would have been sorted into bucket 0.

What does all this buy us? Now that we know which bucket to look in, we've cut down our search space by two thirds! Instead of having to compare every single key to x, we only have to look through the contents of bucket 0 in order to find object x.

b0.detect{|entry| entry == x}

Ruby's actual Hash implementation is quite sophisticated. It ensures that as a hash grows in size, its number of buckets is regularly increased, and the entries redistributed among the buckets. As a result, even with very large hashes, the Hash only ever needs to perform full comparisons against a few keys before finding the one it's looking for.

So now we know why Ruby Hashes are called Hashes. In the next episode, we'll put this knowledge to use to help us decipher some puzzling behavior in our quantity objects. Until then, happy hacking!