Can someone offer an intuitive understanding of linear/quadratic probing and double hashing?


I'm reading through Introduction to Algorithms, and I'm having trouble grasping intuitively how linear probing, quadratic probing, and double hashing exactly work. I suspect my confusion lies within my hazy understanding of hashing itself, so I'd appreciate if anyone could clear up these areas and help me grasp the concepts. Here's what the textbook has to say about linear probing and quadratic probing:

[textbook excerpt describing linear probing and quadratic probing, image omitted]

What does it mean to "first probe T[h'(k)]?" Also, what is "then we wrap around to slots T[0],...?" I'm also confused as to what primary clustering means; in particular, the part that talks about "long runs of occupied slots [building] up..."

Any help would be great. Thank you.


In both cases, as you probably know, you have a universe of objects, $U$, and you wish to insert a number $n\le m$ of these objects into an array $T = T[0], T[1], \dots, T[m-1]$ with no more than one object in each array slot (commonly known as a bucket). One way to do this is to use a hash function $h(x)$ that maps $U$ into the set of array indices $\{0, 1, \dots, m-1\}$. The problem is that the size of $U$ is generally larger than $m$, the number of buckets in the array, so two different objects in your universe might be sent by $h$ to the same bucket, an event known as a collision. In other words, you might have different objects $x_1, x_2$ such that $h(x_1) = h(x_2)$. To handle situations like this, you need not only a hash function, but also a protocol to deal with collisions when they occur.
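To make the collision idea concrete, here is a tiny sketch of my own (the table size and hash function are illustrative choices, not from your notes):

```python
# Toy illustration: a table with m = 11 buckets and the simple
# hash function h(x) = x mod 11.
m = 11

def h(x):
    return x % m

# Two different keys can land in the same bucket -- a collision.
print(h(2))   # bucket 2
print(h(13))  # also bucket 2, since 13 mod 11 == 2
```

Any scheme for storing both 2 and 13 in this table must decide what to do about that shared bucket, which is exactly what the probing protocols below address.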

In linear probing, the protocol to insert an object $x$ into the array is to first look at the bucket at index $h(x)$, namely $T[h(x)]$, which is what your notes refer to as the "first probe". If that bucket, $T[h(x)]$ is already occupied by an object other than $x$, you then try to insert $x$ into $T[h(x)+1]$. If that doesn't work, you try to insert $x$ into $T[h(x)+2], T[h(x)+3]$, and so on, until you find a vacant bucket or reach the last slot, $T[m-1]$ in your array. What do you do if you come to the last slot and haven't found a place yet for $x$? A simple way is to do what your notes call "wrap around", namely continue searching, starting at the top bucket, $T[0]$ and continuing to search at $T[1], T[2], T[3]$ and so on, which explains the $\mod m$ in your notes. Unless the hash table is completely full, this strategy will always find an available slot for the object $x$.
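The probe-and-wrap-around procedure can be sketched in a few lines of Python (my own illustration; the table size, keys, and function names are assumptions, not from the book):

```python
# A minimal sketch of insertion with linear probing,
# using h(x) = x mod m as the hash function.
m = 11

def h(x):
    return x % m

def linear_probe_insert(table, x):
    """Insert x into table; return the number of probes used."""
    for i in range(m):
        slot = (h(x) + i) % m      # "+ i" is the linear step; "% m" wraps around
        if table[slot] is None:    # vacant bucket found
            table[slot] = x
            return i + 1           # probes used (the first probe is i = 0)
    raise RuntimeError("table is full")

table = [None] * m
for key in (2, 13, 24):            # all three keys hash to bucket 2
    linear_probe_insert(table, key)
print(table)                       # 2, 13, 24 end up in consecutive slots 2, 3, 4
```

Notice how the three colliding keys fill consecutive slots: that consecutive run is exactly the "cluster" discussed next.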

Visualize the array as a collection of boxes. Some of them are currently empty, so color them white. Some of them are already full, so color them black. Now think of inserting several more objects, using linear probing. Each attempt to insert an object into a black box will lead to a sequence of consecutive black boxes with your new element occupying the first white box after those, which you then color black. Each of these sequences of adjacent black boxes is known as a cluster and it's not hard to see that, first, these clusters will grow over time and, even worse, might bump into other already existing clusters, leading to even longer ones, all of which slow down the insertion process. The moral: big clusters lead to slow hashing, which defeats the whole purpose of this data structure.

There are several other protocols that lessen this clustering. One is quadratic probing, described in your notes. Without going into too much detail, this does roughly the same thing as linear probing, namely "keep probing (somehow) until you find an available slot". In linear probing the "somehow" is "at the current slot plus 1"; in quadratic probing, the "somehow" is "at another slot determined by a quadratic function". This still leads to clustering, but since the next slot you try isn't the next one in line, you're less likely to wind up with big clusters. Instead, you'll have a collection of smaller clusters, separated by collections of available buckets. In general, this will lead to more efficient hashing on the average.
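The same sketch adapted to quadratic probing looks like this; the $i$-th probe lands at $h(x) + c_1 i + c_2 i^2 \pmod m$, and the constants $c_1 = 2$, $c_2 = 1$ below are an arbitrary choice for illustration, not prescribed by the book:

```python
# A minimal sketch of insertion with quadratic probing,
# using h(x) = x mod m and probe offsets c1*i + c2*i^2.
m = 11
c1, c2 = 2, 1

def h(x):
    return x % m

def quadratic_probe_insert(table, x):
    """Insert x into table; return the number of probes used."""
    for i in range(m):
        slot = (h(x) + c1 * i + c2 * i * i) % m
        if table[slot] is None:
            table[slot] = x
            return i + 1
    raise RuntimeError("no vacant slot found along this probe sequence")

table = [None] * m
for key in (2, 13, 24):            # the same three colliding keys as before
    quadratic_probe_insert(table, key)
print(table)                       # keys scatter to slots 2, 5, 10 instead of piling up
```

Compare with linear probing on the same keys: instead of one run of three consecutive black boxes, the colliding keys land in well-separated slots.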

A good way to see this in action is to try an example. Suppose, for instance, you have a hash table with 11 buckets and a very simple hash function from the integers into the indices $\{0, 1, 2, \dots, 10\}$ given by $h(x) = x \bmod 11$. Try inserting a collection of numbers like $2, 17, 21, 35, 47, 13, 3, 6, 46, 29, 10$ in that order and count the number of probes required using linear probing. Then compare that with quadratic probing, using successive probe offsets given by $i^2+2i$. In other words, from an initial probe to index $h(x)=t$, look at slots $$ t+3,\ t+8,\ t+4,\ t+2,\ t+2,\ t+4,\ t+8,\ t+3,\ t,\ t+10 $$ if necessary, wrapping $\bmod\ 11$ to keep the indices in range. Count the number of probes this takes; it should be less than linear probing (though I haven't tried it).
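If you'd rather let a machine do the counting, here is a sketch of the exercise (my own code, with the same $m = 11$, $h(x) = x \bmod 11$, and offsets $2i + i^2$ as above). One caveat worth knowing: a quadratic probe sequence need not visit every slot, so an insertion into an almost-full table can fail even while vacant buckets remain, and the code below reports that case as `None` rather than crashing:

```python
# Count probes for the exercise keys under linear and quadratic probing.
m = 11
keys = [2, 17, 21, 35, 47, 13, 3, 6, 46, 29, 10]

def offset(i, scheme):
    # Linear step is just i; the quadratic offsets are c1*i + c2*i^2
    # with the illustrative constants c1 = 2, c2 = 1.
    return i if scheme == "linear" else 2 * i + i * i

def insert(table, x, scheme):
    """Return the number of probes used, or None if no vacant slot is found."""
    for i in range(m):
        slot = (x % m + offset(i, scheme)) % m
        if table[slot] is None:
            table[slot] = x
            return i + 1
    return None  # probe sequence exhausted without finding a vacancy

for scheme in ("linear", "quadratic"):
    table = [None] * m
    probes = [insert(table, x, scheme) for x in keys]
    print(scheme, probes)
```

Running this shows the trade-off for these particular keys: quadratic probing uses fewer probes on most insertions, but because this table ends up completely full, its probe sequence never reaches the last vacant bucket for the final key, while linear probing (which always scans every slot) succeeds.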