15-451 Algorithms                                                 09/25/03

* Hashing
   - basic idea
   - hashing with random inputs
   - Universal hashing
   - Perfect hashing
-----------------------------------------------------------------------------
Hashing: a great practical tool, with interesting & subtle theory too.

Basic setting: just like we've been discussing so far, we have a bunch of
items we want to store for efficient lookup. This is often called the
"dictionary problem". Two versions:

 - static: Given a set S of (key,object) pairs, we want to store S so that
   we can do find(key) operations quickly. E.g., a fixed dictionary
   (key=word, object=definition), or any sort of static lookup table.

 - dynamic: a sequence of insert(key,object), find(key), and, perhaps,
   delete(key) requests. We want to do all of these efficiently.

For the first problem we could use a sorted array with binary search, and
for the second we could use a balanced search tree, but hashing gives an
alternative approach that is often the fastest and most convenient way of
solving these problems. Example: you're writing a game-playing program or
an AI-search program, and you want to store situations that you've already
solved (board positions or elements of state-space) so that you don't redo
the same computation when you encounter them again.

FORMAL SETUP
------------
 - Say keys come from some large universe U (e.g., all strings of fewer
   than 50 characters).

 - There is some set S in U of keys we actually care about (which may be
   static or dynamic). N = |S|.

 - We do inserts and lookups by having an array A of size M and a HASH
   FUNCTION h: U -> {0,...,M-1}. Use the power of indirect addressing to
   store key k (and its associated object) in A[h(k)].

   [One point: if U were small (like 3-character strings) then you could
   just store key k in A[k], as in bucketsort. The problem is that U is
   big. That's why we need the hash function.]

 - One thing to worry about is collisions: h(k1) = h(k2). The simplest way
   to resolve collisions is to have each entry in A be a linked list. To
   insert, just put the item at the top of its list. If h is good, then
   hopefully the lists will be small. This is called "separate chaining".
   Nice properties: searches are easy -- compute the index i = h(key) and
   walk down the list at A[i] until we find it. Inserts and deletes are
   easy too. (See the code sketch at the end of this section.)

Properties we want:
 - M is O(N).
 - Elements are spread out; ideally all buckets (lists) have constant size.
 - h is fast (constant time) to compute.

Let T_h = the time to compute h. The time to do a lookup for key k is
T_h + O(length of the list at h(k)), and an insert just takes time
T_h + O(1). We'll be assuming T_h is O(1), but it's good to keep in the
back of our heads that the time to compute h is an issue.
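Here is a minimal sketch of this setup in Python (the names, like
ChainedHashTable, are just illustrative): an array A of M chains, a hash
function h, and insert/find that work as described above.

    class ChainedHashTable:
        def __init__(self, M, h):
            self.M = M                        # table size
            self.h = h                        # hash function h: key -> {0,...,M-1}
            self.A = [[] for _ in range(M)]   # each entry is a chain (list)

        def insert(self, key, obj):
            # Put (key, obj) at the top of its chain: time T_h + O(1).
            self.A[self.h(key)].insert(0, (key, obj))

        def find(self, key):
            # Compute i = h(key), then walk down the chain at A[i].
            for k, obj in self.A[self.h(key)]:
                if k == key:
                    return obj
            return None

    # Example with M = 8 and the simple hash h(x) = x mod M (discussed below).
    T = ChainedHashTable(M=8, h=lambda x: x % 8)
    T.insert(19, "some object")
    print(T.find(19))    # -> "some object"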
Basic intuition: one way to spread things out nicely is to spread them
randomly. Unfortunately, we can't just use a random number generator to
decide where the next thing goes, because we need h to be usable for
lookups: h(k) has to be the same when we look k up as it was when we
inserted it. So, we want something "pseudorandom".

First the good news. Then the bad news. Then more good news.

GOOD NEWS (I): WHAT IF KEYS ARE RANDOM?
---------------------------------------
Then the property we want of h is just that roughly the same number of
elements of U hash to each location. E.g., a really bad function is
"h(x) = 17, no matter what x is". A good hash function: h(x) = x mod M.
Or, if M = 2^b and U = all strings of some length u, then we can use
h(x) = the last b bits of x. This is typical in systems applications where
the keys look random and maybe b=8.

In this case, we can think of N balls being tossed at random into M boxes.
If we fix some x, what is the expected number of other elements of S
(other balls) that collide with it? Ans: (N-1)/M. This is great, because
if M = N, it means that for any given x, the expected length of x's list
is < 2.

BAD NEWS: WORST-CASE BEHAVIOR IS BAD
------------------------------------
 - Claim: for any hash function h, if |U| >= (N-1)*M + 1, there exists a
   set S of size N whose elements all hash to the same location.

 - Proof: by the pigeonhole principle, some location must receive at least
   N elements of U; let S be N of them.

 - So, this is partly why hashing seems so mysterious -- how can you claim
   hashing is good if, for any hash function, you can come up with a set
   that foils it? One answer is that there are a lot of simple hash
   functions that work well in practice. Another nice answer: universal
   hashing.

The idea is a lot like in randomized quicksort: we'll use some random
numbers in our construction of h. What we'll show is that for *any* set S
(we don't need to assume S is random), if we then pick h in this random
way, things will be good in expectation. Once we develop this idea, we'll
use it for a really nice practical goal: "perfect hashing". Given a fixed
set S, a perfect hash function for S is one for which *every* lookup takes
constant time. We'll use universal hashing as a tool for constructing
perfect hash functions. We'll do universal hashing in this class, and then
perfect hashing on Tuesday.

UNIVERSAL HASHING
-----------------
A set H of hash functions is a universal family if for all x != y in U,

      Pr   [ h(x) = h(y) ]  <=  1/M
     h<-H

(the probability is over the random choice of h from H).

Theorem: if H is universal, then for any set S in U and any x in S, the
*expected* number of collisions x has [over a random h in H] is <= (N-1)/M.

Proof:
 -- For each y != x in S, let C_xy = 1 if x and y collide and 0 otherwise.
    Then E[C_xy] = Pr(x and y collide) <= 1/M.
 -- Let C_x = the total number of collisions for x = sum_y C_xy.
 -- By linearity of expectation, E[C_x] = sum_y E[C_xy] <= (N-1)/M.

Corollary: say the time to compute h is constant. Then for any *sequence*
of L lookups, the expected *total* time for the lookups, if we use a
random h in H, is O(LN/M) -- so constant time per operation if M = N.

Proof: By linearity of expectation. Say we look up x_1, then x_2, ...,
then x_L (some of these could be the same). The expected total time is
just the sum of the expected times for each one, and by the theorem each
one takes expected time O(1 + N/M).

Question: can we actually construct a universal hash family? If not, then
this is all pretty vacuous. The answer is yes.

First, let's do a small concrete example: |U| = 6, M = 2.

     x    :  0  1  2  3  4  5
    --------------------------
    h_1(x):  0  1  0  1  0  1    (what's a bad S for h_1?)
    h_2(x):  0  0  0  1  1  1    (what's a bad S for h_2?)

    [Is this universal yet? No. Problem pairs: (x=0, y=2), (x=3, y=5).]

    h_3(x):  0  0  1  0  1  1
    h_4(x):  1  0  0  1  1  0

    [Now? Yes. The easiest way to see this is to notice that the problem
    pairs above are fine now, and the problem pairs for the bottom two
    functions are (0,3) and (2,5), which are fine on top. Some pairs do
    even better than 1/M, like (2,3), (1,4), (0,5).]
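As a sanity check, here is a tiny Python script (the names are just
illustrative) that confirms the four-function family above is universal:
for every pair x != y, at most |H|/M = 2 of the 4 functions collide on it,
i.e., the probability of a collision over a random h is <= 1/2.

    from itertools import combinations

    # The four functions from the table above, as value lists over U = {0,...,5}.
    H = [
        [0, 1, 0, 1, 0, 1],   # h_1
        [0, 0, 0, 1, 1, 1],   # h_2
        [0, 0, 1, 0, 1, 1],   # h_3
        [1, 0, 0, 1, 1, 0],   # h_4
    ]
    M = 2

    for x, y in combinations(range(6), 2):
        collisions = sum(1 for h in H if h[x] == h[y])
        # Universal: at most |H|/M = 2 of the 4 functions may collide on (x,y).
        assert collisions <= len(H) // M, (x, y)
        print("pair (%d,%d): %d/4 functions collide" % (x, y, collisions))

Running it shows 2 collisions for most pairs and 0 for the pairs (2,3),
(1,4), (0,5) mentioned above.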
HOW TO CONSTRUCT: Matrix method
-------------------------------
Say keys are u bits long, and say the table size is a power of 2, so an
index is b bits long with M = 2^b.

What we'll do is pick h to be a random b-by-u 0/1 matrix, and define
h(x) = hx, where we do addition mod 2. These matrices are short and fat.
For instance:

        h        x     h(x)
     ---------   -     ----
     [1 0 0 0]   1      1
     [0 1 1 1]   0   =  1
     [1 1 1 0]   1      0
                 0

Formally, H = {b-by-u 0/1 matrices}, and we are picking h at random from H.

Claim: For x != y, Pr[h(x) = h(y)] = 1/M = 1/2^b.

Proof: First of all, what does it mean to multiply h by x? We can think of
it as adding up (mod 2) some of the columns of h, where the 1 bits in x
tell you which columns to add. (E.g., above we added the 1st and 3rd
columns of h.)

Now, take arbitrary x != y. They must differ someplace, so say they differ
in the ith coordinate, and for concreteness say x_i = 0 and y_i = 1.
Imagine we first choose all of h except the ith column. Over the remaining
choices of the ith column, h(x) is fixed (since x_i = 0, the ith column is
never added into h(x)). But each of the 2^b different settings of the ith
column gives a different value of h(y) [every time we flip a bit in that
column, we flip the corresponding bit of h(y)], so exactly one of those
settings makes h(y) equal this fixed value of h(x). So there is exactly a
1/2^b chance that h(x) = h(y). (A short code sketch of this construction
is given below.)

Next time: another method for constructing universal hash functions, and a
discussion of *perfect* hashing.
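For concreteness, here is a minimal sketch of the matrix method above in
Python (the function name random_matrix_hash is just illustrative): pick a
random b-by-u 0/1 matrix h and compute h(x) = hx mod 2, treating keys as
u-bit integers.

    import random

    def random_matrix_hash(u, b):
        # Pick h = a random b-by-u 0/1 matrix; keys are u-bit ints, M = 2^b.
        # Each row of h is packed into a u-bit integer.
        rows = [random.getrandbits(u) for _ in range(b)]

        def h(x):
            # Bit i of h(x) is (row i of h) . x mod 2, i.e. the parity of
            # the bitwise AND of that row with x.
            index = 0
            for i, row in enumerate(rows):
                bit = bin(row & x).count("1") % 2
                index |= bit << i
            return index

        return h

    # Example: 32-bit keys, table of size M = 2^8 = 256.
    h = random_matrix_hash(u=32, b=8)
    print(h(12345), h(12346))    # two indices in {0,...,255}

Packing each row into an integer just makes the mod-2 dot products cheap;
the function computed is the same h(x) = hx described above.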