15-451 Algorithms                                                 09/25/03

* Hashing
   - basic idea
   - hashing with random inputs
   - Universal hashing
   - Perfect hashing
-----------------------------------------------------------------------------
Hashing: a great practical tool, with interesting & subtle theory too.

Basic setting: just like we've been discussing so far, we have a bunch of
items we want to store for efficient lookup. This is often called the
"dictionary problem". Two versions:

 - static: Given a set S of (key,object) pairs, we want to store S so that
   we can do find(key) operations quickly. E.g., a fixed dictionary
   (key=word, object=definition), or any sort of static lookup table.

 - dynamic: a sequence of insert(key,object), find(key), and, perhaps,
   delete(key) requests. We want to do all of these efficiently.

For the first problem we could use a sorted array with binary search, and
for the second we could use a balanced search tree, but hashing gives an
alternative approach that is often the fastest and most convenient way of
solving these problems. Example: you're writing a game-playing program or
an AI-search program, and you want to store situations that you've already
solved (board positions or elements of state-space) so that you don't redo
the same computation when you encounter them again.

FORMAL SETUP
------------
 - Say keys come from some large universe U (e.g., all strings of fewer
   than 50 characters).

 - There is some set S in U of keys we actually care about (which may be
   static or dynamic). N = |S|.

 - We do inserts and lookups by having an array A of size M and a HASH
   FUNCTION h: U -> {0,...,M-1}. Use the power of indirect addressing to
   store key k (and its associated object) in A[h(k)].

   [One point: if U were small (like 3-character strings) then you could
   just store key k in A[k], as in bucketsort. The problem is that U is
   big. That's why we need the hash function.]

 - One thing to worry about is collisions: h(k1) = h(k2). The simplest way
   to resolve collisions is to have each entry in A be a linked list. To
   insert, just put the item at the top of its list. If h is good, then
   hopefully the lists will be small. This is called "separate chaining".
   Nice properties: searches are easy -- compute the index i = h(key) and
   walk down the list at A[i] until we find it. Inserts and deletes are
   easy too. (See the code sketch at the end of this section.)

Properties we want:
 - M is O(N).
 - Elements are spread out; ideally all buckets (lists) have constant size.
 - h is fast (constant time) to compute.

Let T_h = the time to compute h. The time to do a lookup for key k is
T_h + O(length of the list at h(k)), and an insert just takes time
T_h + O(1). We'll be assuming T_h is O(1), but it's good to keep in the
back of our heads that the time to compute h is an issue.
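Here is a minimal sketch of this setup in Python (the names, like
ChainedHashTable, are just illustrative): an array A of M chains, a hash
function h, and insert/find that work as described above.

    class ChainedHashTable:
        def __init__(self, M, h):
            self.M = M                        # table size
            self.h = h                        # hash function h: key -> {0,...,M-1}
            self.A = [[] for _ in range(M)]   # each entry is a chain (list)

        def insert(self, key, obj):
            # Put (key, obj) at the top of its chain: time T_h + O(1).
            self.A[self.h(key)].insert(0, (key, obj))

        def find(self, key):
            # Compute i = h(key), then walk down the chain at A[i].
            for k, obj in self.A[self.h(key)]:
                if k == key:
                    return obj
            return None

    # Example with M = 8 and the simple hash h(x) = x mod M (discussed below).
    T = ChainedHashTable(M=8, h=lambda x: x % 8)
    T.insert(19, "some object")
    print(T.find(19))    # -> "some object"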
Basic intuition: one way to spread things out nicely is to spread them
randomly. Unfortunately, we can't just use a random number generator to
decide where the next thing goes, because we need h to be usable for
lookups: h(k) has to be the same when we look k up as it was when we
inserted it. So, we want something "pseudorandom".

First the good news. Then the bad news. Then more good news.

GOOD NEWS (I): WHAT IF KEYS ARE RANDOM?
---------------------------------------
Then the property we want of h is just that roughly the same number of
elements of U hash to each location. E.g., a really bad function is
"h(x) = 17, no matter what x is". A good hash function: h(x) = x mod M.
Or, if M = 2^b and U = all strings of some length u, then we can use
h(x) = the last b bits of x. This is typical in systems applications where
the keys look random and maybe b=8.

In this case, we can think of N balls being tossed at random into M boxes.
If we fix some x, what is the expected number of other elements of S
(other balls) that collide with it? Ans: (N-1)/M. This is great, because
if M = N, it means that for any given x, the expected length of x's list
is < 2.

BAD NEWS: WORST-CASE BEHAVIOR IS BAD
------------------------------------
 - Claim: for any hash function h, if |U| >= (N-1)*M + 1, there exists a
   set S of size N whose elements all hash to the same location.

 - Proof: by the pigeonhole principle, some location must receive at least
   N elements of U; let S be N of them.

 - So, this is partly why hashing seems so mysterious -- how can you claim
   hashing is good if, for any hash function, you can come up with a set
   that foils it? One answer is that there are a lot of simple hash
   functions that work well in practice. Another nice answer: universal
   hashing.

The idea is a lot like in randomized quicksort: we'll use some random
numbers in our construction of h. What we'll show is that for *any* set S
(we don't need to assume S is random), if we then pick h in this random
way, things will be good in expectation. Once we develop this idea, we'll
use it for a really nice practical goal: "perfect hashing". Given a fixed
set S, a perfect hash function for S is one for which *every* lookup takes
constant time. We'll use universal hashing as a tool for constructing
perfect hash functions. We'll do universal hashing in this class, and then
perfect hashing on Tuesday.

UNIVERSAL HASHING
-----------------
A set H of hash functions is a universal family if for all x != y in U,

      Pr   [ h(x) = h(y) ]  <=  1/M
     h<-H

(the probability is over the random choice of h from H).

Theorem: if H is universal, then for any set S in U and any x in S, the
*expected* number of collisions x has [over a random h in H] is <= (N-1)/M.

Proof:
 -- For each y != x in S, let C_xy = 1 if x and y collide and 0 otherwise.
    Then E[C_xy] = Pr(x and y collide) <= 1/M.
 -- Let C_x = the total number of collisions for x = sum_y C_xy.
 -- By linearity of expectation, E[C_x] = sum_y E[C_xy] <= (N-1)/M.

Corollary: say the time to compute h is constant. Then for any *sequence*
of L lookups, the expected *total* time for the lookups, if we use a
random h in H, is O(LN/M) -- so constant time per operation if M = N.

Proof: By linearity of expectation. Say we look up x_1, then x_2, ...,
then x_L (some of these could be the same). The expected total time is
just the sum of the expected times for each one, and by the theorem each
one takes expected time O(1 + N/M).

Question: can we actually construct a universal hash family? If not, then
this is all pretty vacuous. The answer is yes.

First, let's do a small concrete example: |U| = 6, M = 2.

     x    :  0  1  2  3  4  5
    --------------------------
    h_1(x):  0  1  0  1  0  1    (what's a bad S for h_1?)
    h_2(x):  0  0  0  1  1  1    (what's a bad S for h_2?)

    [Is this universal yet? No. Problem pairs: (x=0, y=2), (x=3, y=5).]

    h_3(x):  0  0  1  0  1  1
    h_4(x):  1  0  0  1  1  0

    [Now? Yes. The easiest way to see this is to notice that the problem
    pairs above are fine now, and the problem pairs for the bottom two
    functions are (0,3) and (2,5), which are fine on top. Some pairs do
    even better than 1/M, like (2,3), (1,4), (0,5).]
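As a sanity check, here is a tiny Python script (the names are just
illustrative) that confirms the four-function family above is universal:
for every pair x != y, at most |H|/M = 2 of the 4 functions collide on it,
i.e., the probability of a collision over a random h is <= 1/2.

    from itertools import combinations

    # The four functions from the table above, as value lists over U = {0,...,5}.
    H = [
        [0, 1, 0, 1, 0, 1],   # h_1
        [0, 0, 0, 1, 1, 1],   # h_2
        [0, 0, 1, 0, 1, 1],   # h_3
        [1, 0, 0, 1, 1, 0],   # h_4
    ]
    M = 2

    for x, y in combinations(range(6), 2):
        collisions = sum(1 for h in H if h[x] == h[y])
        # Universal: at most |H|/M = 2 of the 4 functions may collide on (x,y).
        assert collisions <= len(H) // M, (x, y)
        print("pair (%d,%d): %d/4 functions collide" % (x, y, collisions))

Running it shows 2 collisions for most pairs and 0 for the pairs (2,3),
(1,4), (0,5) mentioned above.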
HOW TO CONSTRUCT: Matrix method
-------------------------------
Say keys are u bits long, and say the table size is a power of 2, so an
index is b bits long with M = 2^b.

What we'll do is pick h to be a random b-by-u 0/1 matrix, and define
h(x) = hx, where we do addition mod 2. These matrices are short and fat.
For instance:

        h        x     h(x)
     ---------   -     ----
     [1 0 0 0]   1      1
     [0 1 1 1]   0   =  1
     [1 1 1 0]   1      0
                 0

Formally, H = {b-by-u 0/1 matrices}, and we are picking h at random from H.

Claim: For x != y, Pr[h(x) = h(y)] = 1/M = 1/2^b.

Proof: First of all, what does it mean to multiply h by x? We can think of
it as adding up (mod 2) some of the columns of h, where the 1 bits in x
tell you which columns to add. (E.g., above we added the 1st and 3rd
columns of h.)

Now, take arbitrary x != y. They must differ someplace, so say they differ
in the ith coordinate, and for concreteness say x_i = 0 and y_i = 1.
Imagine we first choose all of h except the ith column. Over the remaining
choices of the ith column, h(x) is fixed (since x_i = 0, the ith column is
never added into h(x)). But each of the 2^b different settings of the ith
column gives a different value of h(y) [every time we flip a bit in that
column, we flip the corresponding bit of h(y)], so exactly one of those
settings makes h(y) equal this fixed value of h(x). So there is exactly a
1/2^b chance that h(x) = h(y). (A short code sketch of this construction
is given below.)

Next time: another method for constructing universal hash functions, and a
discussion of *perfect* hashing.
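For concreteness, here is a minimal sketch of the matrix method above in
Python (the function name random_matrix_hash is just illustrative): pick a
random b-by-u 0/1 matrix h and compute h(x) = hx mod 2, treating keys as
u-bit integers.

    import random

    def random_matrix_hash(u, b):
        # Pick h = a random b-by-u 0/1 matrix; keys are u-bit ints, M = 2^b.
        # Each row of h is packed into a u-bit integer.
        rows = [random.getrandbits(u) for _ in range(b)]

        def h(x):
            # Bit i of h(x) is (row i of h) . x mod 2, i.e. the parity of
            # the bitwise AND of that row with x.
            index = 0
            for i, row in enumerate(rows):
                bit = bin(row & x).count("1") % 2
                index |= bit << i
            return index

        return h

    # Example: 32-bit keys, table of size M = 2^8 = 256.
    h = random_matrix_hash(u=32, b=8)
    print(h(12345), h(12346))    # two indices in {0,...,255}

Packing each row into an integer just makes the mod-2 dot products cheap;
the function computed is the same h(x) = hx described above.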