15-451 Algorithms 9/28/00 * Hashing - dictionary problem - hashing with random inputs - Universal hashing - Perfect hashing (next time) ----------------------------------------------------------------------------- THE DICTIONARY PROBLEM: - static version: Given set S of (key,object) pairs (e.g, key=word, object=definition). Want to store S so that we can do lookups efficiently. - dynamic: sequence of insert(key,object), find(key), and, perhaps, delete(key) requests. Want to do these all efficiently. Pretty much all the algorithms and data structures we've been talking about so far are motivated at least in part by these problems. E.g., one reason to do sorting is that a sorted array is a good way of handling a static dictionary, since then can do binary search. Sorted arrays not so good for dynamic - but can use B-trees or Splay Trees or Tries. Hashing is another way to do it - very practical and easy to implement. You probably implemented them in 211. But a little mysterious too. Our focus will be on the theoretical underpinnings, and we'll draw on 2-player game way of thinking from hwk2. BASIC IDEA OF HASHING - define a hash function h that maps keys into numbers from 0...M-1. - to store our dictionary, we will allocate an array A of size M and use the power of indirect addressing. We'll store (key, object) in A[h(key)]. To handle collisions (h(key) = h(key')), we'll make each entry of A be a linked list, and just put at top of list. - searches are easy: compute index i=h(key) and then walk down list at A[i] until we find it. Deletes are easy too. Ideal setting: M is O(size of dictionary), all buckets (lists) are of constant size, and h takes constant amount of time to compute. In that case, what is time for insert/search/delete? To analyze hashing schemes, make a few definitions. - Let U be the "universe" of possible keys (e.g, all < 50-character strings) - So, h is a function from U to {0,...,M-1}. - Some much smaller subset S of U that we care about. (S may be static or changing with time). Use N = |S|. - Let T_h = time to compute h. Time to do a lookup for key k is T_h + O(length of list at h(k)). insert just takes time T_h + O(1). We'll be assuming T_h is O(1) but good to keep in back of our heads that the time to compute h is an issue. - basic intuition: really bad hash function: h maps everything in S to the same place. Really good hash function: S gets evenly spread through range. Almost as good: h spreads elements randomly --- but, we can't just use a random number generator to decide where the next thing goes because need h to be able to do lookups: h(k) has to be the same when we lookup as it was when we inserted. WORST-CASE BEHAVIOR (I) - Claim: for any hash function h, if |U| >= N*M, there exists a set S of size N that all hash to the same location. - Proof: by the pigeon-hole principle. - So, this is partly why hashing seems so mysterious -- how can you claim hashing is good if for any hash function you can come up with ways of foiling it? Any thoughts? If think in 2-player game view, can imagine different hash functions as rows and different sets S as columns, and we are saying that for any deterministic strategy of row player, there is a column that kills it. So, one way around is to assume column is picked randomly. Another is to allow randomized strategy for row player. E.g., maybe can pick hash function at random from some set and argue it has good expected behavior for any column. Will get back to that in a bit. Start with easier case of random inputs. WHAT IF KEYS ARE RANDOM? U = all u-bit strings, M = 2^m. Keys chosen at random in U. What would be a good hash function? -- can just use: h(x) = last m-bits of x. Typical in systems applications where keys are random looking and maybe m=8. Say have N keys total (chosen at random). If we fix one of them, what is expected number of other elements that collide with it? Basically throwing N balls at random into M boxes, and asking how many collide with first ball. Expected number is (N-1)/M. So, if N=M, then for any given element, expected length of list is < 2. So, just need linear size table and average time is constant. WHAT IF S ISN'T RANDOM? Idea is to use some randomness in choice of h. E.g., maybe we can find a set H of h's such that no matter what S is, *most* h in H behave well. This is idea of universal hashing. UNIVERSAL HASHING ----------------- Definition: a set H is a universal family of hash functions if for all x != y in U, Pr [ h(x) = h(y) ] <= 1/M h<-H Theorem: if H is universal, then for any set S in U of size N, and any x in S, the *expected* number of collisions x has (where this is over the choice of h) is <= (N-1)/M Proof: -- For each y!=x in S, let C_xy = 1 if x and y collide and 0 otherwise. E[C_xy] = Pr(x and y collide) <= 1/M -- C_x = total # collisions for x = sum_y C_xy. -- E[C_x] = sum_y E[C_xy] <= (N-1)/M Corollary: say N=M, and time to compute h is constant. Then for any set S of size N, and any sequence of L lookups, the expected total time for the lookups if we use a random h in H is O(L) -- so constant time per operation. Proof: say lookup x_1 then x_2, ..., then x_L (some of these could be the same). Then the expected total time is just the sum of expected time for each one, and each has constant expected time by our theorem. So, this is giving us a good randomized strategy in the 2-player game sense. Example of universal family |U| = 6, M=2 x : 0 1 2 3 4 5 ------------------------ h_1(x): 0 1 0 1 0 1 (what's a bad S for h_1?) h_2(x): 0 0 0 1 1 1 (what's a bad S for h_2?) [is this universal yet? no. problems: (x=0, y=2), (x=3, y=5)] h_3(x): 0 0 1 0 1 1 h_4(x): 1 0 0 1 1 0 [Now? Yes. Easiest way to see is notice that above pairs are fine now. And problem pairs on the bottom are (0,3) and (2,5) which are fine on top. Some pairs are better than 1/m, like (2,3),(1,4),(0,5)] HOW TO CONSTRUCT: Matrix method ----------------- Say keys are u-bits long. Say table size is power of 2, so an index is m-bits long with M = 2^m. H = { u x m 0-1 matrices}. h(x) = hx (h is a matrix, x is a vector), where we do addition mod 2. (these matrices are short and fat) Claim: For x!=y, Pr[h(x) = h(y)] = 1/M = 1/2^m. Proof: First of all, what does it mean to multiply h by x? E.g., h x h(x) --------- - --- [1 0 0 0] 1 1 [0 1 1 1] 0 = 1 [1 1 1 0] 1 0 0 Can think of as adding some of the columns of h (mod 2) where the 1 bits in x tell you which ones to add. (e.g., we added the 1st and 3rd columns of h above) Now, take arbitrary x!=y. Must differ someplace, so say they differ in ith coordinate and for concreteness say x_i=0 and y_i=1. Imagine we first choose all of h but the ith column. Over remaining choices of ith column, h(x) is fixed. But, each of the 2^m different settings of the ith column gives a different value of h(y) [every time we flip a bit in that column, we flip the corresponding bit in h(y).] So there is exactly a 1/2^m chance that h(x) = h(y). Here's another universal family that's easier to deal with computationally: Lets let U = {0,...,u-1} and M = {0,...,m-1}. Pick prime p >= u (or, think of just rounding u up to nearest prime). Define h_{a,b}(x) = ( (a*x + b) mod p ) mod m. H = {h_ab | a in {1,...,p-1}, b in {0,...,p-1}} Claim: H is a universal family too.