15-451 Algorithms                                                 10/03/00

* Universal hashing, cont
* Perfect hashing

----------------------------------------------------------------------------

Hashing recap from last time:

General picture: There's a large universe U of possible keys (like all
possible strings of at most 50 chars).  Some set S in U of keys we actually
care about (which may be static or dynamic), N = |S|.  Do inserts and
lookups by having an array A of size M and a hash function
h: U -> {0,...,M-1}.  Resolve collisions by having each entry in A be a
linked list.  So, we want as few collisions as possible.

Last time we saw:

1. For any given function h, there exist bad sets S (so long as U is large)
   that all hash to the same place.

2. If S is *random* then simple h's will turn this into the problem of
   tossing N balls at random into M buckets.
   -> for any given x, the expected number that collide with x is (N-1)/M.
   -> if N=M, this is < 1.
   -> if N=M, w.h.p., the size of the LONGEST list is O(log N/loglog N).

   Proof of this last fact: Let's fix a bucket, say bucket 1.  Let's also
   fix k balls.  What's the chance those k all fall into bucket 1?
   Ans: 1/M^k.  Claim: the chance that bucket 1 gets >= k balls is at most
   (N choose k)/M^k.  The reason is that if we look at these (N choose k)
   events, at least one has to occur in order for >= k balls to fall into
   bucket 1.  For N=M, this is <= 1/k!.  Suppose we set k so that this is
   0.01/M.  Then the chance that *any* bucket gets k or more balls is at
   most 0.01.  This happens at k = O(log N/log log N).  [And you can work
   out the constants using Stirling's formula.]

3. Universal hashing: A set H is a universal family of hash functions if
   for all x != y in U,

        Pr   [ h(x) = h(y) ] <= 1/M
       h<-H

   Theorem: if H is universal, then for any S and any x in S, the expected
   number of collisions with x [for h random from H] is <= (N-1)/M.  So,
   this is like a good randomized strategy for the row-player.
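The balls-in-bins claims above are easy to check empirically.  Here is a
minimal simulation sketch (not part of the original notes; the function
name and parameters are arbitrary choices):

```python
import random

def max_load(n, m, seed=0):
    """Toss n balls uniformly at random into m buckets and return the
    size of the fullest bucket (the longest list, in hashing terms)."""
    rng = random.Random(seed)
    counts = [0] * m
    for _ in range(n):
        counts[rng.randrange(m)] += 1
    return max(counts)

# With N = M = 1000 the *average* load is 1, yet some list is longer --
# just not much longer: Theta(log N / loglog N), a small number here.
print(max_load(1000, 1000))
```

Running this for a few seeds shows max loads of only around 4-6, matching
the O(log N/loglog N) bound rather than anything close to N/M times a
large factor.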
[But our argument that the largest bucket has size O(log N/loglog N) breaks
down, because we can't necessarily make the same claims about k-tuples for
k > 2.  E.g., we can't say that Pr(h(x)=h(y)=h(z)) <= 1/M^2.]

Question: if we fix the set S, can we find a hash function h such that
*all* lookups are constant-time?  Yes.  This leads to...

PERFECT HASHING
---------------

A hash function is "perfect" for S if all lookups involve O(1) work.
(Usually defined as "no collisions", but that's a little confusing in a
2-level scheme.)

METHOD #1
---------

Say M = O(|S|^2) is OK.  I.e., we are willing to have a table whose size is
quadratic in the size of our dictionary S.  Then, here is an easy method.
Let H be universal and M = |S|^2.  Pick a random h in H and hash everything
in S.

Claim: Pr(no collisions) >= 1/2.  [So, we just try it, and if we got any
collisions, we just try again.]

Proof:
 - How many pairs (x,y) in S are there?  (|S| choose 2).
 - For each pair, the chance they collide is <= 1/M by defn of universal.
 - So, Pr(exists a collision) <= (|S| choose 2)/M < 1/2.

This is like the "other side" of the "birthday paradox".  If the number of
days is *more* than (# people)^2, then there is a reasonable chance that
*no* pair has the same birthday.

What if we want to use just O(|S|) space?

METHOD #2
---------

History: This was a big open question -- posed as "should tables be
sorted?" -- for a fixed set, can you get constant lookup time with only
linear space?  It was followed by more and more complicated attempts, and
finally solved in the mid-80s by Fredman, Komlos, and Szemeredi using the
nice idea of universal hash functions in a 2-level scheme.

One thing to mention -- a lot of the things we talk about were big research
results in the 70s and 80s, now digested into course topics.  They're not
the sort of thing people have known for ages -- a lot of smart people
struggled over understanding how to do these things...

Proposal: hash into a table of size |S|.  We will get some collisions.
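Method #1 is short enough to sketch in code.  This sketch (an illustration,
not from the notes) draws h from the (a*x + b mod p) mod M family described
at the end of these notes; the prime and the sample set are arbitrary
choices:

```python
import random

P = 2_147_483_647   # the prime 2^31 - 1; assumes all keys are ints < P

def perfect_hash_quadratic(S, seed=0):
    """Method #1: hash the fixed set S into a table of size M = |S|^2
    with zero collisions.  Each random draw from the universal family
    h_{a,b}(x) = ((a*x + b) mod p) mod M succeeds with probability
    >= 1/2, so we expect at most 2 tries."""
    rng = random.Random(seed)
    M = len(S) ** 2
    while True:
        a = rng.randrange(1, P)
        b = rng.randrange(P)
        h = lambda x, a=a, b=b: ((a * x + b) % P) % M
        if len({h(x) for x in S}) == len(S):    # no collisions -- done
            return h, M

S = [12, 97, 344, 1001, 5002, 777777]
h, M = perfect_hash_quadratic(S)
print(len({h(x) for x in S}), M)                # 6 36
```

Note the loop only terminates once h is collision-free on S, so the
returned function is perfect for S by construction.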
Then, for each bin, rehash it using Method #1, squaring the size of the bin
to get zero collisions.  E.g., if the 1st hash function maps {a,b,c,d,e,f}
to [--, {a}, {b,c}, {d,e,f}], then the final result would look something
like:

    [--, h_2, h_3, h_4]

h_2 would just be the function h_2(x) = 0.  h_3 would be a function to a
table of size 4, and h_4 would be a function to a table of size 9.

Point is: using Method #1, we can find each h_i with no collisions by
picking from a universal H and trying it -- if it doesn't work, we try
again (each time, Pr(success) >= 1/2).  So, the total space used is O(|S|)
for the 1st table (assume it takes a constant amount of space to store
each hash function), plus O(sum of |B_i|^2) for the other tables.

So, all we need to do is prove the following theorem:

Theorem: if we pick the initial h from a universal set H, then

    Pr[ sum_i |B_i|^2 > 4*|S| ] < 1/2.

Proof: We'll prove this by showing that E[sum_i |B_i|^2] < 2*|S|.  (Why
does this give us the result we need?  <-- "Markov's inequality")

Now, the neat trick is that one way to count this quantity is to count the
number of ordered pairs that collide, including each element colliding
with itself.  E.g., if a bucket has {d,e,f}, then d collides with each of
d,e,f; e collides with each of d,e,f; and f collides with each of d,e,f;
so we get 9.  So,

    sum_i |B_i|^2 = # of ordered pairs that collide (allowing x=y)
                  = sum_x sum_y C(x,y)   [C(x,y)=1 if in same bucket, else 0]

So,

    E[sum_i |B_i|^2] = sum_x sum_y Pr(x and y are in the same bucket)
                     <= |S| + |S|(|S|-1) * (1/M)
                     < 2*|S|   (since we set M = |S|),

where the "1/M" comes from the definition of universal, and the first |S|
term counts the pairs with x=y.

So, the final "hash function" is:
  1. Compute i = h(x).
  2. Compute h_i(x) and use it to look up x in table T_i.

Total space used is O(|S| + sum of |B_i|^2) = O(|S|).

============================================================================

If there's time, here is another way of achieving universal hashing.
Let's assume keys are in {0,1,...,U-1}.
Hashing down to {0,1,...,M-1}:

Pick a prime p >= U (or, think of just rounding U up to the nearest
prime).  Define

    h_{a,b}(x) = ( (a*x + b) mod p ) mod M

    H = { h_{a,b} | a in {1,...,p-1}, b in {0,...,p-1} }

Claim: H is a universal family.

Proof idea: the inner part will spread things out nicely in the range
0..p-1.  In particular, given x != y, (1) x is equally likely to go
anywhere, and (2) given where x goes, y is equally likely to go to each of
the p-1 places x didn't go (there won't be any collisions at the mod p
level).  This then allows us to show things will be nice when we take the
results mod M.

Actual proof: Fixing x != y and two locations r and s, let's look at:

    Pr[ (ax + b) = r (mod p)  AND  (ay + b) = s (mod p) ].

 -> This means that ax - ay = r - s (mod p), so a = (r-s)/(x-y) mod p,
    which has exactly one solution (mod p) since Z_p is a field.
    [We needed p >= U to ensure that x != y mod p.]
 -> If r=s, then this means we need a=0, which wasn't allowed.  So Pr=0.
 -> If r!=s, then there is a 1/(p-1) chance that a has the right value.
    Given this value of a, we need b = r - ax (mod p), and there is a 1/p
    chance b gets this value, so the overall probability is 1/[p(p-1)].

Now, the probability that x and y collide is 1/[p(p-1)] times the number
of pairs r != s in {0,...,p-1} such that r = s (mod M).  We have p choices
for r, and then at most ceiling(p/M) - 1 choices for s ("-1" since we
disallow s=r).  The product is at most p(p-1)/M.  Putting this all
together,

    Pr[ ((a*x + b) mod p) mod M = ((a*y + b) mod p) mod M ]
        <=  [p(p-1)/M] * 1/[p(p-1)]  =  1/M.

QED
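To tie the lecture together, here is a sketch (an illustration, not from
the notes) of the full two-level FKS construction using this mod-p family,
followed by an exhaustive check of the 1/M collision bound for one small
prime.  The prime, key set, and helper names are all arbitrary choices:

```python
import random

P = 2_147_483_647   # the prime 2^31 - 1; assumes all keys are ints < P

def draw_hash(M, rng):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod M from the universal family."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x, a=a, b=b: ((a * x + b) % P) % M

def build_fks(S, seed=0):
    """Two-level (FKS) perfect hash table for a fixed set S of ints.
    Level 1: hash into |S| buckets, retrying until sum_i |B_i|^2 <= 4|S|
    (each try succeeds with probability > 1/2, by the theorem above).
    Level 2: rehash bucket B_i into |B_i|^2 slots with no collisions,
    retrying as in Method #1."""
    rng = random.Random(seed)
    n = len(S)
    while True:                                   # level 1
        h = draw_hash(n, rng)
        buckets = [[] for _ in range(n)]
        for x in S:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    tables = []                                   # level 2
    for b in buckets:
        m2 = len(b) ** 2
        while True:
            h2 = draw_hash(m2, rng) if m2 > 0 else None
            slots = [None] * m2
            if all(slots.__setitem__(h2(x), x) is None and
                   slots.count(x) == 1 for x in b) if False else True:
                pass
            slots = [None] * m2
            ok = True
            for x in b:
                i = h2(x)
                if slots[i] is not None:          # collision: redraw h2
                    ok = False
                    break
                slots[i] = x
            if ok:
                tables.append((h2, slots))
                break
    return h, tables

def lookup(x, h, tables):
    """O(1) lookup: one hash at each level, then one comparison."""
    h2, slots = tables[h(x)]
    return h2 is not None and slots[h2(x)] == x

S = [3, 17, 40, 59, 271, 828, 1828, 45904]
h, tables = build_fks(S)
print(all(lookup(x, h, tables) for x in S))       # True
print(lookup(5, h, tables))                       # False

# Exhaustive check of universality for a small prime: for p=11, M=4 and
# the fixed pair x=3, y=7, count collisions over all p(p-1) = 110 choices
# of (a,b); the fraction must be <= 1/M.
p, M, x, y = 11, 4, 3, 7
coll = sum(((a*x + b) % p) % M == ((a*y + b) % p) % M
           for a in range(1, p) for b in range(p))
print(coll / (p * (p - 1)) <= 1 / M)              # True
```

Total space is the level-1 table plus sum_i |B_i|^2 <= 4|S| level-2 slots,
i.e. O(|S|), and every lookup does a constant amount of work.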