15-750 Graduate Algorithms 2/8/01 * A randomized linear-time MST algorithm [KKT] [Note: extensive MST tutorial at http://www.cis.upenn.edu/~jeisner/papers/eisner.mst-tutorial.ps] ================================= Preliminaries: - Assume all weights are distinct (o.w., number the edges in an initial pass, and use these to break ties). Note that then the MST is unique. - Cycle property: for any cycle in the graph, the MST doesn't have the heaviest edge in the cycle. [Book calls this the Red rule] Proof: if it did, remove the edge, which splits the tree into two components. Walk around the cycle until you find two nbrs from different components, and then put that edge in and get a better tree. Alternate way of looking at it: Watch behavior of Kruskal on this cycle, and the number of different compoents represented. By the time you get to the last edge, everybody is in the same component. - Cut property: for any nonempty set X of vertices, the lightest edge with exactly one endpt in X is in the MST. [Book calls this the Blue rule] Proof: Kruskal puts it in. Boruvka's algorithm (1926): For each vertex, select the shortest edge incident. We know all these will be in the tree by the cut property. These selected edges for a collection of components (at most n/2 of them). Now, contract all components, getting rid of multiple edges and self-loops, and repeat. [when there are multiple edges, just keep the shortest ofthem] Running time of Boruvka: selectingthe edges takes time SUM_v d(v) = O(m). Contracting can also be done in time O(m). Probably easiest is 2-pass DFS: in pass 1, number the vertices based on which component. In step 2, construct the new graph. Number of steps is O(log n). So, total time is O(m log n). Side note: Remember how Prim with Fib heaps was O(m + n log n). So if you do loglog(n) steps of Boruvka, then the number of nodes is now n/log n. So, you can run Prim on what's left in time O(m). Total time is then O(m loglog(n))). That's basically linear...right? ================================== Linear time algorithm: High level idea is that if we could somehow keep the density of the graph constant, so that after each step of Boruvka the number of edges went down by a factor of 2 as well, then our costs would telescope. So, we need to develop a complementary algorithm that somehow removes edges that cannot possibly be in the MST. For various reasons, we'll want to allow for disconnected graphs G and talk about minimum spanning Forests. All the discussion above works then too. Def: Given a forest F for a graph G, we'll say that an edge (u,v) in G is *heavy* for F if (a) u and v are connected in F, but not with this edge, and (b) weight(u,v) > max edge weight in u-v path in F. -> So, if F is a minimum spanning forest, then *all* edges not in F are heavy for F. -> Also, for any forest F, heavy edges cannot be in MSF. Why? (follows from cycle property). Magic fact: Given a graph G and a forest F, can find all the heavy edges for F in time O(m). [Note 1: it's NOT easy -- maybe sketch at end] [Note 2: This is sometimes called the "verification" problem, since if someone handed us a supposed MSF F and we wanted to check it, we'd need to verify that every edge not in F is heavy for F.] Here's the plan for the algorithm: We'll do 2 Boruvka steps. Then, in linear time we find a "pretty good" forest F (will use randomized strategy for this). Then in linear time we get rid of the heavy edges. If all went well, we're at a good density again. Note: finding the "pretty good" forest will involve computing MSF of smaller graph than G, recursively. So, running time will havea feel like in linear-time median finding. More specifically, we're going to randomly sample edges from G, taking each one with prob p. Call new graph H. Then will find MSF F of H. The key to this working will be the following lemma: Lemma: The expected number of non-heavy edges in the above experiment is at most n/p. [So, new graph will be sparse again] Proof: This is pretty neat. For purposes of analysis, imagine looking at the edges in order from shortest to longest (we don't have time to sort, but this is just for analysis) and using Kruskal to build F at the same time as we pick H. Specifically, we start with F and H both empty. Then for each edge, we flip a coin of bias p to decide whether to put into H. If we do put it into H, we look to see if both endpoints are in the same component of F, and if not, we put it into F also. Now, here's the trick: suppose that conceptually we *first* check to see if both endpoints are in the same component of F, and if not, mark the edge, and *then* flip the coin of bias p. We still do the same thing: if the coin is heads *and* the edge is marked we put the edge into F and H, if heads an unmarked then H only, and if tails then neither. Note that all unmarked edges are heavy by definition, so our job is to bound the number of marked edges. Now, here's the point: each time we have a marked edge, we have prob p of putting it into F, and F can have at most n-1 edges in the end. So, if we just look at coin flips on marked edges, what we have is basically a situation where we're tossing a coin of bias p until n-1 heads occur and we want to know what is the expected number of flips this took. (Actually, what we have is some complicated stopping criteria but that for certain has stopped by the time we get n-1 heads). As you might expect, the expected number of flips is (n-1)/p. [Proof of that last statement: what if we just flip until we see one head? The expected number of flips this takes is 1/p. Can see this by writing equation E(flips) = p*1 + (1-p)*[1 + E(flips)]. Then use linearity of expectation.] So, that proves the lemma. Now we just need to put all of this together (and after that maybe go into "magic fact") Algorithm: 1. Run two Boruvka steps. 2. Run sparsification step with p=1/2. 3. Repeat. (Note: Step 2 involves a recursive run of algorithm on H). Running time analysis: Let's first do a back-of-the-envelope calculation as if alg was deterministic, and we hit all expectations exactly. Then we get a reccurence T(m,n) <= c*m + T(m/2,n/4) + T(n/2,n/4). This looks messy, but suppose we assume that m >= 2n at the start (add dummy edges if we need to). Then this holds in the recurrence, and the point is that n/2 <= m/4. So, we can ignore n and just write: T(m) <= c*m + T(m/2) + T(m/4), and we have linear time. Now, let's do it right. Let's write out the recursion tree. We can write it as a binary tree where left branch corresponds to step 2 and right branch corresponds to step 3. In any particular node, can write down the number of edges (which will be a random variable) and our total running time is the sum of these. So, by linearity of expectation, can get our expected running time by summing up the expectations. If Y is the left-child of X, then our theorem tells us that E[Y | X=k] <= k/2. This means that E[Y] <= E[X]/2. The next point is that if a node is at level i, then it has at most n/4^i vertices. So, if Z is a right-child of somebody and it is at level i, then the expected number of edges is at most 2*(n/4^i). So, now we just need to add things up. In fact, can now go back to our handwavy argument by just reinterpreting recurrence as a way of adding all these expectations. ============================= Basic idea of MSF verification (that "magic step"): For each edge e = (u,v) with u and v connected in F, we want to find the length of the longest edge in F in their path (so we can compare that to e). Let's call this Max_F(u,v). The problem is we need to handle all m of these queries in time O(m). Here's part of the idea. Say we run Boruvka's alg on just the forest F. This is weird since we know we'll just end up with F (since it is its own MSF), but the point is to look at the order in which edges are added in. The first thing is we *can* run Boruvka in time O(n) here, since each contracting phase cuts the number of edges in half. The second interesting thing is that suppose for each node in F, we keep track of the longest edge ever selected by its component in the run of Boruvka so far. Then when u and v get connected, we can quickly compute Max_F(u,v). But, it is not at all clear how to update these things efficiently, and that's where all the hard work comes in.