15-750 04/19/01 * A simple coupling example (walk on a line) * eigenvalues and rapid mixing * conductance and expanders * a practical example of MCMC (Markov Chain Monte Carlo). See also: http://www.cs.berkeley.edu/~sinclair/mcmc.ps ================================== One more (simple) coupling example: Consider a random walk on the line 1,...,n. You start at point 1. At each step you flip a coin: heads means go left, tails means go right. If you're at 1 and get a heads, just stay put (same if you're at n and get a tails). 1. What is the stationary distribution? Ans: uniform. [Can verify directly or view as special case of random walk on graph, where the self-loops give each node the same out-degree] 2. How long until we get close? Let's usse a coupling argument. We are particle x, starting at location 1. Also imagine particle y, at a random spot according to stationary distribution. Now, let's correlate the walks: each time we get heads, so does y, and same with tails. So, y stays uniformly distirbuted. But also, notice that our distance to y never goes up, and sometimes goes down. In particular, by the time we reach point n, we must have coupled. We can now use fact from before that expected time to reach n is at most 2mn = 2n^2. Let's use "Markov's inequality": the probability we have reached by time 4n^2 is at least 1/2. If not, then at least a 1/2 chance in next 4n^2 steps. So, if we want success probability 1-epsilon, then we just have to walk for O(n^2* log(1/epsilon)) steps. So, in this many steps, the prob we have *not* coupled is at most epsilon, so by our lemma from last time, this means our variation distance is at most epsilon. That's it. Note: the lemma about variation distance is kind of clever. More straightforward way to do that last part of the argument is: Pr(x is at i) >= Pr(y is at i)*Pr(coupled | y is at i). We want this to be at least 1/n * (1-epsilon). We know Pr(y is at i) is 1/n. But the coupling probability might depend on y [think of n=2 and a zero-step walk]. So you end up getting a bit worse bounds if you do it this way. ================================== Now we're going to talk more generally about what properties of a graph make the walk rapidly mixing. One way to think of a random walk is as a matrix-vector product. Say we have n states. Define transition matrix: M_ij = prob of going to j given that you're in state i. E.g., our random walk on the line... Write a prob distribution as a row vector q. Then, one step of walk = qM stationary distribution: eigenvector of eigenvalue 1. Say we're interested in rapid mixing. Want to approach stationary dist in time poly in log(n). Can we talk about this in terms of properties of the matrix? Eigenvalue gap -> Rapid Mixing ============================== THEOREM: Say M is a transition matrix with real eigenvalues and orthogonal eigenvectors. Then, for any starting distribution, the L_2 distance between q^(t) (the distribution after t steps) and the stationary distribution \pi is at most |\lambda_2|^t, where \lambda_2 is the eigenvalue of second-largest absolute value. First of all, when do we get real eigenvalues and orthogonal eigenvectors? Well, one case is when M is symmetric. E.g., think of a random walk on a regular graph (all nodes have the same degree). In fact, can generalize to ``time-reversible'' Markov chains: \pi_i * M_{ij} = \pi_j * M_{ji} for all i,j. E.g., random walk on an undirected graph where nodes are not necessarily all the same degree. [Non-time-reversible: random walk on a directed cycle]. Second: we know that the stationary distribution is an eigenvector of eigenvalue equal to 1. This is also the largest eigenvalue: all other eigenvalues have magnitude <= 1. (The proof of the theorem will show us why.) The difference between 1 and |\lambda_2| is called the "eigenvalue gap". So, if we have an eigenvalue gap of some amount gamma, and we want an L_2 distance of delta_2, then we need only O((1/gamma)*log(1/delta_2)) steps. We'll relate L_2 distance to variation distance in a minute. PROOF OF THEOREM: Say orthogonal eigenvectors are e_1, \ldots, e_n. e_1 = \pi. Say we start at q^{(0)} = c_1e_1 + ... + c_ne_n. By definition of eigenvectors (and since M(a+b) = Ma + Mb), after t steps, we're at: q^{(t)} = c_1e_1 + (c_2(\lambda_2)^te_2 +... + c_n (\lambda_n)^te_n). Notice that |lambda_2| cannot be larger than 1, since otherwise our vector is getting bigger and bigger. Can't have a probability vector with a Euclidean length longer that 1 (Euclidean length is always between 1/sqrt(n) and 1). Assuming that |lambda_2| < 1 (else the theorem is vacuous), and e_1 is pi, we must have c_1 = 1. So, the L_2 distance between q^(t) and e_1 is just the length of the vector corresponding to the second part of the RHS. We can rewrite this vector like: (lambda_2)^t * blah where blah is shorter than (c_2e_2 + ... + c_ne_n), which for sure is shorter than q^(0), which has length at most 1. So, this means the L_2 distance between q^(t) and e_1 is at most (lambda_2)^t. QED How does this relate to variation distance? Variation distance = 1/2 of L_1 distance. L_1 distance is at most sqrt(n) * L_2 distance. [[E.g., if you have a unit-length vector in R^n, the largest the sum of the entries can be is if they are all equal to 1/sqrt(n), in which case the sum is sqrt(n).]] So, if we want a variation distance of delta_1, we just need to run for O((1/gamma)*log(n/delta_1)) steps. What does the eigenvector of 2nd-largest eigenvalue look like? It has some positive entries and some negative entries. E.g., if graph is regular, then the eigenvector of highest eigenvalue has all entries equal to 1/n; since the next one is perpendicular, the sum of its positive entries must equal the sum of its negative entries. If the graph is dumbbell shaped, then the positives will be on one side and the negatives will be on the other. These describe the sets that in a sense mix most slowly together. In fact, this approach is often used for graph partitioning (splitting into large pieces that have few edges between them). ============================================================================= When do we get rapid mixing? Ans #1: if have gap between largest and 2nd largest eigenvalue. Ans #2: when the Markov chain has high "conductance". [Note: this has nothing to do with the notion of "resistance" we discussed last time.] Def: * pi(S) = sum_{i in S} pi_i. (prob of being in S according to pi) * flow(S) = sum_{i in S, j not in S} w_ij, where w_ij = (pi_i)*(p_ij) = prob of being on edge i->j in pi. (measure of # edges leaving S). * Conductance. Let Phi(S) = flow(S)/pi(S). Conductance Phi is the minimum Phi(S) over all sets S of pi <= 1/2. Think of Phi(S) this way: If you started out at a random place in S (according to pi), what's the chance that you won't be in S in your next step? This is sum_{i in S, j not in S} Pr(was at i and went to j). This is sum_{i in S, j not in S} p_{ij} * pi_i / pi(S). This is flow(S) / pi(S). If graph has *low* conductance,then it's pretty clear it's *not* rapidly mixing. In particular, if you start out at a random point in the bad set S, it will take a long time to get out. More interesting thing is that if have high conductance, then it *is* rapidly mixing. Note: very similar to definition of an Expander graph. A graph is an epsilon-expander if for all sets S of <= n/2 nodes, |N(S) - S| > epsilon*|S|. Here is the main theorem (we won't give the proof): if diagonal entries are constant > 0 (e.g., self-loops) then Eigenvalue gap is Omega(conductance^2). In the other direction (also won't give the proof): Conductance is Omega(eigenvalue gap). Intuition for main theorem: Imagine giving each node a number according to eigenvector of 2nd largest eigenvalue, and pretend this is a distribution (it's not: the values all sum to 0). After 1 step in the walk, the fact that we have high conductance means there will be a lot of cancelling going on. (and self-loops make surewe don't just bounce between S and its complement.) ================================================== Application: Gibbs sampling from a posterior. Say you have some large set of hypotheses about some phenomenon. before you see any data, you might have some prior belief about which is true. Now, you see some data, and use Bayes rule: Pr(h | d) = Pr(d | h)*Pr(h) / Pr(d). Want to find h of highest posterior. Let's ignore the denominator (it's the same for each h) and say we know how to compute Pr(d | h) for any given h. Problem is: too many h's. So, how about this: can we at least randomly *sample* h's according to the posterior distribution? This is called "Gibbs sampling". From now on, let value(h) = posterior prob of h. Idea: set up some neighborhood structure and take a random walk using "Metropolis filter" (Metropolis is someone's name --- I used to think this was because people were thinking of this as a walk on a grid...) Specifically, let's say we set up a neighborhood structure so that each h has d-1 neighbors, plus add self-loop for aperiodicity, so a total of d neighbors. Now, say we're at h and our coin toss tells us to move to neighbor h'. If value(h') >= value(h) then we go there. Otherwise, we only go there with prob value(h')/value(h); otherwise we stay put. Claim is that this has the correct stationary distribution: Say at time t, prob at h = value(h) for all h. Then at time t+1, Pr(h) = Pr(got here by taking an edge) + Pr(got here by not taking an edge) = sum_{nbrs h' of h} value(h')*(1/d)*min{1,value(h)/value(h')} + sum_{nbrs h' of h} value(h)*(1/d)*max{0,1 - value(h')/value(h)}. = avg_{nbrs h' of h} [min{value(h'), value(h)} + max{0,value(h)-value(h')}] If h' has smaller value than h, then thing in [...] is value(h')+value(h)-value(h') = value(h). If h' has larger value than h, then thing in [...] is value(h)+0. Either way, we're averaging a bunch of things all of which are value(h). So, the stationary dist is just what we want. We just hope the markov chain has high conductance. Unfortunately, high conductance means not only that neighborhood structure is expanding, but also that values are nice. One place this happens [from Kalai-Vempala'00] is when h's are CRP investment strategies. value(h) means how much money we would have had if we had used that investment strategy from the start. Goal is pick random strategy in proportion to value. Here, the natural neighborhood structure has the property that similar strategies have similar values. So this all works.