15-859(B) Machine Learning Theory 03/12/02 Membership Query learning, contd * KM (GL) algorithm for finding large Fourier coefficients * Fourier spectrum of DNF * Bshouty's algorithm for learning decision trees and XORs of terms ========================================================================= KM/GL ALGORITHM: See notes from last time. On the Fourier Spectrum of DNF formulas ======================================= THEOREM: any DNF of s terms has some fourier coefficient of magnitide at least 1/(4s), over the parity basis. PROOF: First, let's say v1 <= v2 if v1 <= v2 in the natural partial order (i.e., v1 is an indicator vector for a subset of v2's set). Say T is a term of f, and V is the set of variables used in T. Claim: sum_{v <= V} |hat{f}_v| >= 1 Why is this good enough: notice that if a term has L literals in it, then the probability it is satisfied by a random example is 1/2^L. So, if all the terms of f have size >= log_2(4s), then the probability that a random example is positive for f is at most 1/4. This means that "all negative" is a weak hypothesis. On the other hand, if some term of f has size < log_2(4s) then there are at most 4s values being summed above, with a total sum of 1, so at least one of them has to be at least 1/4s. Proof of Claim: Define the function T(x) as: T(x) = 1 if x satisfies term T and T(x) = 0 otherwise. This is a little weird since it is not a boolean function according to our definition ({-1, +1}). But it is still a legal function and will be helpful. Let L = # of literals in T. What does the fourier decomposition of T look like: hat{T}_v = E[T(x)*phi_v(x)] = 0 if v is not <= V. (by usual parity argument) = (1/2^L)*phi_v(T) if v <= V. This is because T(x)=0 unless x satisfies T, in which case we have 1*phi_v(T). (phi_v(T) is the value of phi_v on an example satisfying T) Now, here's the trick: f(x)T(x) = T(x), because T is a term of f. So, clearly E[f(x)T(x)] = E[T(x)]. The right-hand-side of the above equation is 1/2^L. The left-hand-side of the above equation is the dot product of f and T. Using the fourier basis, this is sum_v hat{f}_v * hat{T}_v = (1/2^L) * sum_{v <= V} hat{f}_v * phi_v(T). So, sum_{v <= V} hat{f}_v * phi_v(T) = 1. This gives us the claim. ========= This means we can weak-learn DNF over the uniform distribution (with membership queries). Jackson then combines this with a controlled boosting to get strong-learning. ============================================================================ [probably won't finish this this time] Fourier techniques are great for learning over the uniform distribution. But what about arbitrary distributions? It turns out you can show that an adversarial distribution can make MQs worthless for learning DNF, if crypto-hard functions exist (a similar statement is that modulo some crypto assumptions, I could construct a 3-SAT formula such that even if you ask me for a bunch of satisfying assginments, you still can't construct a new one that I haven't shown you yet). But, what about DECISION TREES? THis was open for a long time. Solved by Bshouty. Here's the more recent version of Bshouty's algorithm. You'll see some similarities to KM (and some vague similarities to a problem on hwk3). In fact, algorithm works for a larger class: "XOR of terms". We'll assume target can be written as f = T_1 XOR T_2 XOR ... XOR T_m, where T_i are conjunctions. Why is this a generalization of DTs? We're going to solve by building a decision-tree-like-thing (a "parity decision graph"?) of depth n and width O(m). So, by Occam results, just need to take some initial sample S of size O(nm/epsilon), and then use our membership queries to find a consistent hypothesis of size O(nm). Idea: let's just build a decision tree with x_1 at the root, then x_2 at the next level, etc., until we get to > m leaves. This will happen at some depth d = log(m). Define T_1', ..., T_m' to be what you get by taking the original terms T_1,...,T_m and removing all mention of variables x_1,...,x_d from them. So, if we look at all leaves of our tree so far, the value of the target function, given that the example takes you to that leaf, can be written as an XOR of some subset of the T' terms. For example, say f = x_1x_3x_5 XOR not(x_1)x_4x_6 XOR x_2x_3not(x_4) and say d=2. Then the leaves look like: (T1' XOR T3'), T1', (T2' XOR T3'), T2'. Now, the claim is that since there are > m leaves, some of these must be XORs of others. In fact, there must exist some set L of at most m leaves, such that any other leaf is an XOR of some subset of leaves in L. E.g., above any leaf is the XOR of the other three leaves. So the plan is now: 1. (somehow) find such a set L, and write down for other leaves how they can be written as an XOR of leaves in L. (oops: the description length of a level could be as high as m^2, so let's have S be of size nm^2/epsilon to be safe....) 2. keep building the tree down from leaves in L until have more than m leaves again, and repeat. 3. (somehow) understand what this means in terms of a final hypothesis. Let's do 1: For each example, vary the first d bits (using MQs) to make the example reach all active leaves. ==> Get a vector of outputs. I.e., what would the output have been if first d bits were set to take you to leaf i. If we write this out for all the examples, we get a big matrix with 1 roew per example, where the set L corresponds to a set of <= m columns such that every other column is some linear combination of these (mod 2). Find these with Gaussian elimination. (What's a little confusing here is: say this tells us that leaf 1 is the XOR of leaves 2,3,5. We don't really know if this is true over ALL of {0,1}^n. We just know it's true over examples with suffixes matching S. But, that's OK since our goal is just to find a hypothesis consistent with S.) To do 3: Follow example down the tree. When you get to an "inactive" node, then replace position by the *set* of positions listed in the node. So, our "current location" in the tree is now a set of nodes. More generally, if our current set of nodes is N and say node y in N is inactive, then look at set (N - {y}) XOR (positions listed at y). Finally, as we go down the tree, the T' terms will become constant 1 or 0, and we just take the XOR. Why is this legal? We can see this by arguing bottom-up: Say (as above) that L is the set of active nodes at the current level, and inductively assume that the trees rooted at nodes of L are correct on all examples produced by taking some x in S and replacing its prefix by whatever is needed to reach that node. Then (by construction) the inactive nodes (the ones not in L) are also correct on examples produced by taking some x in S and replacing its prefix with whatever was needed to reach them, too. Therefore, all *parents* of this level are correct on examples produced by taking some x in S and replacing prefixes. And so on.