15-859(B) Machine Learning Theory                              02/26/02

* More on learning with random noise
* Statistical Query model cont.
* Characterizing what's learnable in SQ model via Fourier analysis
========================================================================

- recap noise model.
- compare to Gaussian generative model?
- random noise at least allows for arbitrary D.  Also, even though in
  the real world you won't have this kind of independence, at least
  it's an "unbiased" type of assumption.
- nice project: try to look at variants of noise models.
- Kearns's neat trick: if we can learn C by C without noise, and can
  learn C by H with noise, how to learn C by C with noise.  E.g., the
  case of LTFs.
- SQ defn recap.
- why it's OK to imagine asking for a [0,1]-valued property.
- viewing neural nets as SQ.  Viewing Decision Trees as SQ.
- then do SQ ==> PAC + CN.
- mention that the converse is not true.
- then start on the next topic, which is....

Characterization of SQs using Fourier analysis
----------------------------------------------
[I don't expect we will finish this today]

- characterization of "what's learnable" in terms of the number of
  pairwise nearly uncorrelated functions.

- show why small SQ-dim implies weak learnability over that D.  Say m
  is the SQ-dim of C under D.  Algorithm for given C, D (a toy code
  sketch follows below):

  * Store a maximal set H of fns s.t. for all f, g in H,
    |Pr[f(x) != g(x)] - 1/2| < 1/(m+1).  We know |H| < m+1.
  * To learn, ask for Pr[c(x) = h(x)] for each h in H, with
    tau = 1/(4(m+1)).  There must be some h such that the correct
    answer differs from 1/2 by at least 1/(m+1) (otherwise H u {c}
    would contradict the maximality of H).  So, just keep asking until
    we find one whose response differs from 1/2 by at least 3/(4(m+1)),
    and output either h or not(h) as appropriate.

  (Note: for this to imply PAC learning for all D, we need small SQ-dim
  for all D, and we also need to be able to construct the hypotheses in
  poly time.)
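To make the loop concrete, here is a toy sketch in Python.  Everything
in it is illustrative, not from the lecture: stat_query is a stand-in
for the SQ oracle (a real oracle may answer with ANY value within tau
of the truth, even adversarially), and the domain is tiny so we can
compute exact statistics.

    import itertools
    import random

    n = 4
    domain = list(itertools.product([0, 1], repeat=n))  # uniform D on {0,1}^n

    def stat_query(c, h, tau):
        # Stand-in SQ oracle: the true value of Pr[c(x) = h(x)] under D,
        # perturbed by up to tau.  (A legal oracle may answer with any
        # value within tau of the truth.)
        true = sum(c(x) == h(x) for x in domain) / len(domain)
        return true + random.uniform(-tau, tau)

    def weak_learn(H, c, m):
        # H: hypotheses ({0,1}^n -> {0,1}) pairwise satisfying
        # |Pr[f(x) != g(x)] - 1/2| < 1/(m+1), as in the algorithm above.
        tau = 1.0 / (4 * (m + 1))
        for h in H:
            p = stat_query(c, h, tau)
            if abs(p - 0.5) >= 3.0 / (4 * (m + 1)):
                # h (or its negation) beats random guessing by 1/poly
                return h if p > 0.5 else (lambda x, h=h: 1 - h(x))
        return None   # can't happen if H is maximal and c is in the class

Any h that passes the threshold test has a true gap from 1/2 of at
least 3/(4(m+1)) - tau = 1/(2(m+1)), so whatever we output is a
legitimate weak hypothesis.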
- Parity, Decision Trees, and DNF: The classes of poly-size DTs and
  DNFs are not learnable under the uniform distribution in the SQ
  model, since these classes each contain n^{Omega(log n)} pairwise
  uncorrelated functions, namely the parity functions having at most
  log(n) relevant variables.  We will prove: if there is a
  superpolynomial number of pairwise uncorrelated functions, then we
  can't even weakly learn in the SQ model.  (We are changing "nearly
  uncorrelated" to "uncorrelated" to make it cleaner.)

Fourier analysis of finite functions
------------------------------------
Fancy name for something not inherently complicated.  But it *is*
tricky conceptually.  I think it's neat, so *please* stop me when it
gets confusing....

First of all, let's think of positive as +1 and negative as -1, so a
boolean concept is a function from {0,1}^n to {-1, +1}.

Think of a function f as a vector of 2^n elements, one for each input.
We'll associate with f the vector (written here for n = 3):

    ( sqrt(Pr(000))*f(000), sqrt(Pr(001))*f(001), ..., sqrt(Pr(111))*f(111) )

In other words, this is the truth table of f, but weighted by
sqrt(Pr(x)).

What is the (L^2) length of f as a vector?  f.f = sum_x Pr(x)f(x)^2 = 1,
since f(x)^2 = 1.  So f is a unit vector.

What is f.g?

    f.g = sum_x Pr(x)f(x)g(x) = E_D[f(x)g(x)]
        = Pr(f(x) = g(x)) - Pr(f(x) != g(x)).

Let's call this the correlation of f and g with respect to the
distribution D.  f.g = 0 means that f and g are uncorrelated.  They are
orthogonal as vectors.

So, this is a nice way of looking at a set of functions.  Each function
is a vector in this (2^n)-dimensional space, and the dot product
between two vectors tells us how correlated they are.  "Fourier
analysis" is just a fancy way of saying that we want to talk about what
happens when you make a change of basis.

An "orthonormal basis" is a set of orthogonal unit vectors that span
the space.  For instance, in 2-d, let x' be the unit vector in the x
direction and y' be the unit vector in the y direction.  Then a vector
v = (3,2) is 3*x' + 2*y'.  Say we have two other orthogonal unit
vectors a and b.  Then we could also write v in terms of a and b as:

    v = (v.a)a + (v.b)b

This is called a change of basis.

In our 2^n-dimensional space, an orthonormal basis is a set of 2^n
orthogonal unit vectors.  For instance, if D is the uniform
distribution, then the parity functions are one possible orthonormal
basis.  Let's fix one: phi_1, phi_2, ..., phi_{2^n}.

Given a vector f, let f_i be its ith coordinate, i.e., the ith entry in
the standard basis.  Then

    \hat{f}_i = f \cdot \phi_i

is the ith entry in the phi basis.  For instance, we could write f as:

    f = \sum_i \hat{f}_i \phi_i

The \hat{f}_i are called the "Fourier coefficients of f" in the phi
basis.  Since, as a vector, we have f = \sum_i \hat{f}_i \phi_i, this
also means that as functions,

    f(x) = \sum_i \hat{f}_i \phi_i(x)

The reason is that the left-hand side is the xth coordinate of f,
divided by sqrt(Pr(x)).  So, all we're saying is that if one vector is
the sum of a bunch of other vectors, then the xth coordinate of the
vector is the sum of the xth coordinates of the other vectors.

So, nothing fancy so far, but here's one neat thing that comes out of
the point of view we're using: since f is a unit vector, the sum of the
squares of its coefficients is equal to 1.  So,

    \sum_i (\hat{f}_i)^2 = 1.

(If you want to be fancy, you could call this fact "Parseval's
identity".)

One thing this implies is: at most t^2 of the \hat{f}_i can be greater
than 1/t in absolute value.  In other words, any boolean function f can
be weakly correlated with at most a polynomial number of the phi's.
More generally, if we have some set S of pairwise uncorrelated
functions, then f can be weakly correlated with at most a polynomial
number of functions in S (since we can extend this set arbitrarily into
an orthonormal basis).
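A quick sanity check of the last two claims in code (my own toy sketch,
not from the lecture; parity and corr are just illustrative names):
under the uniform distribution, the 2^n parity functions
phi_S(x) = (-1)^{sum_{i in S} x_i} form an orthonormal basis, and the
squared coefficients of any {-1,+1} function sum to 1.

    import itertools

    n = 3
    domain = list(itertools.product([0, 1], repeat=n))  # uniform D on {0,1}^n
    subsets = [S for k in range(n + 1)
               for S in itertools.combinations(range(n), k)]

    def parity(S, x):
        # phi_S(x) = (-1)^(sum of x_i over i in S), a {-1,+1} function
        return (-1) ** sum(x[i] for i in S)

    def corr(f, g):
        # f.g = E_D[f(x)g(x)] under the uniform distribution
        return sum(f(x) * g(x) for x in domain) / len(domain)

    # orthonormality: corr(phi_S, phi_T) is 1 if S == T and 0 otherwise
    for S in subsets:
        for T in subsets:
            want = 1.0 if S == T else 0.0
            assert abs(corr(lambda x: parity(S, x),
                            lambda x: parity(T, x)) - want) < 1e-9

    # Parseval: the squared coefficients of any {-1,+1} function sum to 1
    f = lambda x: 1 if sum(x) >= 2 else -1    # e.g., the majority function
    coeffs = [corr(f, lambda x: parity(S, x)) for S in subsets]
    assert abs(sum(c * c for c in coeffs) - 1.0) < 1e-9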
Here is the point.  Suppose we ask a query "what is the correlation of
the target with my hypothesis h?" with some tau = 1/poly.  Then if the
target function is chosen randomly from some superpolynomial-sized set
of pairwise uncorrelated functions (like the parity functions), then
w.h.p. the correct answer is smaller than tau, so a legal response is
0.  (Each question throws out only a polynomial number of
possibilities.)

Or, to think about what this means going back to the PAC-style setting:
if you have a fixed set of data and you're doing hill-climbing, then
each time you do a check on your current error, w.h.p. the correct
answer is extremely close to 1/2, and all the variation is due to the
variation in the data sample.  It will take more than a polynomial
number of steps to stumble on a good query.

It turns out that ANY statistical query can basically be viewed in this
way.  Proof:

Step 1: Say the set S of orthogonal functions is phi_1, ..., phi_m.
Let's extend this arbitrarily to an orthonormal basis
phi_1, ..., phi_m, ..., phi_{2^n} (only the first m of these are
functions in our concept class).  Instead of saying a statistical query
is a 0/1 function, let's make it a -1/+1 function.  So, a query chi
maps {0,1}^n x {-1,+1} (example, label) -> {-1, +1}.  We're asking for
E_D[chi(x, c(x))], where c is one of the phi's (phi_target).  We want
to think of this as asking for a correlation.

Step 2: One way that chi is different from a hypothesis is that it is a
function of the label too.  So, to handle this technical difficulty,
let's extend our setting to the domain {0,1}^n x {-1, +1}.

 * Define the distribution D' = D x (uniform on {-1, +1}).
 * Define phi_i(x,l) = phi_i(x).  These are still orthogonal:
   Pr_{D'}[phi_i(x,l) = phi_j(x,l)] = Pr_D[phi_i(x) = phi_j(x)] = 1/2
   for i != j.
 * We need 2^n more basis functions.  Let's define
   h_i(x,l) = phi_i(x)*l.  We need to verify this is OK:
   - Check that h_i and h_j are orthogonal (for i != j):
     Pr[h_i(x,l) = h_j(x,l)] = Pr[phi_i(x)*l = phi_j(x)*l]
                             = Pr[phi_i(x) = phi_j(x)] = 1/2.
   - Check that h_i and phi_j are orthogonal:
     Pr[h_i(x,l) = phi_j(x,l)] = (1/2)Pr[phi_i(x) = phi_j(x)]
                               + (1/2)Pr[-phi_i(x) = phi_j(x)] = 1/2.
     Even if i = j we get 1/2.

Step 3: Do a Fourier decomposition of chi:

    chi = sum_i alpha_i*phi_i + sum_i beta_i*h_i,

where sum_i (alpha_i)^2 + sum_i (beta_i)^2 = 1.  So, we can write the
quantity we're asking about as

    E_D[chi(x, c(x))] = E_D[ sum_i alpha_i*phi_i(x) + sum_i beta_i*h_i(x, c(x)) ]
                      = sum_i alpha_i*E_D[phi_i(x)] + sum_i beta_i*E_D[c(x)phi_i(x)]

using h_i(x, c(x)) = phi_i(x)*c(x).  The first term is some constant
offset that depends on the query but is independent of the target
function.  What about the second term?  Well, since c = phi_target is
one of the phi's, and since they are orthogonal, that expectation is 0
except for i = target.  So, the second term is just beta_target.  And
since at most 1/tau^2 of the betas can be larger than tau in absolute
value, and the target was chosen randomly from a superpolynomial-sized
set, w.h.p. |beta_target| < tau.  So, w.h.p. we can respond with just
the first part, which is independent of the target function.  That's
it.
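To see Steps 2 and 3 concretely, here is a toy numeric check (again my
own sketch; the particular query chi is arbitrary, and names like alpha
and beta just mirror the notes).  It decomposes a query over the
extended domain and confirms that the only target-dependent part of the
answer is the single coefficient beta_target.

    import itertools

    n = 3
    domain = list(itertools.product([0, 1], repeat=n))      # D on {0,1}^n
    ext_domain = [(x, l) for x in domain for l in (-1, 1)]  # D' = D x unif{-1,+1}
    subsets = [S for k in range(n + 1)
               for S in itertools.combinations(range(n), k)]

    def parity(S, x):
        # phi_S(x), the orthonormal basis under the uniform distribution
        return (-1) ** sum(x[i] for i in S)

    def ext_corr(u, v):
        # dot product (correlation) under D'
        return sum(u(x, l) * v(x, l) for (x, l) in ext_domain) / len(ext_domain)

    # an arbitrary {-1,+1}-valued statistical query chi(x, label)
    chi = lambda x, l: 1 if (x[0] == 1 and l == 1) or (x[1] != x[2]) else -1

    target = (0, 2)                    # the target c = phi_target, a parity
    c = lambda x: parity(target, x)

    # Fourier coefficients of chi in the extended basis {phi_S} u {h_S}
    alpha = {S: ext_corr(chi, lambda x, l: parity(S, x)) for S in subsets}
    beta = {S: ext_corr(chi, lambda x, l: parity(S, x) * l) for S in subsets}

    # E_D[chi(x, c(x))] == sum_S alpha_S * E_D[phi_S(x)] + beta_target
    lhs = sum(chi(x, c(x)) for x in domain) / len(domain)
    offset = sum(alpha[S] * sum(parity(S, x) for x in domain) / len(domain)
                 for S in subsets)
    assert abs(lhs - (offset + beta[target])) < 1e-9

Under the uniform distribution the offset term is just the coefficient
on the empty parity, so every statistical query really does reduce to
one correlation with the target plus a target-independent constant.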