15-859(B) Machine Learning Theory 01/17/02 * Mistake-bound model - simple results - relation to consistency model - Halving and Standard Optimal algorithm ======================================================================= Mistake-bound model ------------------- The consistency model had the problem that a positive result didn't necessarily imply finding a rule with good "predictive power" in any sense. The Mistake-Bound model addresses this problem by explicitly modeling learning as an on-line process. In the MB model, learning is in stages. In each stage: 1. Learner gets unlabeled example. 2. Learner predicts classification. 3. Learner is told correct answer. Goal: bound the total number of mistakes. Definition: An algorithm A learns C in the mistake bound model if for any sequence of examples consistent with some concept in C, the total number of mistakes ever made by A is bounded by some polynomial in n and s, where n = size of an example s = size (description length) of target function (s is only important if C contains concepts with description length more than polynomial in n, like DNF or decision trees) Definition: C is *efficiently* learnable in the mistake-bound model if there exists an algorithm A that learns C, where the running time of A per stage is also polynomial in n and s. Conjunctions: Start: h = x_1 \Not{x}_1 x_2 \Not{x}_2 ... x_n \Not{x}_n. Mistake on positive: remove from h all literals not satisfied. Mistake on negative: say ``no consistent conjunction.'' Guarantee: {literals in h} contains {literals in target}. The first mistake removes n literals, and each subsequent mistake removes at least 1, which implies that # mistakes <= n+1. -------------------------- Disjunctions: --> same thing just reverse positive and negative. -------------------------- Lower bound of n: We've seen an algorithm that makes at most n+1 mistakes. In fact, *no* deterministic algorithm can guarantee a mistake bound less than n. Easiest to see with disjunctions: Imagine seeing the following sequence of examples: 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 1 then, *any* labeling of these is consistent with some concept in C, so for any algorithm, there exists a sequence of examples forcing it to make n mistakes. (And if we imagine labeling them with a random OR-function, can see that even a randomized algorithm has to make n/2 mistakes in expectation). [this in some ways foreshadows the notion of "VC-dimension" that we will get to later...] Relation to consistency model ----------------------------- Theorem: if class C is learnable in the MB model by a poly-time algorithm A that uses hypotheses in C, then C is learnable in the consistency model. Proof: Given set S of data, run algorithm through the set over and over until it makes a pass with no mistakes. Let's assume the algorithm is "conservative", meaning that it only changes its hypothesis when it makes a mistake. Then, if it makes a pass with no mistakes, we have a consistent hypothesis. On the other hand, if it keeps making mistakes and exceeds its mistake bound, then we know there is no consistent hypothesis. What about "non-conservative" algorithms? --> can reset its state after seeing a non-mistake --> or, can just use its hypothesis to see if there is any inconsistent example and then feed that example in. Note: the other direction is not necessarily so easy, if we want our algorithm to be efficient. You can construct weird concept classes where the consistency problem is can be solved easily, but achieving a polynomial mistake bound requires being able to invert a 1-way function (e.g., like factoring). Even for natural concept classes, going in the other direction is not necessarily very obvious. E.g., we saw last time how to learn decision lists in the consistency model. On your hwk you will show how to do that in the MB model. One more example: * each example is a number in range: 0 to 2^n-1. (think of as binary). * C = { [0,a] : a < 2^n} (i.e., C = initial sub-intervals). (e.g., imagine that you are observing some device that fails when it gets too hot, and each day you write down current temperature and whether device is working right or not) Simple alg for consistency model: choose concept [0,b] where b is the largest positive example seen. Then verify that no negative example is < b. - Does this work for MB model? - What would work for MB model? use current hypothesis as [0,b] where b is halfway between largest positive and smallest negative example. Cuts down unknown region in half on every mistake. At most n mistakes. ======================================================================== Halving algorithm ----------------- If we don't care about computation time, we can generalize the above algorithm to learn an *arbitrary* concept class C with at most lg(|C|) mistakes. HALVING ALGORITHM: predict using majority vote over all concepts in C consistent with past data. E.g., there are 3^n conjunctions, so this makes at most O(n) mistakes. Notice that it takes lg|C| bits to describe a concept in C (if you use the same number of bits per concept), so we are making at most 1 mistake per bit. What about concept classes with both simple and complex functions? E.g., decision trees. If we assume that our description language is a "prefix code" (e.g., like a huffman code: think of a binary tree with concepts at the leaves), then you can modify the halving algorithm to make only s mistakes where s is the number of bits it takes to write down the actual target function. All you do is change the algorithm to take a weighted vote, where each function f is given weight 1/2^{size(f)}. The reason this works is that at the start, the total weight of all functions is at most 1 (think about the binary tree). Then each mistake removes at least half of the weight. In the end, the final weight has to be at least 1/2^s. So, that proves it. Question: is halving algorithm optimal? Say you have unlimited computation time, does this have the best possible mistake bound? Surprisingly, the answer is no. On-line learning as a 2-person game ----------------------------------- If we ignore the issue of computational efficiency. we can think of learning in the mistake-bound model as a 2-person game. In particular, for a deterministic algorithm, the game goes like this: 1. The opponent selects some example x, which splits C into two sets: C_0(x) is the set of concepts that label x negative, and C_1(x) is the set of concepts that label x positive. 2. The algorithm gets to pick one of C_0(x) and C_1(x) to throw away (by predicting postive and making a mistake it throws out C_0(x), or by predicting negative and making a mistake it throws out C_1(x)) 3. If only one concept left, then done (view two semantically equivalent concepts as the same). Else, go back to step 1. The value of this game (to the opponent) is the number of rounds the game is played. Looking at the problem this way, we can see a couple of things. First, there is a well-defined notion of optimal strategies for each player and second, there is a well-defined number OPT(C) that is the optimal mistake bound for concept class C. What is optimal strategy? Given an example x, we ``just'' calculate OPT(C_0(x)) and OPT(C_1(x)) (by applying this idea recursively), and throw out whichever set has larger mistake bound. Another way to look at it is that OPT(C) is defined recursively as: If |C| = 1 then OPT(C) = 0. Else OPT(C) = 1 + max( min( OPT(C_0(x)), OPT(C_1(x)) ) ) x This algorithm is often called the "Standard optimal algorithm" (more discussion in Littlestone paper). Notice that this is {\em not} necessarily the same thing as the Halving Algorithm. Can anyone think of a case where halving isn't optimal? Another thing to think about is what if there isn't any perfect target function in C? One thing we might hope to do is perform comparably to the *best* function in C in hindsight. Next class we'll see a really neat algorithm for doing this that is like a flexible version of the halving algorithm: instead of throwing out the inconsistent functions, you just give them a reduced vote. By doing this right we'll be able to get a bound that looks like: log(|C|) * (1/epsilon) + (1 + epsilon)*M_best where M_best is the number of mistakes of the best function in C in hindsight.