15-859(A) Machine Learning Theory                                   04/13/04
* Margins, Kernels, luckiness functions.
=====================================================================

We've seen a lot of results on learning linear separators. Today, we'll
talk about a few more, plus something called the "kernel trick" for using
linear-separator algorithms when your data doesn't actually have a good
linear separator.

Learning linear separators
==========================
Given data in R^n, labeled +,-. Want to find a linear separator. Can do
this with a number of algorithms, including linear programming. Notice
that this includes decision lists, so pretty much everything we've seen
can be transformed into a linear separator problem, with a suitable
definition of features (e.g., for k-decision lists).

Some things we've seen:

- VC-dim of linear separators in n-dim space is n+1. So this says that
  O((1/epsilon)[n*log(1/epsilon) + log(1/delta)]) examples are sufficient
  for learning.

- But, we've seen that if there's a large margin, we can get away with
  less. Let's assume the target is w^* . x > 0, that it has zero error,
  and that |w^*| = 1. Given a sample S, define the margin to be:

      gamma = min_{x in S} |w^* . x|/|x| = min_{x in S} |cos(angle(w^*,x))|

  The Perceptron algorithm makes at most 1/gamma^2 mistakes. This means,
  using our "online->batch" conversions, we just need a training set of
  size O((1/epsilon)[M + log(1/delta)]), where M = mistake bound = 1/gamma^2.

- Old view: the perceptron alg is not so good, since gamma can be
  exponentially small in n (as in the case of decision lists). New view:
  on the other hand, gamma can often be fairly large ("the large margin
  case"), in which case this is a better bound: it depends only on the
  margin, not on the dimension!!

More about margins
==================
- We've seen one reason why having a large margin is a good thing: it
  allows the perceptron algorithm to work from a small sample. It turns
  out another thing we can say is that whp, *any* separator with a large
  margin over the data will have low error. Will get back to this in a
  minute. This suggests an alternative algorithm: first scale all examples
  to have length 1, and then solve for w with |w| <= 1 such that positives
  have w.x >= gamma and negatives have w.x <= -gamma, maximizing gamma
  (this is a convex optimization problem). The result is called a "maximal
  margin hyperplane", also a "support vector machine".

- Often people instead fix gamma = 1 and minimize w.w.

- [Aside: what if it's not possible to separate? One thing you can do is
  add slack variables: replace gamma with "gamma - epsilon_i" on the ith
  constraint, requiring epsilon_i >= 0, and then maximize
  gamma - alpha * sum_i epsilon_i, where "alpha" trades off the two
  quantities. This is what we talked about last time.]

Sample complexity analysis
==========================
First thing to show: what is the maximum number of points that can be
split in all possible ways by a separator of margin at least gamma?
(a.k.a. the "fat-shattering dimension"). Answer: O(1/gamma^2).

Proof: a simple proof is just to consider the perceptron algorithm.
Suppose the gamma-fat-shattering dimension is d. Then, since every
labeling of those d points is achievable by a margin-gamma separator, an
adversary can force the perceptron alg to make d mistakes and yet still
have a separator w^* of margin gamma. But we know the number of mistakes
is at most 1/gamma^2, so d <= 1/gamma^2. So, that's it.
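To see the mistake bound used above in action, here is a minimal sketch in
Python/NumPy (the toy data, the target w_star, the margin cutoff of 0.1,
and the start from w = 0, a standard variant of the update rule, are all
arbitrary choices for this illustration): run the perceptron until a full
pass makes no mistakes and compare the total mistake count to 1/gamma^2.

    import numpy as np

    def perceptron_mistakes(X, y):
        """Run the perceptron until a full pass makes no mistakes; return
        the total number of mistakes and the final weight vector."""
        w = np.zeros(X.shape[1])
        mistakes = 0
        made_mistake = True
        while made_mistake:
            made_mistake = False
            for x, label in zip(X, y):
                if label * np.dot(w, x) <= 0:   # mistake (or no prediction yet)
                    w += label * x              # perceptron update
                    mistakes += 1
                    made_mistake = True
        return mistakes, w

    # Toy data: unit-length points labeled by a unit-length target w_star,
    # keeping only points with margin at least 0.1 (an arbitrary cutoff).
    rng = np.random.default_rng(0)
    w_star = np.array([3.0, 4.0]) / 5.0
    X = rng.normal(size=(500, 2))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    X = X[np.abs(X @ w_star) >= 0.1]
    y = np.sign(X @ w_star)
    gamma = np.min(np.abs(X @ w_star))          # margin of w_star on this sample

    mistakes, _ = perceptron_mistakes(X, y)
    print(f"mistakes = {mistakes}, bound 1/gamma^2 = {1 / gamma**2:.1f}")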
Second part: now we want to apply the fat-shattering bound to get a
sample-complexity bound. It seems like the analysis we used for
VC-dimension should just go right through, but it's actually not so clean.
Let's do the analysis and just point out the one part that's messy, and
give a citation for the thing we need.

Analysis: Draw 2m points from D. We want to show it is unlikely there
exists a separator that gets the first half correct by margin gamma, but
has more than epsilon*m/2 mistakes on the second half. [This then implies
the conclusion we want, by the same reasoning as when we argued the VC
bounds.] As in the VC proof, we will show a stronger statement: for *any*
set S of size 2m, whp this is true over the randomization of the split
into two pieces S1, S2 of size m each.

In the VC argument, we said: fix some h that makes at least epsilon*m/2
mistakes. We showed that Prob(all mistakes are on the 2nd half) is at most
2^{-epsilon*m/2}, and then applied a union bound over all labelings of the
data by functions h in C. For us, it's tempting to say "let's count the
number of separators of S with margin gamma over all of S", but this might
be undercounting: what about separators that only have margin gamma on S1?
Instead, we'll do the following more complicated thing. First, let's
assume all |x| = 1. Now let's group the separators together. Define
h(x) = h.x but truncated at +/- gamma, and define
dist_S(h1,h2) = max_{x in S} |h1(x) - h2(x)|. We want a "gamma/2-cover": a
set H of separators such that every other separator is within gamma/2 of
some separator in H. The claim is that there exists an H that is not too
large, as a function of the fat-shattering dimension [Alon et al]. Now,
for the separators in H, define "correct" as "correct by margin at least
gamma/2" and define "mistake" as "mistake OR correct by less than
gamma/2". Our standard VC argument shows that whp, none of these gets S1
all correct and yet makes > epsilon*m/2 "mistakes" on S2. This then
implies (by definition of H) that whp *no* separator gets S1 correct by
margin >= gamma and has > epsilon*m/2 real mistakes on S2.

More about margins
==================
Here is yet another way to see why having a large margin is a good thing:
if you take two vectors at angle alpha to each other and apply a *random*
linear projection down to a space of dimension O(k^2 * log(1/delta)), then
with prob 1-delta, the angle between the projected vectors will be
alpha +/- 1/k. This is a Johnson-Lindenstrauss lemma argument.
Implication: if you have a set S of points that are separable with margin
gamma, and you randomly project down to a space of dimension
O((1/gamma^2) log(|S|/delta)), then whp they will still be separable (in
fact, separable by margin gamma/2 if we put an extra "4" into the O(..)).
So it really wasn't that high-dimensional of a problem after all.
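Here is a minimal numerical illustration of that implication in
Python/NumPy (the synthetic data construction, the margin gamma = 0.2,
delta = 0.05, and the constant 8 in the projected dimension are arbitrary
choices for this sketch, not values from the lecture): build unit-length
points with margin gamma, project randomly, and look at the new margin.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, gamma, delta = 5000, 200, 0.2, 0.05

    # Unit-length points in R^n with margin exactly gamma w.r.t. a unit w_star.
    w_star = rng.normal(size=n)
    w_star /= np.linalg.norm(w_star)
    labels = rng.choice([-1.0, 1.0], size=m)
    G = rng.normal(size=(m, n))
    G -= np.outer(G @ w_star, w_star)           # directions orthogonal to w_star
    G /= np.linalg.norm(G, axis=1, keepdims=True)
    X = gamma * labels[:, None] * w_star + np.sqrt(1 - gamma**2) * G

    # Random linear projection down to d = O((1/gamma^2) log(m/delta)) dimensions.
    d = int(np.ceil((8 / gamma**2) * np.log(m / delta)))
    P = rng.normal(size=(n, d)) / np.sqrt(d)

    Xp, wp = X @ P, w_star @ P
    proj_margin = np.min(labels * (Xp @ wp) /
                         (np.linalg.norm(wp) * np.linalg.norm(Xp, axis=1)))
    print(f"dimension {n} -> {d}; margin {gamma} -> {proj_margin:.3f}")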
Kernels
=======
What if your data doesn't have a good linear separator? Here's a neat
idea, called the "kernel trick". One thing we might like to do is map our
data to a higher-dimensional space, e.g., look at all products of pairs of
features, in the hope that the data will be linearly separable there. If
we're lucky, it will be separable by a large margin, so we don't have to
pay a lot in terms of data. But this is going to be a pain
computationally. However, one thing we notice is that most learning
algorithms only access the data through dot-products (we'll get back to
how to interpret algs like the perceptron in this way in a minute). So,
maybe we can do our mapping in such a way that we have an efficient way of
computing dot-products. This leads to the idea of a kernel.

A kernel is a function K(x,y) such that for some mapping phi,
K(x,y) = phi(x) . phi(y). Some examples:

- K(x,y) = (1 + x.y)^d  [corresponds to mapping to all products of up to
  d features, with appropriate scaling]

- K(x,y) = (1 + x_1*y_1)(1 + x_2*y_2)...(1 + x_n*y_n)  [corresponds to
  mapping x,y to the list of all products of subsets of features]

- String kernels [count how many substrings of length p two strings have
  in common]

More generally, kernels are nice for the case where examples aren't so
easy to map directly into R^n, but we have a reasonable notion of
similarity between examples that we can encode in a kernel K.

Neat fact: most learning algorithms can be run using kernels. E.g., the
perceptron algorithm:

    Initialize w = first positive example.
    If w.x < 0 on a positive x, let w = w + x.
    If w.x > 0 on a negative x, let w = w - x.

How to kernelize? Notice that w is just a weighted sum of examples, e.g.,
w = x1 - x2 + x3, where x1, x2, x3 are examples we've seen so far. So in
the phi-space the hypothesis is w = phi(x1) - phi(x2) + phi(x3), and to
compute w . phi(x) we just do: K(x1,x) - K(x2,x) + K(x3,x). The examples
that the hypothesis is written in terms of are called "support vectors".
If you find the maximum margin separator for a given dataset, that is also
something that can be written in terms of support vectors (not hard to
see). That's the reason for the name "support vector machines".
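Here is a minimal sketch of a kernelized perceptron in Python/NumPy (the
quadratic kernel, the circle-shaped toy data, the epoch count, and the
start from an empty hypothesis, a standard variant of the update above,
are placeholder choices for this illustration): the hypothesis is stored
as a list of (coefficient, example) pairs, and every prediction goes
through K.

    import numpy as np

    def K(x, y, d=2):
        """Polynomial kernel (1 + x.y)^d; any kernel K(x,y) = phi(x).phi(y) works."""
        return (1.0 + np.dot(x, y)) ** d

    def predict(support, x):
        """Compute w.phi(x), where w = sum_i c_i phi(x_i) is stored as (c_i, x_i) pairs."""
        return sum(c * K(xi, x) for c, xi in support)

    def kernel_perceptron(X, y, epochs=10):
        support = []                                # the "support vectors" so far
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * predict(support, xi) <= 0:  # mistake: add this example to w
                    support.append((yi, xi))
        return support

    # Toy data: not linearly separable in R^2, but separable with a margin under
    # the quadratic feature map, since the true separator is a circle.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(400, 2))
    r2 = (X ** 2).sum(axis=1)
    keep = (r2 < 0.3) | (r2 > 0.6)                  # leave a gap to create a margin
    X, y = X[keep], np.where(r2[keep] < 0.3, 1.0, -1.0)

    support = kernel_perceptron(X, y)
    preds = np.sign([predict(support, x) for x in X])
    print("support vectors:", len(support), " training accuracy:", (preds == y).mean())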
Luckiness functions
===================
The basic idea of margins was in essence to view some separators as
"simpler" than others, using margin as the notion of "simple". What makes
this different from our Occam bounds is that the notion of "simple"
depends on the data. Basically, we have a data-dependent ordering of
functions such that if we're lucky and the target has low complexity in
this ordering, then we don't need much training data. More generally,
things like this are called "luckiness functions". If a function is a
"legal notion of luckiness" (basically, the ordering depends only on the
data points and not their labels, and not too many splits of the data have
small complexity), then you can apply sample complexity bounds.

PAC-MDL bounds
==============
A nice way to view this is in terms of a game between two players A and B.
A and B both have the same set of unlabeled data. A is then given labels
for a random subset of the data, and has to transmit a string that will
uncompress to a labeling of the entire data set. The claim is: if A can do
this with not too many bits, then we can be confident in the labels on the
test set. The point is that the communication language is allowed to refer
to the unlabeled data. E.g., for VC bounds: since there are only C[2m]
ways of splitting the whole bunch, A can just give the index of the split
she is thinking of. Can also interpret things like "PAC-Bayes" bounds in
this way.

A super-simple algorithm for weak-learning a large-margin separator
===================================================================
While we're at it, here is a super-simple algorithm for weak-learning a
large-margin separator.

    Algorithm: pick h at random.
       If error(h) < 1/2 - gamma/4, we're done.
       If error(h) > 1/2 + gamma/4, we're done (use -h).
       Else repeat.

How to analyze this? Let w be the vector defining the target function.
Since we are looking at both h and -h, we can assume wlog that w.h >= 0.
Now, let x be one of the examples, and consider
Pr_h(h labels x incorrectly | w.h >= 0). To analyze this, we only need to
consider the 2-dimensional plane defined by w and x: h's projection onto
this plane (ignoring length) looks like a random vector subject to having
w.h >= 0. What fraction of these label x incorrectly? The answer is
angle(w,x)/pi. Since x has margin at least gamma, cos(angle(w,x)) >= gamma,
and using cos(theta) <= pi/2 - theta this gives angle(w,x) <= pi/2 - gamma.
So Pr(h is wrong on x) <= (pi/2 - gamma)/pi = 1/2 - gamma/pi. This means
that for a random h, E[min(err(h), err(-h))] <= 1/2 - gamma/pi.

Now, it could be that with small probability we get a really good h and
with large probability we get a not-so-good h. But since
min(err(h), err(-h)) is always at most 1/2 and has expectation at most
1/2 - gamma/pi, a Markov-inequality argument shows that the probability we
halt with a weak learner on any given draw is at least Omega(gamma). Note:
we can then apply boosting (since boosting just reweights the data, and
reweighting doesn't affect margins).

Margins and boosting
====================
For boosting, one can also get good generalization results based on the
margin of the vote (good generalization on examples where the vote is far
from 50/50). One key difference between the notion of "margin" we were
discussing and the notion of margin here is that we have been discussing
an "L_2 margin": we scaled examples to have length 1 in the L_2 sense.
Here, you have an "L_1 margin", since you are effectively scaling the vote
so that the weights on the weak hypotheses have length 1 in the L_1 sense.
Will discuss this topic later.
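To make the L_1 notion concrete, here is a tiny Python/NumPy sketch (the
weak-hypothesis predictions, labels, and vote weights are made-up numbers
for illustration) computing the L_1 margin of a weighted vote:

    import numpy as np

    # Predictions of T=3 weak hypotheses on 4 examples (entries in {-1,+1}),
    # labels y, and nonnegative vote weights alpha (all made up for this sketch).
    H = np.array([[+1, +1, -1],
                  [+1, -1, +1],
                  [-1, +1, +1],
                  [-1, -1, +1]], dtype=float)       # H[i, t] = h_t(x_i)
    y = np.array([+1, +1, +1, -1], dtype=float)
    alpha = np.array([0.5, 0.3, 0.2])

    # L_1 margin of the vote on example i:  y_i * (H[i] . alpha) / ||alpha||_1.
    # A margin of 0.2 means the weighted vote is 60/40 in favor of the correct label.
    vote_margins = y * (H @ alpha) / np.abs(alpha).sum()
    print(vote_margins)     # example 0: (+0.5 + 0.3 - 0.2) / 1.0 = 0.6, etc.

    # For contrast, the L_2 margin of a separator w on a unit-length example x is
    # y * (w . x) / ||w||_2: there the examples are normalized in L_2, while here
    # the vote weights are normalized in L_1.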