<?xml version="1.0"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "mathml.dtd"> 
<?xml-stylesheet type="text/css" href="thesis.css"?> 
<html  
xmlns="http://www.w3.org/1999/xhtml"  
><head><title>2.2 Relationship to Prior Work</title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 
<meta name="generator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<meta name="originator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<!-- 3,early_,early^,xhtml,mozilla --> 
<meta name="src" content="thesis.tex" /> 
<meta name="date" content="2002-08-28 13:56:00" /> 
<link rel="stylesheet" type="text/css" href="thesis.css" /> 
</head><body 
>
   <div class="crosslinks"><p class="noindent">[<a 
href="thesisse8.xml" >next</a>] [<a 
href="thesisse6.xml" >prev</a>] [<a 
href="thesisse6.xml#tailthesisse6.xml" >prev-tail</a>] [<a 
href="#tailthesisse7.xml">tail</a>] [<a 
href="thesisch2.xml#thesisse7.xml" >up</a>] </p></div>
   <h3 class="sectionHead"><span class="titlemark">2.2. </span> <a 
  name="x12-110002.2"></a>Relationship to Prior Work</h3>
   <h4 class="subsectionHead"><span class="titlemark">2.2.1. </span> <a 
  name="x12-120002.2.1"></a>Distribution free learning</h4>
<!--l. 367--><p class="noindent">The question answered here differs significantly from much prior learning theory,
including the results of Vapnik <span class="cite">[<a 
href="thesisli2.xml#XVapnik"><span 
class="ecbx-1000">51</span></a>]</span>, Valiant <span class="cite">[<a 
href="thesisli2.xml#XValiant"><span 
class="ecbx-1000">50</span></a>]</span>, Devroye <span class="cite">[<a 
href="thesisli2.xml#XDevroye"><span 
class="ecbx-1000">11</span></a>]</span>, and many others. See
<span class="cite">[<a 
href="thesisli2.xml#XHaussler"><span 
class="ecbx-1000">20</span></a>]</span> for a good summary. The principal difference is the question we address: &#x201C;Have we
learned?&#x201D;
</p><!--l. 372--><p class="indent">   Much of the prior work in learning theory addresses the question: &#x201C;How many
examples are needed in order to guarantee that I will choose (nearly) the best hypothesis
from some fixed hypothesis set?&#x201D;
</p><!--l. 376--><p class="indent">   In particular, suppose that we have a hypothesis set <!--l. 376--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>H</mi></mrow></math>. If we
guarantee that:
</p><!--l. 379--><p class="indent">   <!--l. 379--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">        <mrow 
>
                         <msub><mrow 
><mo 
>Pr</mo></mrow><mrow 
><mi 
>S</mi><mo 
class="MathClass-rel">&#x223C;</mo><msup><mrow 
><mi 
>D</mi></mrow><mrow 
><mi 
>m</mi></mrow></msup 
></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>&#x2203;</mi><mi 
>h</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mi 
>H</mi> <mo 
class="MathClass-punc">:</mo>  <mo 
class="MathClass-rel">&#x2223;</mo><msub><mrow 
><mi 
>e</mi></mrow><mrow 
><mi 
>D</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-bin">&#x2212;</mo><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo>&#x0302;</mo></mover></mrow><mrow 
><mi 
>S</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
class="MathClass-rel">&#x2223;</mo> <mo 
class="MathClass-rel">&#x003E;</mo> <mi 
>&#x03B5;</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">&#x003C;</mo> <mi 
>&#x03B4;</mi>
</mrow></math>
then if our learning algorithm chooses the hypothesis with minimum
empirical error (known as &#x201C;empirical risk minimization&#x201D; in the language
of <span class="cite">[<a 
href="thesisli2.xml#XV2"><span 
class="ecbx-1000">52</span></a>]</span>), we will be guaranteed that the chosen hypothesis satisfies: <!--l. 384--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">
                                                                     

                                                                     
<mrow 
>
                           <msub><mrow 
><mi 
>e</mi></mrow><mrow 
><mi 
>D</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-bin">&#x2212;</mo> <msubsup><mrow 
><mi 
>e</mi></mrow><mrow 
><mi 
>D</mi></mrow><mrow 
><mo 
class="MathClass-bin">&#x2217;</mo></mrow></msubsup 
><mo 
class="MathClass-rel">&#x2264;</mo> <mn>2</mn><mi 
>&#x03B5;</mi>
</mrow></math> where <!--l. 386--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msubsup><mrow 
><mi 
>e</mi></mrow><mrow 
><mi 
>D</mi></mrow><mrow 
><mo 
class="MathClass-bin">&#x2217;</mo></mrow></msubsup 
> <mo 
class="MathClass-rel">=</mo><msub><mrow 
><mo 
> inf</mo> </mrow><mrow 
><mi 
>h</mi><mo 
class="MathClass-rel">&#x2208;</mo><mi 
>H</mi></mrow></msub 
><msub><mrow 
><mi 
>e</mi></mrow><mrow 
><mi 
>D</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>.
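</p><p class="indent">   The factor of two follows from a short chain of inequalities. The sketch below is written in LaTeX and assumes, for simplicity, that the infimum is attained by some hypothesis. The symbols \hat{h} (the hypothesis with minimum empirical error) and h^* (a hypothesis attaining the infimum) are introduced here only for illustration.
</p>
<pre class="verbatim">
```latex
% Assumption for this sketch: h^* attains the infimum, so e_D(h^*) = e_D^*.
% On the event that every h in H satisfies |e_D(h) - \hat{e}_S(h)| \le \varepsilon,
% which has probability greater than 1 - \delta over the draw of S:
\[
  e_D(\hat{h})
    \le \hat{e}_S(\hat{h}) + \varepsilon
    \le \hat{e}_S(h^*) + \varepsilon
    \le e_D(h^*) + 2\varepsilon
    = e_D^* + 2\varepsilon .
\]
```
</pre>
<p class="indent">   The first and third steps use the deviation bound, and the second uses the fact that \hat{h} minimizes the empirical error. If the infimum is not attained, the same argument applied to hypotheses whose true error approaches the infimum gives the same conclusion.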
</p><!--l. 388--><p class="indent">   There are several difficulties inherent in this approach which we will avoid.
</p><!--l. 390--><p class="indent">
           </p><ol type="1" class="enumerate1" start="1" 
>
        <li class="enumerate"><a 
  name="x12-12002x1"></a>Results  in  this  model  apply  only  for  the  empirical  risk  minimization
        (ERM) algorithm. Finding the ERM hypothesis is known to be NP-hard
        <span class="cite">[<a 
href="thesisli2.xml#XAR"><span 
class="ecbx-1000">4</span></a>]</span> for some hypothesis spaces and, in general, is essentially dependent
        upon  the  axiom  of  choice.  These  results  will  approximately  apply  to
        approximate ERM algorithms, but it is unclear how &#x201C;approximate&#x201D; typical
        learning algorithms are. By answering &#x201C;Have we learned?&#x201D; this complexity
        is avoided.
           </li>
        <li class="enumerate"><a 
  name="x12-12004x2"></a>There is no natural notion of preference (or &#x201C;prior&#x201D;) amongst the hypotheses,
        <!--l. 398--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>h</mi></mrow></math>,
        in the hypothesis space, <!--l. 398--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>H</mi></mrow></math>.
        This is very important for practical application as is shown in Figure
        <a 
href="thesisse55.xml#x76-1110013">12.3.3<!--tex4ht:ref: fig-simple-holdout --></a>.
           </li>
        <li class="enumerate"><a 
  name="x12-12006x3"></a>Answers to this question are generally insensitive to the final result, <!--l. 400--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
        <mrow 
><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover></mrow><mrow 
><mi 
>S</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>.
        This is again important in practice (see Figure <a 
href="thesisse12.xml#x18-260021">3.4.1<!--tex4ht:ref: fig-bounds --></a>) because the variance
        of the distribution of the empirical error, <!--l. 402--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover></mrow><mrow 
><mi 
>S</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>,
        changes significantly with different true error rates, <!--l. 403--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><msub><mrow 
><mi 
>e</mi></mrow><mrow 
><mi 
>D</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>.</li></ol>
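<p class="indent">   To illustrate the third point: for one fixed hypothesis and m independent examples, the number of errors is binomially distributed, so the empirical error has variance equal to the true error times one minus the true error, divided by m. The following Python sketch tabulates the resulting standard deviation at several true error rates. The function name and the sample size of 1000 are choices made for this illustration, not taken from the thesis.
</p>
<pre class="verbatim">
```python
import math


def empirical_error_std(true_error: float, m: int) -> float:
    """Standard deviation of the empirical error of one fixed hypothesis
    measured on m independent examples (binomial model)."""
    return math.sqrt(true_error * (1.0 - true_error) / m)


if __name__ == "__main__":
    m = 1000  # illustrative sample size
    for e in (0.001, 0.01, 0.1, 0.5):
        std = empirical_error_std(e, m)
        print(f"true error {e:>5}: std of empirical error = {std:.4f}")
```
</pre>
<p class="indent">   With 1000 examples, the standard deviation at a true error of 0.5 is roughly sixteen times the standard deviation at a true error of 0.001, which suggests why a single deviation width for all error rates is loose for hypotheses with small error.
</p>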
<!--l. 404--><p class="nopar"> These drawbacks can be alleviated (but not removed) to some extent. For example, many
people apply results in this model to arbitrary learning algorithms by simply noticing that the
deviation of the empirical and true error rates is small. This, in turn, implies that whatever
hypothesis your algorithm learns (empirical minimum or not), its true error is within <!--l. 409--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>&#x03B5;</mi></mrow></math> of the
empirical error. This is one approach to addressing the question &#x201C;Have we learned?&#x201D;
Others will be presented here.
</p><!--l. 413--><p class="indent">   The second drawback can be alleviated using &#x201C;structural risk minimization&#x201D; (as in
<span class="cite">[<a 
href="thesisli2.xml#XV2"><span 
class="ecbx-1000">52</span></a>]</span>). Structural risk minimization removes most of drawback (2), although it is awkward
for specifying very fine-grained preferences. We will use an arbitrary measure <!--l. 416--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>P</mi></mrow></math> over the hypothesis
space <!--l. 416--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>H</mi></mrow></math>.
                                                                     

                                                                     
This prior <!--l. 417--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>P</mi></mrow></math>
need not be a Bayesian prior; all that is formally required is that this
&#x201C;prior&#x201D; be specified without using information from the examples. The notion of measure <!--l. 419--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>P</mi></mrow></math> is
more general than structural risk minimization because for <span 
class="ecti-1000">every </span>&#x201C;structure&#x201D; <!--l. 420--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>T</mi></mrow></math>
on which structural risk minimization is done, we can produce a &#x201C;prior&#x201D; <!--l. 421--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>P</mi></mrow><mrow 
><mi 
>T</mi></mrow></msub 
></mrow></math> dependent upon
the structure <!--l. 422--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>T</mi></mrow></math>
and derive the same (or tighter) results.
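</p><p class="indent">   One possible way to turn a structure into such a measure is sketched below in Python. It is not the construction used in this thesis: it assumes the structure is a finite nested sequence of finite hypothesis sets, gives level i a total mass of 2 to the power minus i, and splits that mass evenly among the hypotheses that first appear at that level. The function name and the example hypotheses are hypothetical.
</p>
<pre class="verbatim">
```python
from fractions import Fraction
from typing import Dict, Hashable, Sequence


def prior_from_structure(levels: Sequence[Sequence[Hashable]]) -> Dict[Hashable, Fraction]:
    """Build a measure over hypotheses from nested sets levels[0], levels[1], ...

    Level i (counting from 1) receives total mass 2**-i, shared equally by the
    hypotheses that appear for the first time at that level.
    """
    prior: Dict[Hashable, Fraction] = {}
    for i, level in enumerate(levels, start=1):
        # Hypotheses not seen at any earlier level, with duplicates removed.
        new = list(dict.fromkeys(h for h in level if h not in prior))
        for h in new:
            prior[h] = Fraction(1, 2 ** i) / len(new)
    return prior


if __name__ == "__main__":
    structure = [["a"], ["a", "b", "c"], ["a", "b", "c", "d"]]
    for h, weight in prior_from_structure(structure).items():
        print(h, weight)
```
</pre>
<p class="indent">   The total mass is at most one, and hypotheses that appear earlier in the structure receive more weight. Individual weights can then be adjusted freely, which is the kind of fine-grained preference that a structure alone expresses awkwardly.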
</p><!--l. 424--><p class="indent">   The third drawback can be alleviated using &#x201C;relative risk&#x201D; as in <span class="cite">[<a 
href="thesisli2.xml#XV2"><span 
class="ecbx-1000">52</span></a>]</span>, but this still
leaves some slack in the bounds.
</p><!--l. 427--><p class="indent">   The work in this thesis can be thought of as directly addressing the altered question &#x201C;Have we learned?&#x201D;,
which alleviates the first drawback. Because our goal is to address this question directly rather than
to derive the answer from other results, we will be able to state and prove tighter results.
Furthermore, because we address the question people encounter in practice,
the bounds presented here will be more directly applicable. In particular, they
will apply to arbitrary learning algorithms (although not necessarily <span 
class="ecti-1000">tightly </span>to
arbitrary learning algorithms) rather than just the empirical risk minimization
algorithm.
</p>
   <h4 class="subsectionHead"><span class="titlemark">2.2.2. </span> <a 
  name="x12-130002.2.2"></a>Bayesian Analysis</h4>
<!--l. 439--><p class="noindent">The basic result of Bayesian analysis is that Bayes&#x2019; rule: <!--l. 440--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">
<mrow 
>
                        <mo 
>Pr</mo><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>f</mi><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>S</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <mfrac><mrow 
><mo 
>Pr</mo><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>S</mi><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>f</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
>Pr</mo><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>f</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow> 
     <mrow 
><mo 
>Pr</mo><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>S</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfrac>
</mrow></math> is the
<span 
class="ecti-1000">optimal </span>learning algorithm when the learning problem is drawn from the distribution <!--l. 443--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mo 
>Pr</mo><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>f</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>. Note that
the &#x201C;hypothesis&#x201D; learned by the Bayesian learning algorithm is the weighted average predictor, <!--l. 445--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">
                                                                     

                                                                     
<mrow 
>
                      <mi 
>h</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>x</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <!--mstyle 
class="text"--><mtext class="textrm">sign</mtext><!--/mstyle--><mrow><mo 
class="MathClass-open">(</mo><mrow><mo 
class="MathClass-op">&#x222B;</mo>
 <mo 
>Pr</mo><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>f</mi><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>S</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mi 
>f</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>x</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mi 
>d</mi><mi 
>f</mi></mrow><mo 
class="MathClass-close">)</mo></mrow>
</mrow></math> This rather
strong statement is difficult to utilize in practice because specifying and using arbitrary distributions
<!--l. 448--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">     <mrow 
><mo 
>Pr</mo><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>f</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
is unwieldy. In practice, many people use approximations and
there is considerable question about whether or not the specified <!--l. 450--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mo 
>Pr</mo><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>f</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> is
&#x201C;right enough&#x201D; after approximations. A later chapter is devoted to analyzing hypotheses
of this weighted-average form, and a theorem about their accuracy (Theorem <a 
href="thesisse30.xml#x45-66001r1">7.2.1<!--tex4ht:ref: th-averaging --></a>) is
proved.
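</p><p class="indent">   The weighted-average predictor above can be made concrete with a small Python sketch. It rests on assumptions that are not in the text: the hypothesis set is finite, so the integral becomes a sum; labels are -1 or +1; and the likelihood is a simple label-noise model with rate <span class="ecti-1000">noise</span>. All names are hypothetical.
</p>
<pre class="verbatim">
```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[float], int]  # maps an input to a label in {-1, +1}


def posterior(prior: Sequence[float], hyps: Sequence[Hypothesis],
              sample: Sequence[Tuple[float, int]], noise: float = 0.1) -> List[float]:
    """Bayes' rule over a finite hypothesis set, with an assumed likelihood in
    which each label disagrees with f independently with probability `noise`."""
    weights = []
    for p, f in zip(prior, hyps):
        likelihood = 1.0
        for x, y in sample:
            likelihood *= (1.0 - noise) if f(x) == y else noise
        weights.append(p * likelihood)   # Pr(S|f) Pr(f)
    total = sum(weights)                 # Pr(S)
    return [w / total for w in weights]


def bayes_predict(post: Sequence[float], hyps: Sequence[Hypothesis], x: float) -> int:
    """Sign of the posterior-weighted average of the hypotheses' predictions at x."""
    avg = sum(w * f(x) for w, f in zip(post, hyps))
    return 1 if avg >= 0 else -1


if __name__ == "__main__":
    # Three threshold classifiers under a uniform prior (illustrative data).
    hyps = [lambda x, t=t: 1 if x >= t else -1 for t in (0.25, 0.5, 0.75)]
    sample = [(0.1, -1), (0.6, 1), (0.9, 1)]
    post = posterior([1 / 3, 1 / 3, 1 / 3], hyps, sample)
    print(post, bayes_predict(post, hyps, 0.7))
```
</pre>
<p class="indent">   In this example the classifier with threshold 0.75 misclassifies one example and therefore receives a much smaller posterior weight than the other two. Even in this toy setting the prediction depends on the prior and on the assumed noise rate, both of which must be specified in advance.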
</p><!--l. 454--><p class="indent">   Some work has been done to analyze the robustness of Bayesian algorithms
under approximation errors. There are two common traits of Bayesian-related
analysis:
</p><!--l. 457--><p class="indent">
           </p><ol type="1" class="enumerate1" start="1" 
>
        <li class="enumerate"><a 
  name="x12-13002x1"></a>All statements are parameterized by a prior, <!--l. 458--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>P</mi></mrow></math>.
           </li>
        <li class="enumerate"><a 
  name="x12-13004x2"></a>The analysis is typically an &#x201C;average case&#x201D; analysis (with respect to the prior <!--l. 459--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>P</mi></mrow></math>
        and a fixed Bayesian or approximate Bayesian algorithm, <!--l. 460--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>A</mi></mrow></math>).</li></ol>
<!--l. 461--><p class="nopar"> An example of this sort of analysis can be found in <span class="cite">[<a 
href="thesisli2.xml#XHKS"><span 
class="ecbx-1000">21</span></a>]</span>. The work in this thesis adopts parameterization
by a measure <!--l. 463--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>P</mi></mrow></math>,
but is <span 
class="ecti-1000">not </span>an average case analysis. Our analysis is &#x201C;worst-case&#x201D; in the
sense that it applies to <span 
class="ecti-1000">all </span>learning problems whether or not the measure <!--l. 465--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>P</mi></mrow></math> is a
&#x201C;correct&#x201D; prior. This approach is closely related to the work of McAllester <span class="cite">[<a 
href="thesisli2.xml#XPB"><span 
class="ecbx-1000">39</span></a>]</span>,
and a later chapter is devoted to a refinement of this result. Despite this, it is interesting
to note that our bounds are minimized (in some sense) when the measure <!--l. 469--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>P</mi></mrow></math> is in
fact a &#x201C;correct&#x201D; Bayesian prior.
</p><!--l. 472--><p class="indent">   There have been other attempts to connect Bayesian average-case settings with
worst-case settings. One interesting example is <span class="cite">[<a 
href="thesisli2.xml#XYpredicting"><span 
class="ecbx-1000">16</span></a>]</span> which discusses the connection between
the Bayesian setting and the mistake bound model. This is especially interesting
because the mistake bound model is, in some sense, more &#x201C;worst-case&#x201D; than the
model we consider here as no assumption of example independence is made.
Further work <span class="cite">[<a 
href="thesisli2.xml#XGrun"><span 
class="ecbx-1000">29</span></a>]</span> has occurred in the &#x201C;Minimum Description Length&#x201D; (MDL)
community.
</p>
   <div class="crosslinks"><p class="noindent">[<a 
href="thesisse8.xml" >next</a>] [<a 
href="thesisse6.xml" >prev</a>] [<a 
href="thesisse6.xml#tailthesisse6.xml" >prev-tail</a>] [<a 
href="thesisse7.xml" >front</a>] [<a 
href="thesisch2.xml#thesisse7.xml" >up</a>] </p></div><a 
  name="tailthesisse7.xml"></a>   
</body> 
</html> 
