<?xml version="1.0"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "mathml.dtd"> 
<?xml-stylesheet type="text/css" href="thesis.css"?> 
<html  
xmlns="http://www.w3.org/1999/xhtml"  
><head><title>5.1 A Motivating Observation</title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 
<meta name="generator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<meta name="originator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<!-- 3,early_,early^,xhtml,mozilla --> 
<meta name="src" content="thesis.tex" /> 
<meta name="date" content="2002-08-28 13:56:00" /> 
<link rel="stylesheet" type="text/css" href="thesis.css" /> 
</head><body 
>
   <div class="crosslinks"><p class="noindent">[<a 
href="thesisse22.xml" >next</a>] [<a 
href="#tailthesisse21.xml">tail</a>] [<a 
href="thesisch5.xml#thesisse21.xml" >up</a>] </p></div>
   <h3 class="sectionHead"><span class="titlemark">5.1. </span> <a 
  name="x30-390005.1"></a>A Motivating Observation</h3>
<!--l. 1403--><p class="noindent">Imagine, for the moment, that we know the (unknown) problem distribution, <!--l. 1403--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>D</mi></mrow></math>. For a given learning
algorithm <!--l. 1404--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>A</mi></mrow></math>, a distribution <!--l. 1404--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>D</mi></mrow></math> on labeled examples induces
a distribution <!--l. 1405--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> over the
possible hypotheses <!--l. 1405--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>h</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mi 
>H</mi></mrow></math>
produced by algorithm <!--l. 1406--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>A</mi></mrow></math>
after <!--l. 1406--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>m</mi></mrow></math>
examples<a 
href="thesis31.xml" name="thesis31.xml" ><sup>1</sup></a>.
A natural choice for the Occam&#x2019;s Razor bound (<a 
href="thesisse20.xml#x27-36001r1">4.6.1<!--tex4ht:ref: th-ORB --></a>) is the measure <!--l. 1410--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>. Is
this choice optimal? The answer is &#x201C;yes&#x201D;, given the right notion of optimal. In particular,
if we start with the relative entropy Occam&#x2019;s Razor bound (<a 
href="thesisse20.xml#x27-36002r2">4.6.2<!--tex4ht:ref: th-reorb --></a>), we can show that <!--l. 1412--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
minimizes the expected value of the bound on the Kullback-Leibler divergence between
the empirical error and true error.
</p>
   <div class="newtheorem">
<!--l. 1416--><p class="noindent"><span class="head">
<a 
  name="x30-39002r1"></a>
                                                                     

                                                                     
  <span 
class="eccc-1000">T<small 
class="small-caps">H</small><small 
class="small-caps">E</small><small 
class="small-caps">O</small><small 
class="small-caps">R</small><small 
class="small-caps">E</small><small 
class="small-caps">M</small> </span>5.1.1<span 
class="eccc-1000">.</span></span>
</p><!--l. 1417--><p class="indent">   <span 
class="ecti-1000">(KL divergence minimization) </span><!--l. 1417--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
<span 
class="ecti-1000">minimizes the expected value of the bound on the KL divergence in the relative entropy Occam&#x2019;s</span>
<span 
class="ecti-1000">Razor bound (</span><a 
href="thesisse20.xml#x27-36002r2"><span 
class="ecti-1000">4.6.2</span><!--tex4ht:ref: th-reorb --></a><span 
class="ecti-1000">).</span>
</p>
   </div>
   <div class="proof">
<!--l. 1421--><p class="indent">   <span class="head">
   <span 
class="eccc-1000">P<small 
class="small-caps">R</small><small 
class="small-caps">O</small><small 
class="small-caps">O</small><small 
class="small-caps">F</small>.</span> </span>We need to show that <!--l. 1422--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">      <mrow 
>
                           <mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo><msub><mrow 
> <!--mstyle 
class="text"--><mtext class="textrm">argmin</mtext><!--/mstyle--></mrow><mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></msub 
><msub><mrow 
> <mo 
class="MathClass-op">&#x2211;</mo>
          </mrow><mrow 
><mi 
>h</mi></mrow></msub 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mfrac><mrow 
><mo 
>ln</mo><!--nolimits-->   <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfrac> <mo 
class="MathClass-bin">+</mo><mo 
> ln</mo><!--nolimits--> <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>&#x03B4;</mi></mrow></mfrac></mrow>
        <mrow 
><mi 
>m</mi></mrow></mfrac>
</mrow></math>
Removing the terms on which the minimizer does not depend (the additive ln(1/&#x03B4;) term and the positive factor 1/m), we get: <!--l. 1425--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">      <mrow 
><msub><mrow 
>
                                    <!--mstyle 
class="text"--><mtext class="textrm">argmin</mtext><!--/mstyle--></mrow><mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></msub 
><msub><mrow 
> <mo 
class="MathClass-op">&#x2211;</mo>
                          </mrow><mrow 
><mi 
>h</mi></mrow></msub 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
>ln</mo><!--nolimits-->   <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfrac>
</mrow></math>
                                                                     

                                                                     
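Adding the negative entropy of <math xmlns="http://www.w3.org/1998/Math/MathML" mode="inline"><mrow><mi>q</mi></mrow></math>, a constant that does not depend on <math xmlns="http://www.w3.org/1998/Math/MathML" mode="inline"><mrow><mi>p</mi></mrow></math>, we get: <!--l. 1428--><math 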
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">      <mrow 
><msub><mrow 
>
                                    <!--mstyle 
class="text"--><mtext class="textrm">argmin</mtext><!--/mstyle--></mrow><mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></msub 
><msub><mrow 
> <mo 
class="MathClass-op">&#x2211;</mo>
                          </mrow><mrow 
><mi 
>h</mi></mrow></msub 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
>ln</mo><!--nolimits--> <mfrac><mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow> 
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfrac>
</mrow></math>
This is equivalent to minimizing the Kullback-Leibler divergence between the distributions
<!--l. 1431--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
and <!--l. 1431--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>, which attains its minimum value of zero exactly when <!--l. 1431--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>.
<span class="qed"><span 
class="msam-10">&#x25AB;</span></span>
</p>
   </div>
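<p class="indent">   The theorem can also be checked numerically. The following is a minimal sketch, not part of the original argument: the distribution q, the sample size m, the value of &#x03B4;, and all function names are hypothetical choices made only for illustration. It evaluates the expected bound term from the proof at p = q and at randomly drawn alternatives p, and compares the excess with KL(q || p) / m.
</p>
<pre class="verbatim">
```python
import math
import random


def expected_bound(q, p, m, delta):
    """Return sum_h q(h) * (ln(1/p(h)) + ln(1/delta)) / m."""
    return sum(
        qh * (math.log(1.0 / ph) + math.log(1.0 / delta)) / m
        for qh, ph in zip(q, p)
    )


def kl_divergence(q, p):
    """Return KL(q || p) = sum_h q(h) * ln(q(h) / p(h)) for positive q, p."""
    return sum(qh * math.log(qh / ph) for qh, ph in zip(q, p))


def random_distribution(n, rng):
    """Draw a strictly positive distribution over n hypotheses."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical values, chosen only for illustration.
q = [0.5, 0.3, 0.15, 0.05]   # induced distribution q(h) over four hypotheses
m, delta = 1000, 0.05        # sample size and confidence parameter

rng = random.Random(0)
at_q = expected_bound(q, q, m, delta)

# Excess of the expected bound over its value at p = q, for random choices of p.
candidates = [random_distribution(len(q), rng) for _ in range(1000)]
gaps = [expected_bound(q, p, m, delta) - at_q for p in candidates]

# The excess should match KL(q || p) / m up to rounding error.
errors = [abs(gap - kl_divergence(q, p) / m) for gap, p in zip(gaps, candidates)]

print("bound at p = q:", at_q)
print("smallest excess over 1000 random p:", min(gaps))
print("largest deviation from KL(q || p) / m:", max(errors))
```
</pre>
<p class="indent">   Under these assumptions every sampled p gives a non-negative excess, and the excess agrees with KL(q || p) / m up to floating-point rounding, which is the content of the last two steps of the proof.
</p>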
<!--l. 1433--><p class="indent">   Using the KL divergence as our notion of loss is somewhat unintuitive.
However, it is mathematically simple and not unreasonable. For all true
errors, the KL divergence is sandwiched between an <!--l. 1435--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>l</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
></mrow></math> and an <!--l. 1435--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>l</mi></mrow><mrow 
><mn>2</mn></mrow></msub 
></mrow></math>
metric. Since these are two of the most common metrics, a notion of loss based on the
KL divergence should behave similarly well.
</p><!--l. 1439--><p class="indent">   The point of these observations is that if the structure of the learning algorithm produces
a choice <!--l. 1440--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> that
approximates <!--l. 1440--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>,
the result should be better estimation bounds.
</p>
   <div class="crosslinks"><p class="noindent">[<a 
href="thesisse22.xml" >next</a>] [<a 
href="thesisse21.xml" >front</a>] [<a 
href="thesisch5.xml#thesisse21.xml" >up</a>] </p></div><a 
  name="tailthesisse21.xml"></a>  
</body> 
</html> 
