<?xml version="1.0"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "mathml.dtd"> 
<?xml-stylesheet type="text/css" href="thesis.css"?> 
<html  
xmlns="http://www.w3.org/1999/xhtml"  
><head><title>6.1 PAC-Bayes Basics</title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 
<meta name="generator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<meta name="originator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<!-- 3,early_,early^,xhtml,mozilla --> 
<meta name="src" content="thesis.tex" /> 
<meta name="date" content="2002-08-28 13:56:00" /> 
<link rel="stylesheet" type="text/css" href="thesis.css" /> 
</head><body 
>
   <div class="crosslinks"><p class="noindent">[<a 
href="thesisse26.xml" >next</a>] [<a 
href="#tailthesisse25.xml">tail</a>] [<a 
href="thesisch6.xml#thesisse25.xml" >up</a>] </p></div>
   <h3 class="sectionHead"><span class="titlemark">6.1. </span> <a 
  name="x38-580006.1"></a>PAC-Bayes Basics</h3>
<!--l. 2256--><p class="noindent">In the PAC-Bayes setting, a classifier is defined by a distribution <!--l. 2256--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> over the
hypothesis space. Each classification is carried out according to a hypothesis <span 
class="ecti-1000">sampled </span>from <!--l. 2258--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>.
We are interested in the gap between the <span 
class="ecti-1000">expected </span>generalization error <!--l. 2259--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>e</mi></mrow><mrow 
><mi 
>q</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">&#x2261;</mo> <msub><mrow 
><mi 
>E</mi></mrow><mrow 
><mi 
>q</mi></mrow></msub 
> <mfenced separators="" 
open="["  close="]" ><mrow><mi 
>e</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfenced></mrow></math> and the <span 
class="ecti-1000">expected</span>
empirical error <!--l. 2260--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover></mrow><mrow 
><mi 
>q</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">&#x2261;</mo> <msub><mrow 
><mi 
>E</mi></mrow><mrow 
><mi 
>q</mi></mrow></msub 
> <mfenced separators="" 
open="["  close="]" ><mrow><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfenced></mrow></math>,
where both expectations are taken with respect to <!--l. 2261--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>. The
gap will be bounded in terms of the Kullback-Leibler divergence (see <span class="cite">[<a 
href="thesisli2.xml#XCover"><span 
class="ecbx-1000">10</span></a>]</span>). Recall
that:
</p>
   <table class="equation"><tr><td> <a 
  name="x38-58001r1"></a>
<!--l. 2265--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">     
                                    <!--mstyle 
class="text"--><mtext class="textrm">KL</mtext><!--/mstyle--><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>q</mi><mo 
class="MathClass-rel">&#x2223;</mo><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>p</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <msub><mrow 
><mi 
>E</mi></mrow><mrow 
><mi 
>h</mi><mo 
class="MathClass-rel">&#x223C;</mo><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></msub 
><mo 
> ln</mo><!--nolimits--> <mfrac><mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow> 
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfrac>
</math>
<!--l. 2268--><p class="nopar"></p></td><td class="eq-no">(6.1.1)</td></tr></table>
If the support is finite, we have <table class="equation"><tr><td> <a 
  name="x38-58002r2"></a>
<!--l. 2270--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">     
                                    <!--mstyle 
class="text"--><mtext class="textrm">KL</mtext><!--/mstyle--><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>q</mi><mo 
class="MathClass-rel">&#x2223;</mo><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>p</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo><msub><mrow 
> <mo 
class="MathClass-op">&#x2211;</mo>
   </mrow><mrow 
><mi 
>h</mi></mrow></msub 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
>ln</mo><!--nolimits--> <mfrac><mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow> 
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfrac>
</math>
<!--l. 2273--><p class="nopar"></p></td><td class="eq-no">(6.1.2)</td></tr></table>
The KL-divergence, also known as the relative entropy, is a nonnegative, asymmetric measure of discrepancy between probability distributions,
with <!--l. 2275--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><!--mstyle 
class="text"--><mtext class="textrm">KL</mtext><!--/mstyle--><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>q</mi><mo 
class="MathClass-rel">&#x2223;</mo><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>p</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <mn>0</mn> <mo 
class="MathClass-rel">&#x21D4;</mo> <mi 
>q</mi> <mo 
class="MathClass-rel">=</mo> <mi 
>p</mi></mrow></math>
almost everywhere.
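<p class="indent">   For a finite hypothesis space, equation (6.1.2) can be evaluated directly. The following is a minimal sketch, assuming the two distributions are given as Python lists of probabilities over the same hypotheses; the function name <code>kl_divergence</code> and the example distributions are illustrative and do not come from the original text.
</p>
<pre class="verbatim">
```python
import math

def kl_divergence(q, p):
    """KL(q||p) = sum_h q(h) ln(q(h)/p(h)) for finite distributions (lists)."""
    total = 0.0
    for qh, ph in zip(q, p):
        if qh == 0.0:
            continue  # convention: 0 * ln(0 / p) = 0
        if ph == 0.0:
            return math.inf  # q puts mass where p has none
        total += qh * math.log(qh / ph)
    return total

# Illustrative distributions over three hypotheses (assumed values).
p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]

print("KL(q||p) =", kl_divergence(q, p))
print("KL(p||q) =", kl_divergence(p, q))  # differs from KL(q||p): asymmetric
print("KL(p||p) =", kl_divergence(p, p))  # zero when the distributions agree
```
</pre>
<p class="indent">   The two orderings give different values, which illustrates the asymmetry, and the divergence of a distribution from itself is zero.
</p>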
   <div class="newtheorem">
<!--l. 2277--><p class="noindent"><span class="head">
<a 
  name="x38-58003r1"></a>
  <span 
class="eccc-1000">T<small 
class="small-caps">H</small><small 
class="small-caps">E</small><small 
class="small-caps">O</small><small 
class="small-caps">R</small><small 
class="small-caps">E</small><small 
class="small-caps">M</small> </span>6.1.1<span 
class="eccc-1000">.</span></span>
</p><!--l. 2278--><p class="indent">   <span 
class="ecti-1000">(PAC-Bayes </span><span class="cite">[<a 
href="thesisli2.xml#XPB"><span 
class="ecbx-1000">39</span></a>]</span><span 
class="ecti-1000">) For all &#x201C;priors&#x201D; </span><!--l. 2278--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
<span 
class="ecti-1000">and for all </span><!--l. 2279--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>&#x03B4;</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mrow><mo 
class="MathClass-open">(</mo><mrow><mn>0</mn><mo 
class="MathClass-punc">,</mo><mn>1</mn></mrow><mo 
class="MathClass-close">]</mo></mrow></mrow></math><span 
class="ecti-1000">:</span>
</p><!--l. 2281--><p class="indent">   <!--l. 2281--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">
<mrow 
>
          <msub><mrow 
><mo 
>Pr</mo></mrow><mrow 
><msup><mrow 
><mi 
>D</mi></mrow><mrow 
><mi 
>m</mi></mrow></msup 
></mrow></msub 
> <mfenced separators="" 
open="("  close=")" ><mrow><mi 
>&#x2203;</mi><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-punc">:</mo>  <msub><mrow 
><mi 
>e</mi></mrow><mrow 
><mi 
>q</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">&#x2265;</mo><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo>&#x0302;</mo></mover></mrow><mrow 
><mi 
>q</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-bin">+</mo> <msqrt><mi 
></mi>
 <mrow><mfrac><mrow 
><!--mstyle 
class="text"--><mtext class="textrm">KL</mtext><!--/mstyle--> <mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>q</mi><mo 
class="MathClass-rel">&#x2223;</mo><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>p</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-bin">+</mo><mo 
> ln</mo> <!--nolimits--> <mfrac> <mrow 
> <mi 
>m</mi></mrow> 
<mrow 
><mi 
>&#x03B4;</mi></mrow></mfrac>  <mo 
class="MathClass-bin">+</mo> <mn>2</mn></mrow>
        <mrow 
><mn>2</mn><mi 
>m</mi> <mo 
class="MathClass-bin">&#x2212;</mo> <mn>1</mn></mrow></mfrac></mrow></msqrt>         </mrow></mfenced> <mo 
class="MathClass-rel">&#x2264;</mo> <mi 
>&#x03B4;</mi>
</mrow></math>
</p>
   </div>
   <div class="proof">
<!--l. 2286--><p class="indent">   <span class="head">
   <span 
class="eccc-1000">P<small 
class="small-caps">R</small><small 
class="small-caps">O</small><small 
class="small-caps">O</small><small 
class="small-caps">F</small>.</span> </span>Given in <span class="cite">[<a 
href="thesisli2.xml#XPB"><span 
class="ecbx-1000">39</span></a>]</span>. <span class="qed"><span 
class="msam-10">&#x25AB;</span></span>
</p>
   </div>
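<p class="indent">   To get a feel for the magnitude of the bound in Theorem 6.1.1, the right-hand side can be evaluated numerically. This is a sketch only: the function names and the example values of the sample size, the divergence and the confidence parameter are assumptions chosen for illustration.
</p>
<pre class="verbatim">
```python
import math

def pac_bayes_gap(kl, m, delta):
    """Square-root term of Theorem 6.1.1:
    sqrt((KL(q||p) + ln(m/delta) + 2) / (2m - 1))."""
    if not (m >= 1 and 1 >= delta > 0 and kl >= 0):
        raise ValueError("need m >= 1, delta in (0, 1] and kl >= 0")
    return math.sqrt((kl + math.log(m / delta) + 2.0) / (2.0 * m - 1.0))

def pac_bayes_bound(emp_err, kl, m, delta):
    """Upper bound on the expected true error that holds with
    probability at least 1 - delta over the draw of the m examples."""
    return emp_err + pac_bayes_gap(kl, m, delta)

# Illustrative values (assumed): 10000 examples, KL(q||p) = 5 nats,
# expected empirical error 0.10, confidence parameter delta = 0.05.
print(pac_bayes_gap(5.0, 10000, 0.05))
print(pac_bayes_bound(0.10, 5.0, 10000, 0.05))
```
</pre>
<p class="indent">   The gap grows with the divergence between the posterior and the prior and shrinks roughly as the inverse square root of the number of examples.
</p>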
<!--l. 2288--><p class="indent">   This PAC-Bayes bound is almost the same as the Occam&#x2019;s Razor bound (Theorem <a 
href="thesisse20.xml#x27-36001r1">4.6.1<!--tex4ht:ref: th-ORB --></a>)
when the distribution is peaked on a single hypothesis and the Occam&#x2019;s Razor bound is proved
using the looser Hoeffding inequality. This can be seen by noting that when <!--l. 2291--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>q</mi></mrow></math> places all of its mass on a single
hypothesis <!--l. 2292--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>h</mi></mrow></math>,
the KL-divergence satisfies <!--l. 2292--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><!--mstyle 
class="text"--><mtext class="textrm">KL</mtext><!--/mstyle--><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>q</mi><mo 
class="MathClass-rel">&#x2223;</mo><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>p</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo><mo 
> ln</mo><!--nolimits-->   <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfrac></mrow></math>.
Comparing with the Occam&#x2019;s Razor bound, we see that a (small) extra term of size <!--l. 2293--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mfrac><mrow 
><mo 
>ln</mo><!--nolimits--><mi 
>m</mi></mrow>
 <mrow 
><mi 
>m</mi></mrow></mfrac> </mrow></math> has
been introduced in return for the capability to average with respect to <span 
class="ecti-1000">any </span>posterior <!--l. 2295--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>. It is not yet clear
whether this <!--l. 2295--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mfrac><mrow 
><mo 
> ln</mo><!--nolimits--> <mi 
>m</mi></mrow>
 <mrow 
><mi 
>m</mi></mrow></mfrac>  </mrow></math>
term is necessary.
</p>
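<p class="indent">   The comparison for a point-mass posterior can be checked numerically. The sketch below assumes the Hoeffding form of the Occam&#x2019;s Razor bound, with gap sqrt((ln(1/p(h)) + ln(1/delta)) / (2m)); Theorem 4.6.1 is not reproduced in this section, so that form, the function names and the example values are all assumptions made for illustration.
</p>
<pre class="verbatim">
```python
import math

def pac_bayes_gap(kl, m, delta):
    """Square-root term of Theorem 6.1.1."""
    return math.sqrt((kl + math.log(m / delta) + 2.0) / (2.0 * m - 1.0))

def occam_hoeffding_gap(p_h, m, delta):
    """Assumed Hoeffding-style Occam's Razor gap for one hypothesis
    with prior mass p_h (not taken from this section)."""
    return math.sqrt((math.log(1.0 / p_h) + math.log(1.0 / delta)) / (2.0 * m))

p_h, delta = 0.001, 0.05       # illustrative prior mass and confidence
kl_point = math.log(1.0 / p_h)  # KL(q||p) when q is a point mass on h

for m in (100, 10000, 1000000):
    pb = pac_bayes_gap(kl_point, m, delta)
    oc = occam_hoeffding_gap(p_h, m, delta)
    # The difference of the squared gaps is of order ln(m)/m.
    print(m, pb, oc, pb ** 2 - oc ** 2, math.log(m) / m)
```
</pre>
<p class="indent">   Under these assumptions the PAC-Bayes gap is always the larger of the two, and the difference of the squared gaps shrinks at the rate of the extra term discussed above.
</p>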
   <div class="newtheorem">
<!--l. 2298--><p class="noindent"><span class="head">
<a 
  name="x38-58004r2"></a>
  <span 
class="eccc-1000">P<small 
class="small-caps">R</small><small 
class="small-caps">O</small><small 
class="small-caps">B</small><small 
class="small-caps">L</small><small 
class="small-caps">E</small><small 
class="small-caps">M</small> </span>6.1.2<span 
class="eccc-1000">.</span></span>
</p><!--l. 2299--><p class="indent">   (Open) Remove the <!--l. 2299--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mfrac><mrow 
><mo 
>ln</mo><!--nolimits--><mi 
>m</mi></mrow>
 <mrow 
><mi 
>m</mi></mrow></mfrac> </mrow></math>
term from the sample complexity.
</p>
   </div>
<!--l. 2301--><p class="indent">   The real power of the PAC-Bayes bound becomes apparent when the average
is over many hypotheses. This happens, for example, if the distribution <!--l. 2302--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> is
chosen using Bayes&#x2019; law or as a Gibbs distribution. One of the most interesting
aspects of the PAC-Bayes bound is that it holds for finite <span 
class="ecti-1000">and </span>infinite hypothesis
spaces.
</p>
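<p class="indent">   As an illustration of averaging over many hypotheses, the sketch below builds a Gibbs posterior over a small finite hypothesis space and evaluates the bound of Theorem 6.1.1 for it. The specific Gibbs form q(h) proportional to p(h) exp(-beta * empirical error), the parameter name <code>beta</code>, the function names and all numerical values are assumptions made for this example only. Because the theorem holds simultaneously for all posteriors, the posterior may depend on the data, while the prior may not.
</p>
<pre class="verbatim">
```python
import math

def gibbs_posterior(prior, emp_errors, beta):
    """q(h) proportional to p(h) * exp(-beta * emp_error(h)) (assumed form)."""
    weights = [ph * math.exp(-beta * eh) for ph, eh in zip(prior, emp_errors)]
    z = sum(weights)
    return [w / z for w in weights]

def kl_divergence(q, p):
    """Finite-support KL(q||p); assumes p(h) > 0 wherever q(h) > 0."""
    return sum(qh * math.log(qh / ph) for qh, ph in zip(q, p) if qh > 0.0)

def pac_bayes_bound(q, prior, emp_errors, m, delta):
    """Expected empirical error under q plus the gap of Theorem 6.1.1."""
    kl = kl_divergence(q, prior)
    avg_emp = sum(qh * eh for qh, eh in zip(q, emp_errors))
    return avg_emp + math.sqrt((kl + math.log(m / delta) + 2.0) / (2.0 * m - 1.0))

# Illustrative setup (assumed): four hypotheses, uniform prior.
prior = [0.25, 0.25, 0.25, 0.25]
emp_errors = [0.10, 0.12, 0.30, 0.45]
m, delta = 5000, 0.05

for beta in (0.0, 10.0, 100.0):
    q = gibbs_posterior(prior, emp_errors, beta)
    print(beta, kl_divergence(q, prior), pac_bayes_bound(q, prior, emp_errors, m, delta))
```
</pre>
<p class="indent">   With beta equal to zero the posterior coincides with the prior and the divergence vanishes; larger values concentrate the posterior on hypotheses with low empirical error at the price of a larger divergence.
</p>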
   <div class="crosslinks"><p class="noindent">[<a 
href="thesisse26.xml" >next</a>] [<a 
href="thesisse25.xml" >front</a>] [<a 
href="thesisch6.xml#thesisse25.xml" >up</a>] </p></div><a 
  name="tailthesisse25.xml"></a>  
</body> 
</html> 
