<?xml version="1.0"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "mathml.dtd"> 
<?xml-stylesheet type="text/css" href="thesis.css"?> 
<html  
xmlns="http://www.w3.org/1999/xhtml"  
><head><title>13.1 Theoretical setup</title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 
<meta name="generator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<meta name="originator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<!-- 3,early_,early^,xhtml,mozilla --> 
<meta name="src" content="thesis.tex" /> 
<meta name="date" content="2002-08-28 13:56:00" /> 
<link rel="stylesheet" type="text/css" href="thesis.css" /> 
</head><body 
>
   <div class="crosslinks"><p class="noindent">[<a 
href="thesisse58.xml" >next</a>] [<a 
href="#tailthesisse57.xml">tail</a>] [<a 
href="thesisch13.xml#thesisse57.xml" >up</a>] </p></div>
   <h3 class="sectionHead"><span class="titlemark">13.1. </span> <a 
  name="x79-11900013.1"></a>Theoretical setup</h3>
<!--l. 5027--><p class="noindent">We first present a modern neural network bound (the &#x201C;competition&#x201D;), then specialize the
PAC-Bayes bound to a stochastic neural network. A stochastic neural network is simply
a neural network in which each weight is drawn from some
distribution whenever it is used. The reason for constructing a stochastic neural network
is that it admits a <span 
class="ecti-1000">much </span>lower upper bound on the true error than the original neural network.
Furthermore, this is accomplished while increasing the empirical error rate only
marginally.
</p>
   <h4 class="subsectionHead"><span class="titlemark">13.1.1. </span> <a 
  name="x79-12000013.1.1"></a>Neural Network bound</h4>
<!--l. 5038--><p class="noindent">We will compare a specialization of the best current neural network true error rate
bound <span class="cite">[<a 
href="thesisli2.xml#XPanchenko"><span 
class="ecbx-1000">28</span></a>]</span> with our approach. The neural network bound is described in terms of the
following parameters:
</p><!--l. 5042--><p class="indent">
           </p><ol type="1" class="enumerate1" start="1" 
>
        <li class="enumerate"><a 
  name="x79-120002x1"></a>A margin, <!--l. 5043--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mn>0</mn> <mo 
class="MathClass-rel">&#x003C;</mo> <mi 
>&#x03B8;</mi> <mo 
class="MathClass-rel">&#x003C;</mo> <mn>1</mn></mrow></math>.
           </li>
        <li class="enumerate"><a 
  name="x79-120004x2"></a>A function <!--l. 5044--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>&#x03C6;</mi></mrow></math>
        defined by <!--l. 5044--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>&#x03C6;</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>x</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <mn>1</mn></mrow></math>
        if <!--l. 5044--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>x</mi> <mo 
class="MathClass-rel">&#x003C;</mo> <mn>0</mn></mrow></math>,
        <!--l. 5044--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>&#x03C6;</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>x</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <mn>0</mn></mrow></math>
        if <!--l. 5045--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>x</mi> <mo 
class="MathClass-rel">&#x003E;</mo> <mn>1</mn></mrow></math>,
        and linear in between.
           </li>
        <li class="enumerate"><a 
  name="x79-120006x3"></a><!--l. 5046--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><msub><mrow 
><mi 
>A</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
></mrow></math>,
        an upper bound on the sum of the magnitude of the weights in the <!--l. 5047--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
        <mrow 
><mi 
>i</mi></mrow></math>th
        layer of the neural network.
           </li>
        <li class="enumerate"><a 
  name="x79-120008x4"></a><!--l. 5048--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><msub><mrow 
><mi 
>L</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
></mrow></math>,
        a Lipschitz constant which holds for the <!--l. 5048--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>i</mi></mrow></math>th
        layer of the neural network. A Lipschitz constant is a bound on the
        magnitude of the derivative.
           </li>
        <li class="enumerate"><a 
  name="x79-120010x5"></a><!--l. 5050--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>d</mi></mrow></math>,
        the size of the input space.</li></ol>
<!--l. 5051--><p class="nopar"> With these parameters defined, we get the following bound.
</p>
   <div class="newtheorem">
<!--l. 5054--><p class="noindent"><span class="head">
<a 
  name="x79-120011r1"></a>
  <span 
class="eccc-1000">T<small 
class="small-caps">H</small><small 
class="small-caps">E</small><small 
class="small-caps">O</small><small 
class="small-caps">R</small><small 
class="small-caps">E</small><small 
class="small-caps">M</small> </span>13.1.1<span 
class="eccc-1000">.</span></span>
</p><!--l. 5055--><p class="indent">   <span 
class="ecti-1000">(2 Layer Feed-Forward Neural Network bound) For all </span><!--l. 5055--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>&#x03B4;</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mrow><mo 
class="MathClass-open">(</mo><mrow><mn>0</mn><mo 
class="MathClass-punc">,</mo><mn>1</mn></mrow><mo 
class="MathClass-close">]</mo></mrow></mrow></math>
<!--l. 5056--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">
<mrow 
>
                     <msub><mrow 
><mo 
>Pr</mo></mrow><mrow 
><mi 
>D</mi></mrow></msub 
> <mfenced separators="" 
open="("  close=")" ><mrow><mi 
>&#x2203;</mi><mi 
>h</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mi 
>H</mi> <mo 
class="MathClass-punc">:</mo>  <mi 
>e</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">&#x003E;</mo><msub><mrow 
><mo 
> inf</mo> </mrow><mrow 
><mi 
>&#x03B8;</mi></mrow></msub 
> <mi 
>b</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>&#x03B8;</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfenced> <mo 
class="MathClass-rel">&#x2264;</mo> <mi 
>&#x03B4;</mi>
</mrow></math>
<span 
class="ecti-1000">where </span><!--l. 5058--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>b</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>&#x03B8;</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo>  <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>m</mi></mrow></mfrac> <mo 
class="MathClass-op">&#x2211;</mo>
    <mi 
>&#x03C6;</mi> <mfenced separators="" 
open="("  close=")" ><mrow><mfrac><mrow 
><mi 
>y</mi><mi 
>h</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>x</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow>
   <mrow 
><mi 
>&#x03B8;</mi></mrow></mfrac>   </mrow></mfenced> <mo 
class="MathClass-bin">+</mo> <mfrac><mrow 
><mn>2</mn><msqrt><mi 
></mi>
 <mrow><mn>2</mn><mi 
>&#x03C0;</mi></mrow></msqrt></mrow> 
  <mrow 
><mi 
>&#x03B8;</mi></mrow></mfrac>   <mn>3</mn><mn>2</mn><msqrt><mi 
></mi>
 <mrow><mfrac><mrow 
><mi 
>d</mi><mo 
class="MathClass-bin">+</mo><mn>1</mn></mrow>
 <mrow 
><mi 
>m</mi></mrow></mfrac></mrow></msqrt> <msub><mrow 
><mi 
>L</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
><msub><mrow 
><mi 
>L</mi></mrow><mrow 
><mn>2</mn></mrow></msub 
><msub><mrow 
><mi 
>A</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
><msub><mrow 
><mi 
>A</mi></mrow><mrow 
><mn>2</mn></mrow></msub 
> <mo 
class="MathClass-bin">+</mo> <mfrac><mrow 
><msqrt><mi 
></mi><mrow><mfrac><mrow 
>
<mn>1</mn></mrow> 
<mrow 
><mn>2</mn></mrow></mfrac> <mo 
> ln</mo><!--nolimits--> <mfrac><mrow 
><mn>2</mn></mrow> 
<mrow 
><mi 
>&#x03B4;</mi></mrow></mfrac></mrow></msqrt><mo 
class="MathClass-bin">+</mo><mn>2</mn></mrow> 
     <mrow 
><msqrt><mi 
></mi><mrow><mi 
>m</mi></mrow></msqrt></mrow></mfrac>    </mrow></math>
</p>
   </div>
   <div class="proof">
<!--l. 5061--><p class="indent">   <span class="head">
   <span 
class="eccc-1000">P<small 
class="small-caps">R</small><small 
class="small-caps">O</small><small 
class="small-caps">O</small><small 
class="small-caps">F</small>.</span> </span>Given in <span class="cite">[<a 
href="thesisli2.xml#XPanchenko"><span 
class="ecbx-1000">28</span></a>]</span> <span class="qed"><span 
class="msam-10">&#x25AB;</span></span>
</p>
   </div>
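<p class="indent">   As a rough illustration of how the bound of Theorem 13.1.1 might be evaluated, the
following Python sketch computes b(&#x03B8;) from a list of margins y h(x). The names phi, nn_bound and
best_nn_bound, and the argument layout, are assumptions made for this example only; the constant 32 is
simply taken as written in the theorem. Taking the minimum over a finite grid of margins gives a value no
smaller than the infimum, so it remains a valid (possibly looser) bound.
</p>
<pre class="verbatim">
```python
import math

def phi(x):
    """Ramp function: 1 for x below 0, 0 for x above 1, linear in between."""
    return min(1.0, max(0.0, 1.0 - x))

def nn_bound(margins, theta, d, L1, L2, A1, A2, delta, const=32.0):
    """Evaluate b(theta) for a list of margins y * h(x)."""
    m = len(margins)
    # Empirical margin loss term.
    empirical = sum(phi(v / theta) for v in margins) / m
    # Complexity term driven by the Lipschitz constants and weight bounds.
    complexity = (2.0 * math.sqrt(2.0 * math.pi) / theta) * const \
        * math.sqrt((d + 1) / m) * L1 * L2 * A1 * A2
    # Confidence term.
    confidence = (math.sqrt(0.5 * math.log(2.0 / delta)) + 2.0) / math.sqrt(m)
    return empirical + complexity + confidence

def best_nn_bound(margins, thetas, **kw):
    """Minimum of b(theta) over a finite grid of candidate margins theta."""
    return min(nn_bound(margins, t, **kw) for t in thetas)
```
</pre>
<p class="indent">   For instance, with 100 examples that all have margin 1, d = 10, &#x03B8; = 0.5 and all
Lipschitz constants and weight bounds equal to 1, nn_bound returns a value far above 1, dominated by the
complexity term.
</p>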
<!--l. 5063--><p class="indent">   The theorem is actually only given up to a universal constant. &#x201C;<!--l. 5063--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mn>3</mn><mn>2</mn></mrow></math>&#x201D;
might be the right choice, but this is just an educated guess by the author <span class="cite">[<a 
href="thesisli2.xml#XDPP"><span 
class="ecbx-1000">42</span></a>]</span>. The
neural network true error bound above is (perhaps) the tightest known bound for
general feed-forward neural networks and so it is the natural bound to compare
with.
</p><!--l. 5069--><p class="indent">   This 2 layer feed-forward bound is not easily applied in a tight
manner because we cannot calculate a priori what our weight bound <!--l. 5070--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>A</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
></mrow></math>
should be. This can be patched up using the principle of structural
risk minimization. In particular, we can state the bound for <!--l. 5072--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>A</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
> <mo 
class="MathClass-rel">=</mo> <msup><mrow 
><mi 
>&#x03B1;</mi></mrow><mrow 
><mi 
>j</mi></mrow></msup 
></mrow></math> where <!--l. 5072--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>j</mi></mrow></math> is some positive integer
and <!--l. 5073--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>&#x03B1;</mi> <mo 
class="MathClass-rel">&#x003E;</mo> <mn>1</mn></mrow></math> is a constant. If
the <!--l. 5073--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>j</mi></mrow></math>th bound fails
with probability at most <!--l. 5074--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
>   <mfrac><mrow 
><mi 
>&#x03B4;</mi></mrow>
<mrow 
><mi 
>j</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>j</mi><mo 
class="MathClass-bin">+</mo><mn>1</mn></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfrac></mrow></math>,
then all bounds will hold simultaneously with probability at least <!--l. 5075--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mn>1</mn> <mo 
class="MathClass-bin">&#x2212;</mo> <mi 
>&#x03B4;</mi></mrow></math>, since <!--l. 5076--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">
<mrow 
><msubsup><mrow 
>
                         <mo 
class="MathClass-op">&#x2211;</mo>
                            </mrow><mrow 
><mi 
>j</mi><mo 
class="MathClass-rel">=</mo><mn>1</mn></mrow><mrow 
><mi 
>&#x221E;</mi></mrow></msubsup 
>   <mfrac><mrow 
><mn>1</mn></mrow>
<mrow 
><mi 
>j</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>j</mi> <mo 
class="MathClass-bin">+</mo> <mn>1</mn></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfrac> <mo 
class="MathClass-rel">=</mo> <mn>1</mn><mo 
class="MathClass-punc">.</mo>
</mrow></math> Applying this approach
to the values of both <!--l. 5078--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><msub><mrow 
><mi 
>A</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
></mrow></math>
and <!--l. 5078--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><msub><mrow 
><mi 
>A</mi></mrow><mrow 
><mn>2</mn></mrow></msub 
></mrow></math>,
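with <math xmlns="http://www.w3.org/1998/Math/MathML" mode="inline"><mrow><msub><mrow><mi>A</mi></mrow><mrow><mn>2</mn></mrow></msub> <mo class="MathClass-rel">=</mo> <msup><mrow><mi>&#x03B2;</mi></mrow><mrow><mi>k</mi></mrow></msup></mrow></math>
for a constant <math xmlns="http://www.w3.org/1998/Math/MathML" mode="inline"><mrow><mi>&#x03B2;</mi> <mo class="MathClass-rel">&#x003E;</mo> <mn>1</mn></mrow></math>
and a positive integer <math xmlns="http://www.w3.org/1998/Math/MathML" mode="inline"><mrow><mi>k</mi></mrow></math>,
we get the following corollary: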
</p>
   <div class="newtheorem">
<!--l. 5081--><p class="noindent"><span class="head">
<a 
  name="x79-120012r2"></a>
  <span 
class="eccc-1000">C<small 
class="small-caps">O</small><small 
class="small-caps">R</small><small 
class="small-caps">O</small><small 
class="small-caps">L</small><small 
class="small-caps">L</small><small 
class="small-caps">A</small><small 
class="small-caps">R</small><small 
class="small-caps">Y</small> </span>13.1.2<span 
class="eccc-1000">.</span></span>
</p><!--l. 5082--><p class="indent">   <span 
class="ecti-1000">(2 Layer Feed-Forward Neural Network bound) For all </span><!--l. 5082--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>&#x03B4;</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mrow><mo 
class="MathClass-open">(</mo><mrow><mn>0</mn><mo 
class="MathClass-punc">,</mo><mn>1</mn></mrow><mo 
class="MathClass-close">]</mo></mrow></mrow></math>
<!--l. 5083--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">
<mrow 
>
                  <msub><mrow 
><mo 
>Pr</mo></mrow><mrow 
><mi 
>D</mi></mrow></msub 
> <mfenced separators="" 
open="("  close=")" ><mrow><mi 
>&#x2203;</mi><mi 
>h</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mi 
>H</mi><mo 
class="MathClass-punc">,</mo><mi 
>j</mi><mo 
class="MathClass-punc">,</mo><mi 
>k</mi> <mo 
class="MathClass-punc">:</mo>  <mi 
>e</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">&#x003E;</mo><msub><mrow 
><mo 
> inf</mo> </mrow><mrow 
><mi 
>&#x03B8;</mi></mrow></msub 
> <mi 
>b</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>&#x03B8;</mi><mo 
class="MathClass-punc">,</mo><mi 
>j</mi><mo 
class="MathClass-punc">,</mo><mi 
>k</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfenced> <mo 
class="MathClass-rel">&#x2264;</mo> <mi 
>&#x03B4;</mi>
</mrow></math>
<span 
class="ecti-1000">where </span><!--l. 5085--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>b</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>&#x03B8;</mi><mo 
class="MathClass-punc">,</mo><mi 
>j</mi><mo 
class="MathClass-punc">,</mo><mi 
>k</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo>  <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>m</mi></mrow></mfrac> <mo 
class="MathClass-op">&#x2211;</mo>
    <mi 
>&#x03C6;</mi> <mfenced separators="" 
open="("  close=")" ><mrow><mfrac><mrow 
><mi 
>y</mi><mi 
>h</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>x</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow>
   <mrow 
><mi 
>&#x03B8;</mi></mrow></mfrac>   </mrow></mfenced> <mo 
class="MathClass-bin">+</mo> <mfrac><mrow 
><mn>2</mn><msqrt><mi 
></mi>
 <mrow><mn>2</mn><mi 
>&#x03C0;</mi></mrow></msqrt></mrow> 
  <mrow 
><mi 
>&#x03B8;</mi></mrow></mfrac>   <mn>3</mn><mn>2</mn><msqrt><mi 
></mi>
 <mrow><mfrac><mrow 
><mi 
>d</mi><mo 
class="MathClass-bin">+</mo><mn>1</mn></mrow>
 <mrow 
><mi 
>m</mi></mrow></mfrac></mrow></msqrt> <msub><mrow 
><mi 
>L</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
><msub><mrow 
><mi 
>L</mi></mrow><mrow 
><mn>2</mn></mrow></msub 
><msup><mrow 
><mi 
>&#x03B1;</mi></mrow><mrow 
><mi 
>j</mi></mrow></msup 
><msup><mrow 
><mi 
>&#x03B2;</mi></mrow><mrow 
><mi 
>k</mi></mrow></msup 
> <mo 
class="MathClass-bin">+</mo> <mfrac><mrow 
><msqrt><mi 
></mi>
 <mrow><mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mn>2</mn></mrow></mfrac> <mo 
> ln</mo><!--nolimits--> <mfrac><mrow 
><mn>2</mn><mi 
>j</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>j</mi><mo 
class="MathClass-bin">+</mo><mn>1</mn></mrow><mo 
class="MathClass-close">)</mo></mrow><mi 
>k</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>k</mi><mo 
class="MathClass-bin">+</mo><mn>1</mn></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow> 
         <mrow 
><mi 
>&#x03B4;</mi></mrow></mfrac></mrow></msqrt>       <mo 
class="MathClass-bin">+</mo><mn>2</mn></mrow> 
             <mrow 
><msqrt><mi 
></mi><mrow><mi 
>m</mi></mrow></msqrt></mrow></mfrac>           </mrow></math>
</p>
   </div>
   <div class="proof">
<!--l. 5088--><p class="indent">   <span class="head">
   <span 
class="eccc-1000">P<small 
class="small-caps">R</small><small 
class="small-caps">O</small><small 
class="small-caps">O</small><small 
class="small-caps">F</small>.</span> </span>Apply the union bound to all possible values of <!--l. 5088--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>j</mi></mrow></math>
and <!--l. 5088--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>k</mi></mrow></math>
as discussed above. <span class="qed"><span 
class="msam-10">&#x25AB;</span></span>
</p>
   </div>
<!--l. 5091--><p class="indent">   In practice, we will use <!--l. 5091--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>&#x03B1;</mi> <mo 
class="MathClass-rel">=</mo> <mi 
>&#x03B2;</mi> <mo 
class="MathClass-rel">=</mo> <mn>1</mn><mo 
class="MathClass-punc">.</mo><mn>1</mn></mrow></math>
and report the value of the tightest applicable bound over all <!--l. 5092--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>j</mi><mo 
class="MathClass-punc">,</mo><mi 
>k</mi></mrow></math>.
</p>
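<p class="indent">   The selection of the tightest applicable bound can be sketched as follows. Since
b(&#x03B8;, j, k) grows with both j and k, the tightest bound that applies to a trained network is the one
with the smallest indices whose grid values still dominate the observed weight sums. The function names and
the convention that indices start at 1 are assumptions of this sketch.
</p>
<pre class="verbatim">
```python
import math

def smallest_index(a, base=1.1):
    """Smallest positive integer j with base**j >= a."""
    j = 1
    while a > base ** j:
        j += 1
    return j

def srm_terms(A1, A2, delta, alpha=1.1, beta=1.1):
    """Grid indices and the two quantities they change in Corollary 13.1.2."""
    j = smallest_index(A1, alpha)
    k = smallest_index(A2, beta)
    # alpha**j * beta**k takes the place of A1 * A2 in the complexity term.
    weight_product = alpha ** j * beta ** k
    # The union bound replaces ln(2 / delta) by this larger logarithm.
    log_term = math.log(2.0 * j * (j + 1) * k * (k + 1) / delta)
    return j, k, weight_product, log_term
```
</pre>
<p class="indent">   The price of not knowing the weight bounds in advance is therefore at most a factor
of &#x03B1;&#x03B2; in the complexity term plus a logarithmic increase in the confidence term.
</p>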
   <h4 class="subsectionHead"><span class="titlemark">13.1.2. </span> <a 
  name="x79-12100013.1.2"></a>Stochastic Neural Network bound</h4>
<!--l. 5097--><p class="noindent">We will specialize a PAC-Bayes bound (<a 
href="thesisse26.xml#x39-59001r1">6.2.1<!--tex4ht:ref: th-repbb --></a>) for application to a stochastic neural
network with a choice of the &#x201C;prior&#x201D;. Our &#x201C;prior&#x201D; will be zero on all neural net structures
other than the one we train and a multidimensional isotropic Gaussian on the values of
the weights in our neural network. The multidimensional Gaussian will have a mean of <!--l. 5101--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mn>0</mn></mrow></math> and a variance in each
dimension of <!--l. 5102--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><msup><mrow 
><mi 
>b</mi></mrow><mrow 
><mn>2</mn></mrow></msup 
></mrow></math>.
This choice is made for convenience and happens to provide good results.
</p><!--l. 5105--><p class="indent">   The optimal value of <!--l. 5105--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>b</mi></mrow></math>
is unknown and depends on the learning problem, so we wish to
choose it in an example-dependent manner. We can do this using the same
trick as for the original neural net bound: use a sequence of bounds where <!--l. 5108--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>b</mi> <mo 
class="MathClass-rel">=</mo> <mi 
>c</mi><msup><mrow 
><mi 
>&#x03B1;</mi></mrow><mrow 
><mi 
>j</mi></mrow></msup 
></mrow></math> for <!--l. 5108--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>c</mi></mrow></math> and <!--l. 5108--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>&#x03B1;</mi></mrow></math> some constants and <!--l. 5109--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>j</mi></mrow></math> a positive integer.
For the <!--l. 5109--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>j</mi></mrow></math>th
bound, set <!--l. 5109--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>&#x03B4;</mi> <mo 
class="MathClass-rel">&#x2192;</mo>  <mfrac><mrow 
><mi 
>&#x03B4;</mi></mrow> 
<mrow 
><mi 
>j</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>j</mi><mo 
class="MathClass-bin">+</mo><mn>1</mn></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfrac></mrow></math>.
The union bound will imply that all bounds hold simultaneously with probability at least <!--l. 5111--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mn>1</mn> <mo 
class="MathClass-bin">&#x2212;</mo> <mi 
>&#x03B4;</mi></mrow></math>.
</p><!--l. 5113--><p class="indent">   Assuming that our &#x201C;posterior&#x201D; <!--l. 5113--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>Q</mi></mrow></math>
is also defined by a multidimensional Gaussian with the mean and variance in each dimension defined
by <!--l. 5114--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><msub><mrow 
><mi 
>w</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
></mrow></math>
and <!--l. 5115--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><msubsup><mrow 
><mi 
>s</mi></mrow><mrow 
><mi 
>i</mi></mrow><mrow 
><mn>2</mn></mrow></msubsup 
></mrow></math>,
we can specialize to the following corollary:
</p>
   <div class="newtheorem">
<!--l. 5117--><p class="noindent"><span class="head">
<a 
  name="x79-121001r3"></a>
  <span 
class="eccc-1000">C<small 
class="small-caps">O</small><small 
class="small-caps">R</small><small 
class="small-caps">O</small><small 
class="small-caps">L</small><small 
class="small-caps">L</small><small 
class="small-caps">A</small><small 
class="small-caps">R</small><small 
class="small-caps">Y</small> </span>13.1.3<span 
class="eccc-1000">.</span></span>
</p><!--l. 5118--><p class="indent">   <span 
class="ecti-1000">(Stochastic Neural Network bound) Let </span><!--l. 5118--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>k</mi></mrow></math> <span 
class="ecti-1000">be the number of weights</span>
<span 
class="ecti-1000">in a neural network, </span><!--l. 5119--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><msub><mrow 
><mi 
>w</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
></mrow></math>
<span 
class="ecti-1000">be the </span><!--l. 5119--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>i</mi></mrow></math><span 
class="ecti-1000">th weight</span>
<span 
class="ecti-1000">and </span><!--l. 5120--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><msubsup><mrow 
><mi 
>s</mi></mrow><mrow 
><mi 
>i</mi></mrow><mrow 
><mn>2</mn></mrow></msubsup 
></mrow></math> <span 
class="ecti-1000">be the</span>
<span 
class="ecti-1000">variance of the </span><!--l. 5120--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>i</mi></mrow></math><span 
class="ecti-1000">th</span>
<span 
class="ecti-1000">weight. Then, we have:</span> </p><table class="equation"><tr><td> <a 
  name="x79-121002r1"></a>
<!--l. 5121--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">      
<msub><mrow 
><mo 
>Pr</mo></mrow><mrow 
><mi 
>D</mi></mrow></msub 
> <mfenced separators="" 
open="("  close=")" ><mrow><mi 
>&#x2203;</mi><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-punc">:</mo>  <!--mstyle 
class="text"--><mtext class="textrm">KL</mtext><!--/mstyle--><mrow><mo 
class="MathClass-open">(</mo><mrow><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo>&#x0302;</mo></mover></mrow><mrow 
><mi 
>q</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
class="MathClass-rel">&#x2223;</mo><mo 
class="MathClass-rel">&#x2223;</mo><msub><mrow 
><mi 
>e</mi></mrow><mrow 
><mi 
>q</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">&#x2265;</mo><msub><mrow 
><mo 
> inf</mo> </mrow><mrow 
><mi 
>j</mi></mrow></msub 
><mfrac><mrow 
><msubsup><mrow 
> <mo 
class="MathClass-op">&#x2211;</mo>
     </mrow><mrow 
><mi 
>i</mi><mo 
class="MathClass-rel">=</mo><mn>1</mn></mrow><mrow 
><mi 
>k</mi></mrow></msubsup 
> <mfenced separators="" 
open="["  close="]" ><mrow><mo 
>ln</mo><!--nolimits--> <mfrac><mrow 
><mi 
>c</mi><msup><mrow 
><mi 
>&#x03B1;</mi></mrow><mrow 
><mi 
>j</mi></mrow></msup 
></mrow> 
 <mrow 
><msub><mrow 
><mi 
>s</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
></mrow></mfrac>  <mo 
class="MathClass-bin">+</mo> <mfrac><mrow 
><msubsup><mrow 
><mi 
>s</mi></mrow><mrow 
><mi 
>i</mi></mrow><mrow 
><mn>2</mn></mrow></msubsup 
><mo 
class="MathClass-bin">+</mo><msubsup><mrow 
><mi 
>w</mi></mrow><mrow 
>
<mi 
>i</mi></mrow><mrow 
><mn>2</mn></mrow></msubsup 
></mrow> 
 <mrow 
><mn>2</mn><msup><mrow 
><mi 
>c</mi></mrow><mrow 
><mn>2</mn></mrow></msup 
><msup><mrow 
><mi 
>&#x03B1;</mi></mrow><mrow 
><mn>2</mn><mi 
>j</mi></mrow></msup 
></mrow></mfrac>  <mo 
class="MathClass-bin">&#x2212;</mo><mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mn>2</mn></mrow></mfrac></mrow></mfenced> <mo 
class="MathClass-bin">+</mo><mo 
> ln</mo><!--nolimits--> <mfrac><mrow 
><mi 
>j</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>j</mi><mo 
class="MathClass-bin">+</mo><mn>1</mn></mrow><mo 
class="MathClass-close">)</mo></mrow><mi 
>m</mi></mrow> 
    <mrow 
><mi 
>&#x03B4;</mi></mrow></mfrac>     </mrow> 
                        <mrow 
><mi 
>m</mi> <mo 
class="MathClass-bin">&#x2212;</mo> <mn>1</mn></mrow></mfrac>                       </mrow></mfenced> <mo 
class="MathClass-rel">&#x2264;</mo> <mi 
>&#x03B4;</mi>
</math>
<!--l. 5124--><p class="nopar"></p></td><td class="eq-no">(13.1.1)</td></tr></table>
   </div>
   <div class="proof">
<!--l. 5128--><p class="indent">   <span class="head">
   <span 
class="eccc-1000">P<small 
class="small-caps">R</small><small 
class="small-caps">O</small><small 
class="small-caps">O</small><small 
class="small-caps">F</small>.</span> </span>Analytic calculation of the KL divergence between two multidimensional
Gaussians and the union bound applied for each value of <!--l. 5129--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>j</mi></mrow></math>.
<span class="qed"><span 
class="msam-10">&#x25AB;</span></span>
</p>
   </div>
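<p class="indent">   The right-hand side of (13.1.1) is straightforward to evaluate once the posterior
means and variances are known. The following Python sketch is one way to do so; the function names and the
cut-off max_j are assumptions of this example. Restricting the infimum to finitely many values of j can only
increase the result, so the value returned is still a valid threshold.
</p>
<pre class="verbatim">
```python
import math

def kl_gaussian_to_prior(w, s2, b):
    """KL divergence from N(w, diag(s2)) to N(0, b^2 I), summed over weights."""
    total = 0.0
    for wi, s2i in zip(w, s2):
        total += (math.log(b / math.sqrt(s2i))
                  + (s2i + wi * wi) / (2.0 * b * b) - 0.5)
    return total

def pac_bayes_rhs(w, s2, m, delta, c=0.2, alpha=1.1, max_j=200):
    """Smallest right-hand side of (13.1.1) over j = 1, ..., max_j."""
    best = math.inf
    for j in range(1, max_j + 1):
        b = c * alpha ** j  # prior standard deviation for the j-th bound
        rhs = (kl_gaussian_to_prior(w, s2, b)
               + math.log(j * (j + 1) * m / delta)) / (m - 1)
        best = min(best, rhs)
    return best
```
</pre>
<p class="indent">   Each weight contributes nothing to the sum when its posterior coincides with the
prior, which is why weights that tolerate a large variance are cheap.
</p>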
<!--l. 5131--><p class="indent">   We will choose <!--l. 5131--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>&#x03B1;</mi> <mo 
class="MathClass-rel">=</mo> <mn>1</mn><mo 
class="MathClass-punc">.</mo><mn>1</mn></mrow></math>
and <!--l. 5131--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>c</mi> <mo 
class="MathClass-rel">=</mo> <mn>0</mn><mo 
class="MathClass-punc">.</mo><mn>2</mn></mrow></math>
as reasonable default values.
</p><!--l. 5133--><p class="indent">   One more step is necessary in order to apply this bound. The essential difficulty is evaluating <!--l. 5134--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover></mrow><mrow 
><mi 
>q</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>. This
quantity is observable although calculating it to high precision is difficult. We will use
the Monte Carlo sampling technique of Section <a 
href="thesisse27.xml#x41-610006.3.1">6.3.1<!--tex4ht:ref: pb-approx --></a> in order to bound the value of <!--l. 5136--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover></mrow><mrow 
><mi 
>q</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
and then use the bound on this value in the PAC-Bayes bound. We use <!--l. 5137--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>n</mi> <mo 
class="MathClass-rel">=</mo> <mn>1</mn><mn>0</mn><mn>0</mn><mn>0</mn></mrow></math>
evaluations of the empirical error rate of the stochastic neural network.
</p>
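<p class="indent">   Once an upper bound on the empirical error of the stochastic network and the
right-hand side of (13.1.1) are available, a true error bound follows by inverting the KL divergence. The
sketch below does this by bisection; it is a generic numerical recipe with assumed function names, and it
does not reproduce the sampling technique of Section 6.3.1.
</p>
<pre class="verbatim">
```python
import math

def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if 1.0 > q:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def kl_upper_inverse(q_hat, bound, tol=1e-9):
    """Largest p >= q_hat whose divergence from q_hat does not exceed bound."""
    lo, hi = q_hat, 1.0
    # binary_kl(q_hat, p) increases with p on [q_hat, 1], so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q_hat, mid) > bound:
            hi = mid
        else:
            lo = mid
    return hi  # the upper end is the conservative choice
```
</pre>
<p class="indent">   With an empirical error of 0 and a right-hand side of ln 2, for example, the
resulting true error bound is 1/2.
</p>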
   <h4 class="subsectionHead"><span class="titlemark">13.1.3. </span> <a 
  name="x79-12200013.1.3"></a>Distribution Construction algorithm</h4>
<!--l. 5143--><p class="noindent">One critical step is missing in the description: how do we calculate the multidimensional Gaussian,
<!--l. 5144--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">     <mrow 
><mi 
>Q</mi></mrow></math>?
The variance of the posterior Gaussian must be chosen separately for each weight in order to
achieve a tight bound, since we do not want &#x201C;meaningless&#x201D; weights to contribute
significantly to the overall sample complexity. We use a simple greedy algorithm to find
the appropriate variance in each dimension.
</p><!--l. 5149--><p class="indent">
           </p><ol type="1" class="enumerate1" start="1" 
>
        <li class="enumerate"><a 
  name="x79-122002x1"></a>Train a neural net on the examples.
           </li>
        <li class="enumerate"><a 
  name="x79-122004x2"></a>For every weight, <!--l. 5151--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><msub><mrow 
><mi 
>w</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
></mrow></math>,
        search for the variance, <!--l. 5151--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><msubsup><mrow 
><mi 
>s</mi></mrow><mrow 
><mi 
>i</mi></mrow><mrow 
><mn>2</mn></mrow></msubsup 
></mrow></math>,
        which reduces the empirical accuracy of the trained network by <!--l. 5152--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mn>1</mn><mi 
>%</mi></mrow></math>
        (for example) while holding all other weights fixed.
           </li>
        <li class="enumerate"><a 
  name="x79-122006x3"></a>The stochastic neural network defined by <!--l. 5154--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mrow><mo 
class="MathClass-open">{</mo><mrow><msub><mrow 
><mi 
>w</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
><mo 
class="MathClass-punc">,</mo> <msubsup><mrow 
><mi 
>s</mi></mrow><mrow 
><mi 
>i</mi></mrow><mrow 
><mn>2</mn></mrow></msubsup 
></mrow><mo 
class="MathClass-close">}</mo></mrow></mrow></math>
        will generally have too large an empirical error. Therefore, we calculate a
        global multiplier <!--l. 5156--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>&#x03BB;</mi> <mo 
class="MathClass-rel">&#x003C;</mo> <mn>1</mn></mrow></math>
        such that the stochastic neural network defined by <!--l. 5157--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mrow><mo 
class="MathClass-open">{</mo><mrow><msub><mrow 
><mi 
>w</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
><mo 
class="MathClass-punc">,</mo> <mi 
>&#x03BB;</mi><msubsup><mrow 
><mi 
>s</mi></mrow><mrow 
><mi 
>i</mi></mrow><mrow 
><mn>2</mn></mrow></msubsup 
></mrow><mo 
class="MathClass-close">}</mo></mrow></mrow></math>
        decreases the empirical accuracy by only <!--l. 5158--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mn>1</mn><mi 
>%</mi></mrow></math>.
           </li>
        <li class="enumerate"><a 
  name="x79-122008x4"></a>Then, we evaluate the empirical error rate of the resulting stochastic
        neural net with <!--l. 5160--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mn>1</mn><mn>0</mn><mn>0</mn><mn>0</mn></mrow></math>
        samples from the stochastic neural network.</li></ol>
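<p class="indent">   The steps above can be sketched on a toy problem. The following Python code uses a
linear threshold unit in place of a trained neural network, a geometric grid for the per-weight variance
search, and repeated halving for the global multiplier; all three choices, along with the function names,
are assumptions of this illustration and are not the procedure used in the experiments.
</p>
<pre class="verbatim">
```python
import numpy as np

def stochastic_accuracy(w, s2, X, y, rng, n_samples=200):
    """Mean accuracy of sign(X @ w') over draws w' ~ N(w, diag(s2))."""
    noise = rng.standard_normal((n_samples, w.size)) * np.sqrt(s2)
    preds = np.sign((w + noise) @ X.T)  # shape (n_samples, n_examples)
    return float(np.mean(preds == y))

def variance_for_weight(w, i, X, y, rng, target, grid):
    """Step 2: largest grid variance on weight i alone before accuracy drops below target."""
    best = grid[0]
    for v in grid:  # the grid is increasing
        s2 = np.zeros_like(w)
        s2[i] = v
        if target > stochastic_accuracy(w, s2, X, y, rng):
            break
        best = v
    return best

def build_posterior(w, X, y, rng, drop=0.01):
    """Steps 2 to 4 for an already trained weight vector w (step 1 is assumed done)."""
    base = float(np.mean(np.sign(X @ w) == y))
    target = base - drop
    grid = np.geomspace(1e-4, 1e2, 25)
    s2 = np.array([variance_for_weight(w, i, X, y, rng, target, grid)
                   for i in range(w.size)])
    lam = 1.0
    # Step 3: shrink all variances together until the joint accuracy recovers.
    while lam > 1e-6 and target > stochastic_accuracy(w, lam * s2, X, y, rng):
        lam *= 0.5
    # Step 4: estimate the stochastic error rate with 1000 samples.
    err = 1.0 - stochastic_accuracy(w, lam * s2, X, y, rng, n_samples=1000)
    return s2, lam, err
```
</pre>
<p class="indent">   A weight attached to an input that is always zero never changes a prediction, so
the search assigns it the largest variance on the grid, which is the behaviour wanted for
&#x201C;meaningless&#x201D; weights.
</p>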
<!--l. 5161--><p class="nopar">
</p>
   <div class="crosslinks"><p class="noindent">[<a 
href="thesisse58.xml" >next</a>] [<a 
href="thesisse57.xml" >front</a>] [<a 
href="thesisch13.xml#thesisse57.xml" >up</a>] </p></div><a 
  name="tailthesisse57.xml"></a>   
</body> 
</html> 
