<?xml version="1.0"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "mathml.dtd"> 
<?xml-stylesheet type="text/css" href="thesis.css"?> 
<html  
xmlns="http://www.w3.org/1999/xhtml"  
><head><title>13 Neural Networks</title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 
<meta name="generator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<meta name="originator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<!-- 3,early_,early^,xhtml,mozilla --> 
<meta name="src" content="thesis.tex" /> 
<meta name="date" content="2002-08-28 13:56:00" /> 
<link rel="stylesheet" type="text/css" href="thesis.css" /> 
</head><body 
>
   <div class="crosslinks"><p class="noindent">[<a 
href="thesisch14.xml" >next</a>] [<a 
href="thesisch12.xml" >prev</a>] [<a 
href="thesisch12.xml#tailthesisch12.xml" >prev-tail</a>] [<a 
href="#tailthesisch13.xml">tail</a>] [<a 
href="thesispa3.xml#thesisch13.xml" >up</a>] </p></div>
   <h2 class="chapterHead"><span class="titlemark">Chapter&#x00A0;13</span><br /><a 
  name="x78-11800013"></a>Neural Networks</h2>
<!--l. 4972--><p class="noindent">This work is joint with Rich Caruana and was published at NIPS <span class="cite">[<a 
href="thesisli2.xml#XSNN"><span 
class="ecbx-1000">32</span></a>]</span>.
</p><!--l. 4974--><p class="indent">   Estimating the true error rate of a continuous-valued classifier can be surprisingly
difficult. For example, all known bounds on the true error rate of artificial neural
networks tend to be extremely loose and often yield the meaningless bound of &#x201C;always
err&#x201D; (error rate = 1.0). Figure&#x00A0;<a 
href="thesisse58.xml#x80-1230011">13.2.1<!--tex4ht:ref: bound_vs_epoch --></a> demonstrates this.
</p><!--l. 4980--><p class="indent">   The approach here is to <span 
class="ecti-1000">not </span>bound the true error rate of a neural network. Instead,
we bound the true error rate of a related distribution over neural networks
which we create by analyzing one neural network. The stochastic bound
approach proves much more fruitful than trying to bound the true error rate
of an individual network. The best current approaches <span class="cite">[<span 
class="ecbx-1000">?</span>]</span><span class="cite">[<a 
href="thesisli2.xml#XPanchenko"><span 
class="ecbx-1000">28</span></a>]</span> often require <!--l. 4985--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mn>1000</mn></mrow></math>, <!--l. 4985--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mn>10000</mn></mrow></math>, or
more examples before producing a nontrivial bound on the true error rate. We produce
nontrivial bounds on the true error rate of a stochastic neural network with fewer than <!--l. 4987--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mn>100</mn></mrow></math>
examples.
</p><!--l. 4989--><p class="indent">   Our approach uses a PAC-Bayes bound such as Theorem&#x00A0;<a 
href="thesisse26.xml#x39-59001r1">6.2.1<!--tex4ht:ref: th-repbb --></a>. The approach can
be thought of as a redivision of the work between the experimenter and the
theoretician: we make the experimenter work harder so that the theoretician&#x2019;s
true error bound becomes much tighter. This &#x201C;extra work&#x201D; on the part of the
experimenter is significant, but tractable, and the resulting bounds are <span 
class="ecti-1000">much</span>
tighter.
</p><!--l. 4996--><p class="indent">   An alternative viewpoint is that the classification problem <span 
class="ecti-1000">is </span>finding a hypothesis
with a low upper bound on the future error rate. We present a post-processing phase for
neural networks which results in a classifier with a much lower upper bound on the
future error rate. The post-processing can be used with any artificial neural net trained
with any optimization method; it does not require that the learning procedure be
modified or re-run, or even that the threshold function be differentiable. In fact, this
post-processing step can easily be adapted to other continuous-valued learning
algorithms.
</p><!--l. 5005--><p class="indent">   The post-processing step finds a &#x201C;large&#x201D; distribution over classifiers that has a
small <span 
class="ecti-1000">average </span>empirical error rate. Given the average empirical error rate, it is
straightforward to apply the PAC-Bayes bound in order to find a bound on the <span 
class="ecti-1000">average</span>
true error rate. We can find this large distribution over classifiers by performing a simple
noise sensitivity analysis on the learned model. The noise model allows us to generate a
distribution of classifiers with a known, small, average empirical error rate. We refer to
the distribution of neural nets that results from this noise analysis as a &#x201C;stochastic&#x201D;
neural net model.
</p><!--l. 5015--><p class="indent">   Why do we expect the PAC-Bayes bound to be a significant improvement over standard
covering number and VC bound approaches? There exist learning problems for which the
gap between the lower bound and the PAC-Bayes upper bound is only <!--l. 5018--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>O</mi> <mfenced separators="" 
open="("  close=")" ><mrow><mfrac><mrow 
><mo 
> ln</mo><!--nolimits--> <mi 
>m</mi></mrow>
 <mrow 
><mi 
>m</mi></mrow></mfrac>  </mrow></mfenced></mrow></math> where
<!--l. 5018--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">     <mrow 
><mi 
>m</mi></mrow></math>
is the number of training examples. This is superior to the guarantees that can be made
for typical covering number bounds, where the gap is, at best, known up to an
(asymptotic) constant. The guarantee that PAC-Bayes bounds are sometimes quite tight
encourages us to apply them here.
</p>
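<p class="indent">   The noise-sensitivity post-processing and the bound computation described above can be
illustrated with a minimal sketch. It rests on several assumptions that are not taken from
this chapter: a linear threshold classifier stands in for the neural network, the
perturbation is isotropic Gaussian noise on the weights, the prior is a zero-mean Gaussian,
and the bound is taken in the relative-entropy form KL(avg. empirical error || avg. true
error) at most (KL(Q||P) + ln((m+1)/delta))/m, which may differ in detail from the theorem
cited above. All function and parameter names are hypothetical.
</p>
<pre class="verbatim">
```python
import math
import random


def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p), in nats."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if 1.0 > q:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def kl_upper_inverse(q, budget, iters=60):
    """Approximate the largest p >= q with binary_kl(q, p) within budget (bisection)."""
    lo, hi = q, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) > budget:
            hi = mid
        else:
            lo = mid
    return hi


def gaussian_kl(weights, sigma, prior_sigma):
    """KL(Q||P) for Q = N(weights, sigma^2 I) and P = N(0, prior_sigma^2 I) (assumed forms)."""
    return sum(
        math.log(prior_sigma / sigma)
        + (sigma ** 2 + w ** 2) / (2.0 * prior_sigma ** 2)
        - 0.5
        for w in weights
    )


def stochastic_error(weights, sigma, data, draws, rng):
    """Monte Carlo estimate of the average empirical error under weight noise."""
    mistakes = 0
    for _ in range(draws):
        noisy = [w + rng.gauss(0.0, sigma) for w in weights]
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(noisy, x))
            predicted = 1 if score > 0.0 else -1
            if predicted != y:
                mistakes += 1
    return mistakes / (draws * len(data))


def pac_bayes_bound(avg_emp_error, kl_q_p, m, delta):
    """Upper bound on the average true error, assuming the relative-entropy form."""
    budget = (kl_q_p + math.log((m + 1) / delta)) / m
    return kl_upper_inverse(avg_emp_error, budget)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy, linearly separable data with labels in {-1, +1} (illustrative only).
    data = [((1.0, 2.0), 1), ((2.0, 1.5), 1), ((-1.0, -2.0), -1), ((-2.0, -0.5), -1)]
    weights = [1.0, 1.0]
    sigma, prior_sigma, delta = 0.1, 1.0, 0.05
    avg_err = stochastic_error(weights, sigma, data, draws=200, rng=rng)
    kl = gaussian_kl(weights, sigma, prior_sigma)
    print(avg_err, kl, pac_bayes_bound(avg_err, kl, len(data), delta))
```
</pre>
<p class="indent">   The sketch treats the Monte Carlo estimate as if it were the exact average empirical
error; a rigorous bound would also have to account for that sampling error. Increasing the
noise level enlarges the distribution, which lowers KL(Q||P), but tends to raise the average
empirical error, so the noise level trades one term of the bound against the other.
</p>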
   <div class="sectionTOCS"><span class="sectionToc">&#x00A0;13.1.&#x00A0;&#x00A0;<a 
href="thesisse57.xml#x79-11900013.1" name="QQ2-79-138">Theoretical setup</a></span><br /><span class="subsectionToc">&#x00A0;&#x00A0;&#x00A0;13.1.1.&#x00A0;&#x00A0;<a 
href="thesisse57.xml#x79-12000013.1.1" name="QQ2-79-139">Neural Network bound</a></span><br /><span class="subsectionToc">&#x00A0;&#x00A0;&#x00A0;13.1.2.&#x00A0;&#x00A0;<a 
href="thesisse57.xml#x79-12100013.1.2" name="QQ2-79-140">Stochastic Neural Network
bound</a></span><br /><span class="subsectionToc">&#x00A0;&#x00A0;&#x00A0;13.1.3.&#x00A0;&#x00A0;<a 
href="thesisse57.xml#x79-12200013.1.3" name="QQ2-79-141">Distribution Construction algorithm</a></span><br /><span class="sectionToc">&#x00A0;13.2.&#x00A0;&#x00A0;<a 
href="thesisse58.xml#x80-12300013.2" name="QQ2-80-142">Experimental Results</a></span><br /><span class="sectionToc">&#x00A0;13.3.&#x00A0;&#x00A0;<a 
href="thesisse59.xml#x81-12400013.3" name="QQ2-81-145">Conclusion</a></span><br />
   </div>


                                                                     

                                                                     
   <a 
  name="tailthesisch13.xml"></a>  
</body> 
</html> 
