<?xml version="1.0"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "mathml.dtd"> 
<?xml-stylesheet type="text/css" href="thesis.css"?> 
<html  
xmlns="http://www.w3.org/1999/xhtml"  
><head><title>5.2 The Simple Microchoice Bound</title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 
<meta name="generator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<meta name="originator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<!-- 3,early_,early^,xhtml,mozilla --> 
<meta name="src" content="thesis.tex" /> 
<meta name="date" content="2002-08-28 13:56:00" /> 
<link rel="stylesheet" type="text/css" href="thesis.css" /> 
</head><body 
>
   <h3 class="sectionHead"><span class="titlemark">5.2. </span> <a 
  name="x32-400005.2"></a>The Simple Microchoice Bound</h3>
<!--l. 1448--><p class="noindent">The simple microchoice bound provides an easy and natural way to select a measure <!--l. 1449--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> for
learning algorithms that operate by making a series of small choices. In particular,
consider a learning algorithm that works by making a sequence of choices, <!--l. 1451--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>c</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
><mo 
class="MathClass-punc">,</mo><mo 
class="MathClass-punc">.</mo><mo 
class="MathClass-punc">.</mo><mo 
class="MathClass-punc">.</mo><mo 
class="MathClass-punc">,</mo><msub><mrow 
><mi 
>c</mi></mrow><mrow 
><mi 
>d</mi></mrow></msub 
></mrow></math>, from a sequence of choice
sets, <!--l. 1452--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
><mo 
class="MathClass-punc">,</mo><mo 
class="MathClass-punc">.</mo><mo 
class="MathClass-punc">.</mo><mo 
class="MathClass-punc">.</mo><mo 
class="MathClass-punc">,</mo><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mi 
>d</mi></mrow></msub 
></mrow></math>, finally producing
a hypothesis, <!--l. 1452--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>h</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mi 
>H</mi></mrow></math>.
Specifically, the algorithm first looks at the choice set <!--l. 1453--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
></mrow></math> and the data <!--l. 1454--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msup><mrow 
><mi 
>z</mi></mrow><mrow 
><mi 
>N</mi></mrow></msup 
></mrow></math> to produce choice <!--l. 1454--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>c</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
> <mo 
class="MathClass-rel">&#x2208;</mo> <msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
></mrow></math>. The choice <!--l. 1454--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>c</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
></mrow></math> then determines the
next choice set <!--l. 1455--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mn>2</mn></mrow></msub 
></mrow></math>
(different initial choices produce different choice sets for the second
level). The algorithm again looks at the data to make some choice <!--l. 1457--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>c</mi></mrow><mrow 
><mn>2</mn></mrow></msub 
> <mo 
class="MathClass-rel">&#x2208;</mo> <msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mn>2</mn></mrow></msub 
></mrow></math>. This choice then determines
the next choice set <!--l. 1458--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mn>3</mn></mrow></msub 
></mrow></math>,
and so on. These choice sets can be thought of as nodes in a <span 
class="ecti-1000">choice</span>
<span 
class="ecti-1000">tree</span>, where each node in the tree corresponds to some internal state
of the learning algorithm, and a node containing some choice set <!--l. 1461--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>C</mi></mrow></math> has branching
factor <!--l. 1461--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>C</mi><mo 
class="MathClass-rel">&#x2223;</mo></mrow></math>.
Pictorially, we can draw the tree as follows:
</p><!--l. 1464--><p class="indent">   <img 
src="thesis6x.gif" alt="PIC" class="graphics" width="748.79749pt" height="311.16249pt"  /><!--tex4ht:graphics  
name="thesis6x.gif" src="thesis-presentation/microchoice_tree.eps"  
-->
</p><!--l. 1466--><p class="indent">   Depending on the learning algorithm, sub-trees of the overall tree may be identical.
We address optimization of the bound for this case later. Eventually there is a final
choice leading to a leaf, and a single hypothesis is output.
</p><!--l. 1470--><p class="indent">   For example, the decision list algorithm of Rivest <span class="cite">[<a 
href="thesisli2.xml#XRivest"><span 
class="ecbx-1000">45</span></a>]</span>, applied to a set of <!--l. 1471--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>n</mi></mrow></math> features, uses the data
to choose one of <!--l. 1471--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mn>4</mn><mi 
>n</mi> <mo 
class="MathClass-bin">+</mo> <mn>2</mn></mrow></math>
rules (e.g., &#x201C;if <!--l. 1472--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>x</mi></mrow><mo 
class="MathClass-op">&#x0304;</mo></mover></mrow><mrow 
><mn>3</mn></mrow></msub 
></mrow></math>
then <!--l. 1472--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mo 
class="MathClass-bin">&#x2212;</mo></mrow></math>&#x201D;)
to put at the top. Based on the choice made, it moves to a choice set of <!--l. 1473--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mn>4</mn><mi 
>n</mi> <mo 
class="MathClass-bin">&#x2212;</mo> <mn>2</mn></mrow></math>
possible rules to put at the next level, then a choice set of size <!--l. 1474--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mn>4</mn><mi 
>n</mi> <mo 
class="MathClass-bin">&#x2212;</mo> <mn>6</mn></mrow></math>,
and so on, until eventually it chooses a rule such as &#x201C;else <!--l. 1475--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mo 
class="MathClass-bin">+</mo></mrow></math>&#x201D;
leading to a leaf.
</p><!--l. 1477--><p class="indent">   The microchoice bound calculation program is as follows:
</p>
   <div class="newtheorem">
<!--l. 1479--><p class="noindent"><span class="head">
<a 
  name="x32-40001r1"></a>
  <span 
class="eccc-1000">A<small 
class="small-caps">L</small><small 
class="small-caps">G</small><small 
class="small-caps">O</small><small 
class="small-caps">R</small><small 
class="small-caps">I</small><small 
class="small-caps">T</small><small 
class="small-caps">H</small><small 
class="small-caps">M</small> </span>5.2.1<span 
class="eccc-1000">.</span></span>
</p><!--l. 1480--><p class="indent">   <span 
class="ecti-1000">Calculate_Microchoice</span>
</p>
   </div>
<!--l. 1482--><p class="indent">
           </p><ol type="1" class="enumerate1" start="1" 
>
        <li class="enumerate"><a 
  name="x32-40003x1"></a>set <!--l. 1483--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>p</mi> <mo 
class="MathClass-rel">&#x2190;</mo> <mn>1</mn></mrow></math>
           </li>
        <li class="enumerate"><a 
  name="x32-40005x2"></a>while the learning algorithm has not halted:
        <!--l. 1486--><p class="indent">
            </p><ol type="a" class="enumerate2" start="1" 
>
            <li class="enumerate"><a 
  name="x32-40007x1"></a><!--l. 1487--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>C</mi><mo 
class="MathClass-rel">&#x2223;</mo><mo 
class="MathClass-rel">&#x2190;</mo></mrow></math>
            number of possible data-dependent choices
            </li>
            <li class="enumerate"><a 
  name="x32-40009x2"></a><!--l. 1488--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>p</mi> <mo 
class="MathClass-rel">&#x2190;</mo> <mfrac><mrow 
><mi 
>p</mi></mrow> 
<mrow 
><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>C</mi><mo 
class="MathClass-rel">&#x2223;</mo></mrow></mfrac></mrow></math></li></ol>
        <!--l. 1489--><p class="nopar">
           </p></li>
        <li class="enumerate"><a 
  name="x32-40011x3"></a>return <!--l. 1490--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>p</mi></mrow></math></li></ol>
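<p class="indent">   As a rough illustration, the calculation above can be sketched in Python. This is a minimal sketch, not code from the thesis: the function names and the convention of passing the recorded choice-set sizes as a list are assumptions made for the example. The sizes 22, 18, 14 are the decision-list sizes 4n+2, 4n-2, 4n-6 from the text, with n = 5 picked arbitrarily.</p>
<pre class="verbatim">
```python
import math
from typing import Iterable


def calculate_microchoice(choice_set_sizes: Iterable[int]) -> float:
    """Return p(h), the product of 1/|C_i| along the path the learner took."""
    p = 1.0                        # step 1: the root holds the whole supply
    for size in choice_set_sizes:  # step 2: one iteration per microchoice
        p /= size                  # step 2b: split the supply evenly
    return p                       # step 3


def log_inverse_measure(choice_set_sizes: Iterable[int]) -> float:
    """Return ln(1/p(h)) as the sum of ln|C_i| along the path."""
    return sum(math.log(size) for size in choice_set_sizes)


# Hypothetical run: decision-list choice sets for n = 5 features.
sizes = [22, 18, 14]
p_h = calculate_microchoice(sizes)         # 1 / (22 * 18 * 14)
log_term = log_inverse_measure(sizes)      # ln(22 * 18 * 14)
```
</pre>
<p class="indent">   On long paths the sum of logarithms is usually the safer quantity to track, since the product itself can underflow.</p>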
<!--l. 1491--><p class="nopar"> Pictorially, this algorithm can be thought of as placing a &#x201C;supply&#x201D; of probability, with total mass 1, at the
root of the choice tree.
</p><!--l. 1496--><p class="noindent"><img 
src="thesis7x.gif" alt="PIC" class="graphics" width="748.79749pt" height="390.45874pt"  /><!--tex4ht:graphics  
name="thesis7x.gif" src="thesis-presentation/microchoice_alg.eps"  
-->
</p><!--l. 1499--><p class="indent">   The root takes its supply and splits it equally among all its children.
Recursively, each child then does the same: it takes the supply it is given and
splits it evenly among its children, until all of the supplied probability is
allocated among the leaves. If we examine some leaf containing a hypothesis <!--l. 1502--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>h</mi></mrow></math>,
we see that this method gives at least probability <!--l. 1503--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo><msubsup><mrow 
> <mo 
class="MathClass-op">&#x220F;</mo>
  </mrow><mrow 
><mi 
>i</mi><mo 
class="MathClass-rel">=</mo><mn>1</mn></mrow><mrow 
><mi 
>d</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></msubsup 
>    <mfrac><mrow 
><mn>1</mn></mrow>
<mrow 
><mo 
class="MathClass-rel">&#x2223;</mo><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
class="MathClass-rel">&#x2223;</mo></mrow></mfrac></mrow></math> to each <!--l. 1504--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>h</mi></mrow></math> for any path of depth <!--l. 1504--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>d</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> reaching the
hypothesis <!--l. 1504--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>h</mi></mrow></math>.
</p><!--l. 1506--><p class="indent">   Note it is possible that several leaves will contain the same hypothesis <!--l. 1506--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>h</mi></mrow></math>,
and in that case one should really add the allocated measures together.
However, the microchoice bound neglects this issue, implying that it will be
unnecessarily loose for learning algorithms which can arrive at the same
hypothesis in multiple ways. The reason for neglecting this is that, with this simplification, <!--l. 1510--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> is
something the learning algorithm itself can calculate by simply keeping track of the sizes
of the choice sets it has encountered so far. It is important to notice that this
construction is defined before observing any data. Consequently, every hypothesis has
some bound associated with it before the data is used to pick a particular hypothesis and
its corresponding bound.
</p><!--l. 1517--><p class="indent">   Another way to view this process is that we cannot know in advance
which choice sequence the algorithm will make. However, a distribution <!--l. 1518--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>D</mi></mrow></math> on labeled
examples induces a probability distribution over choice sequences, inducing a probability distribution <!--l. 1520--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> over hypotheses. Ideally
we would like to use <!--l. 1521--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
in our bounds as noted above. However, we cannot calculate <!--l. 1522--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> (since the distribution <!--l. 1522--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>D</mi></mrow></math> is unknown), so instead,
our choice of <!--l. 1523--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
will be just an estimate. We hope that the algorithm designer has
chosen a &#x201C;good&#x201D; learning algorithm which induces a distribution <!--l. 1524--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> over the final hypotheses
which is near to <!--l. 1525--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>.
Our estimate <!--l. 1525--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
is the probability distribution resulting from picking each choice uniformly at
random from the current choice set at each level (note: this is different from
picking a final hypothesis uniformly at random). That is, it can be viewed as the
measure associated with the assumption that at each step, all choices are equally
likely.
</p><!--l. 1532--><p class="indent">   We immediately find the following theorem:
</p>
   <div class="newtheorem">
<!--l. 1534--><p class="noindent"><span class="head">
<a 
  name="x32-40012r2"></a>
  <span 
class="eccc-1000">T<small 
class="small-caps">H</small><small 
class="small-caps">E</small><small 
class="small-caps">O</small><small 
class="small-caps">R</small><small 
class="small-caps">E</small><small 
class="small-caps">M</small> </span>5.2.2<span 
class="eccc-1000">.</span></span>
</p><!--l. 1535--><p class="indent">   <span 
class="ecti-1000">(Microchoice Bound) For all hypothesis spaces, </span><!--l. 1535--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>H</mi></mrow></math><span 
class="ecti-1000">,</span>
<span 
class="ecti-1000">for all </span><!--l. 1536--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>&#x03B4;</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mrow><mo 
class="MathClass-open">(</mo><mrow><mn>0</mn><mo 
class="MathClass-punc">,</mo><mn>1</mn></mrow><mo 
class="MathClass-close">]</mo></mrow></mrow></math><span 
class="ecti-1000">:</span>
<!--l. 1537--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">    <mrow 
>
               <msub><mrow 
><mo 
>Pr</mo></mrow><mrow 
><msup><mrow 
><mi 
>D</mi></mrow><mrow 
><mi 
>m</mi></mrow></msup 
></mrow></msub 
> <mfenced separators="" 
open="("  close=")" ><mrow><mi 
>&#x2203;</mi><mi 
>h</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mi 
>H</mi> <mo 
class="MathClass-punc">:</mo>  <mi 
>e</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">&#x003E;</mo> <mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo>&#x0304;</mo></mover><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>m</mi><mo 
class="MathClass-punc">,</mo><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo>&#x0302;</mo></mover><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
class="MathClass-punc">,</mo>         <mfrac><mrow 
><mi 
>&#x03B4;</mi></mrow> 
<mrow 
><msubsup><mrow 
><mo 
class="MathClass-op">&#x220F;</mo>
  </mrow><mrow 
><mi 
>i</mi><mo 
class="MathClass-rel">=</mo><mn>1</mn></mrow><mrow 
><mi 
>d</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></msubsup 
><mo 
class="MathClass-rel">&#x2223;</mo><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
class="MathClass-rel">&#x2223;</mo></mrow></mfrac></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfenced> <mo 
class="MathClass-rel">&#x2264;</mo> <mi 
>&#x03B4;</mi>
</mrow></math>
<span 
class="ecti-1000">where </span><!--l. 1539--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0304;</mo></mover> <mfenced separators="" 
open="("  close=")" ><mrow><mi 
>m</mi><mo 
class="MathClass-punc">,</mo> <mfrac><mrow 
><mi 
>k</mi></mrow> 
<mrow 
><mi 
>m</mi></mrow></mfrac><mo 
class="MathClass-punc">,</mo><mi 
>&#x03B4;</mi></mrow></mfenced> <mo 
class="MathClass-rel">&#x2261;</mo><msub><mrow 
><mo 
> max</mo></mrow><mrow 
><mi 
>p</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">{</mo><mrow><mi 
>p</mi> <mo 
class="MathClass-punc">:</mo>  <!--mstyle 
class="text"--><mtext class="textrm">Bin</mtext><!--/mstyle--><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>m</mi><mo 
class="MathClass-punc">,</mo><mi 
>k</mi><mo 
class="MathClass-punc">,</mo><mi 
>p</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <mi 
>&#x03B4;</mi></mrow><mo 
class="MathClass-close">}</mo></mrow></mrow></math>
</p>
   </div>
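<p class="indent">   The bound in the theorem can be evaluated numerically. The sketch below is illustrative only. It assumes that Bin(m, k, p) denotes the probability of observing at most k errors in m independent trials when the true error rate is p; this section does not restate the definition, so that reading is an assumption. Under it, Bin is non-increasing in p, and the largest p with Bin(m, k, p) equal to the confidence argument can be found by bisection. All function names are hypothetical.</p>
<pre class="verbatim">
```python
import math


def bin_tail(m: int, k: int, p: float) -> float:
    """Probability of at most k errors in m trials with error rate p (assumed Bin)."""
    return sum(math.comb(m, j) * p**j * (1.0 - p)**(m - j) for j in range(k + 1))


def error_bound(m: int, k: int, delta: float, iterations: int = 60) -> float:
    """Approximate the largest p with bin_tail(m, k, p) >= delta by bisection."""
    lo, hi = 0.0, 1.0              # bin_tail is 1 at p = 0, so lo always qualifies
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if bin_tail(m, k, mid) >= delta:
            lo = mid               # mid still qualifies, so move the lower end up
        else:
            hi = mid
    return lo


def microchoice_bound(m: int, k: int, delta: float, choice_set_sizes) -> float:
    """Error bound with delta divided by the product of the choice-set sizes."""
    p_h = 1.0
    for size in choice_set_sizes:
        p_h /= size
    return error_bound(m, k, delta * p_h)


# Hypothetical numbers: 100 examples, no training errors, delta = 0.05.
plain = error_bound(100, 0, 0.05)
with_choices = microchoice_bound(100, 0, 0.05, [22, 18, 14])
```
</pre>
<p class="indent">   Dividing the confidence by the product of the choice-set sizes makes the bound looser, and more so for deeper or bushier paths in the choice tree.</p>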
<!--l. 1541--><p class="noindent"><span 
class="ecbx-1000">Proof. </span>Specialization of the Occam&#x2019;s Razor bound (<a 
href="thesisse20.xml#x27-36001r1">4.6.1<!--tex4ht:ref: th-ORB --></a>) to the measure computed by Algorithm 5.2.1.
</p><!--l. 1543--><p class="indent">   Once again, it will be worthwhile to slightly loosen this bound with the following
corollary:
</p>
   <div class="newtheorem">
<!--l. 1546--><p class="noindent"><span class="head">
<a 
  name="x32-40013r3"></a>
  <span 
class="eccc-1000">C<small 
class="small-caps">O</small><small 
class="small-caps">R</small><small 
class="small-caps">O</small><small 
class="small-caps">L</small><small 
class="small-caps">L</small><small 
class="small-caps">A</small><small 
class="small-caps">R</small><small 
class="small-caps">Y</small> </span>5.2.3<span 
class="eccc-1000">.</span></span>
</p><!--l. 1547--><p class="indent">   <span 
class="ecti-1000">(Relative Entropy Microchoice Bound) For all hypothesis spaces, </span><!--l. 1548--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>H</mi></mrow></math><span 
class="ecti-1000">,</span>
<span 
class="ecti-1000">for all </span><!--l. 1548--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">      <mrow 
><mi 
>&#x03B4;</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mrow><mo 
class="MathClass-open">(</mo><mrow><mn>0</mn><mo 
class="MathClass-punc">,</mo><mn>1</mn></mrow><mo 
class="MathClass-close">]</mo></mrow></mrow></math><span 
class="ecti-1000">:</span>
<!--l. 1549--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">    <mrow 
>
          <msub><mrow 
><mo 
>Pr</mo></mrow><mrow 
><msup><mrow 
><mi 
>D</mi></mrow><mrow 
><mi 
>m</mi></mrow></msup 
></mrow></msub 
> <mfenced separators="" 
open="("  close=")" ><mrow><mi 
>&#x2203;</mi><mi 
>h</mi> <mo 
class="MathClass-rel">&#x2208;</mo> <mi 
>H</mi> <mo 
class="MathClass-punc">:</mo>  <!--mstyle 
class="text"--><mtext class="textrm">KL</mtext><!--/mstyle--><mrow><mo 
class="MathClass-open">(</mo><mrow><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo>&#x0302;</mo></mover><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
class="MathClass-rel">&#x2223;</mo><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>e</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">&#x003E;</mo> <mfrac><mrow 
><msubsup><mrow 
><mo 
class="MathClass-op">&#x2211;</mo>
  </mrow><mrow 
><mi 
>i</mi><mo 
class="MathClass-rel">=</mo><mn>1</mn></mrow><mrow 
><mi 
>d</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></msubsup 
><mo 
> ln</mo><!--nolimits--><mo 
class="MathClass-rel">&#x2223;</mo><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
class="MathClass-rel">&#x2223;</mo> <mo 
class="MathClass-bin">+</mo><mo 
> ln</mo><!--nolimits--> <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>&#x03B4;</mi></mrow></mfrac></mrow> 
                <mrow 
><mi 
>m</mi></mrow></mfrac>              </mrow></mfenced> <mo 
class="MathClass-rel">&#x2264;</mo> <mi 
>&#x03B4;</mi>
</mrow></math>
</p>
   </div>
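<p class="indent">   To turn the corollary into a numeric bound on the true error, one can search for the largest error rate whose relative entropy from the empirical error stays within the right-hand side. The following is a minimal sketch under stated assumptions: KL is taken to be the relative entropy between two Bernoulli distributions, natural logarithms are used throughout, and the function names and example numbers are hypothetical.</p>
<pre class="verbatim">
```python
import math


def kl_bernoulli(q: float, p: float) -> float:
    """KL(q || p) between Bernoulli distributions, with 0 ln 0 taken as 0."""
    if p == 0.0 or p == 1.0:
        return 0.0 if q == p else math.inf
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q != 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def kl_microchoice_bound(emp_error: float, m: int, delta: float,
                         choice_set_sizes, iterations: int = 60) -> float:
    """Approximate the largest e >= emp_error with KL(emp_error || e) within budget."""
    log_term = sum(math.log(size) for size in choice_set_sizes)
    budget = (log_term + math.log(1.0 / delta)) / m
    lo, hi = emp_error, 1.0        # KL is 0 at e = emp_error, so lo qualifies
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if budget >= kl_bernoulli(emp_error, mid):
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical numbers: 100 examples, delta = 0.05, three recorded choice sets.
zero_error_bound = kl_microchoice_bound(0.0, 100, 0.05, [22, 18, 14])
some_error_bound = kl_microchoice_bound(0.1, 100, 0.05, [22, 18, 14])
```
</pre>
<p class="indent">   For zero empirical error the constraint reduces to -ln(1 - e) being at most the budget, so the bound is 1 - exp(-budget); the bisection is only needed for nonzero empirical error.</p>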
<!--l. 1553--><p class="indent">   The point of the microchoice bound is that the quantity <!--l. 1553--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0304;</mo></mover><mrow><mo 
class="MathClass-open">(</mo><mrow><mo 
class="MathClass-punc">.</mo><mo 
class="MathClass-punc">.</mo><mo 
class="MathClass-punc">.</mo></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> is
something the algorithm can calculate as it goes along, based on the sizes of the
choice sets encountered. To see this, note that the hypothesis dependent term is <!--l. 1556--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msubsup><mrow 
><mo 
class="MathClass-op">&#x2211;</mo>
  </mrow><mrow 
><mi 
>i</mi><mo 
class="MathClass-rel">=</mo><mn>1</mn></mrow><mrow 
><mi 
>d</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></msubsup 
><mo 
> ln</mo><!--nolimits--><mo 
class="MathClass-rel">&#x2223;</mo><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
class="MathClass-rel">&#x2223;</mo></mrow></math>. The quantity <!--l. 1556--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>d</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> can be calculated
by just noting the number of choices made before the learning algorithm terminates. The choice
sets, <!--l. 1558--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>,
can often be easily deduced by reasoning about the possible microchoices the algorithm
could have made given different datasets.
</p><!--l. 1562--><p class="indent">   In many natural cases, a &#x201C;fortuitous distribution and target concept&#x201D;
corresponds to a shallow leaf or a part of the tree with low branching,
resulting in a better bound. For instance, in the decision list case, <!--l. 1564--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msubsup><mrow 
><mo 
class="MathClass-op">&#x2211;</mo>
  </mrow><mrow 
><mi 
>i</mi><mo 
class="MathClass-rel">=</mo><mn>1</mn></mrow><mrow 
><mi 
>d</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></msubsup 
><mo 
> ln</mo><!--nolimits--><mo 
class="MathClass-rel">&#x2223;</mo><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
class="MathClass-rel">&#x2223;</mo></mrow></math> is roughly <!--l. 1565--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>d</mi><mo 
>ln</mo><!--nolimits--><mi 
>n</mi></mrow></math> where <!--l. 1565--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>d</mi></mrow></math> is the length of the list produced
and <!--l. 1566--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>n</mi></mrow></math> is the number of
features. Notice that <!--l. 1566--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>d</mi><mo 
>ln</mo><!--nolimits--><mi 
>n</mi></mrow></math>
is also the description length of the final hypothesis produced in the natural encoding;
thus, in this case, these theorems yield bounds similar to a simple application of Occam&#x2019;s
razor or SRM.
</p><!--l. 1571--><p class="indent">   More generally, the microchoice bound is similar to Occam&#x2019;s razor or SRM bounds when each <!--l. 1572--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>k</mi></mrow></math>-ary choice in the tree
corresponds to <!--l. 1572--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mo 
>log</mo><!--nolimits--><mi 
>k</mi></mrow></math>
bits in the natural encoding of the final hypothesis <!--l. 1573--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>h</mi></mrow></math>. However,
sometimes this may not be the case. Consider, for instance, a local optimization algorithm in which
there are <!--l. 1575--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>n</mi></mrow></math>
parameters and each step adds or subtracts 1 from one of the parameters. Suppose in
addition the algorithm knows certain constraints that these parameters must
satisfy (perhaps a set of linear inequalities) and the algorithm restricts itself
to choices in the legal region. In this case, the branching factor, at most <!--l. 1579--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mn>2</mn><mi 
>n</mi></mrow></math>,
might become much smaller if we are &#x201C;lucky&#x201D; and head toward a highly constrained
portion of the solution space. One could always reverse-engineer an encoding of
hypotheses based on the choice tree, but the microchoice approach is much more
natural.
</p><!--l. 1584--><p class="indent">   There is also an opportunity to use <span 
class="ecti-1000">a</span>&#x00A0;<span 
class="ecti-1000">priori </span>knowledge in the choice of <!--l. 1585--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>.
In particular, instead of splitting our confidence equally at each node of
the tree, we could split it unevenly, according to some heuristic function <!--l. 1587--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>g</mi></mrow></math>. If <!--l. 1587--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>g</mi></mrow></math>
is &#x201C;good&#x201D; it may produce error bounds similar to the bounds when <!--l. 1588--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo> <mi 
>q</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>. In
fact, the method of Section <a 
href="thesisse23.xml#x33-460005.3">5.3<!--tex4ht:ref: sec:query --></a>, where we combine these results with Freund&#x2019;s
query-tree <span class="cite">[<a 
href="thesisli2.xml#XSB"><span 
class="ecbx-1000">17</span></a>]</span> approach, can be thought of as an attempt to do exactly this.
</p>
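<p class="indent">   The uneven split described above can be sketched as follows. This is only an illustration: the representation of a path as a list of (weights, chosen) pairs and the function name are assumptions made for the example, and the heuristic weights are taken to be fixed before any data is seen, as the construction requires.</p>
<pre class="verbatim">
```python
from typing import Dict, Hashable, List, Tuple

Step = Tuple[Dict[Hashable, float], Hashable]


def weighted_microchoice(path: List[Step]) -> float:
    """Return p(h) when each node splits its supply in proportion to g.

    Each step holds the weights g assigns to the available choices at that
    node and the choice the learner actually made there.
    """
    p = 1.0
    for weights, chosen in path:
        p *= weights[chosen] / sum(weights.values())
    return p


# Hypothetical two-level path: g favours choice "a" three to one at the root.
skewed_path = [({"a": 3.0, "b": 1.0}, "a"), ({"x": 1.0, "y": 1.0}, "y")]
uniform_path = [({"a": 1.0, "b": 1.0}, "a"), ({"x": 1.0, "y": 1.0}, "y")]

p_skewed = weighted_microchoice(skewed_path)    # 0.75 * 0.5
p_uniform = weighted_microchoice(uniform_path)  # 0.5 * 0.5
```
</pre>
<p class="indent">   With equal weights at every node this reduces to the simple microchoice measure. A hypothesis reached through choices that g favours receives more measure and hence a tighter bound, at the expense of hypotheses reached through disfavoured choices.</p>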
   <h4 class="subsectionHead"><span class="titlemark">5.2.1. </span> <a 
  name="x32-410005.2.1"></a>Examples</h4>
<!--l. 1595--><p class="noindent">It is difficult to create a bound which is universally better than previous bounds. The
microchoice bound can be much better than the discrete hypothesis bound (<a 
href="thesisse16.xml#x23-32001r1">4.2.1<!--tex4ht:ref: th-DHSCP --></a>), but it
can also be slightly worse. To develop some understanding of how they compare, we consider
several cases.
</p>
   <h5 class="subsubsectionHead"><span class="titlemark">5.2.1.1. </span> <a 
  name="x32-420005.2.1.1"></a>Greedy Set Cover</h5>
<!--l. 1603--><p class="noindent">Consider a greedy set cover algorithm for learning an OR function over <!--l. 1603--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>F</mi></mrow></math>
Boolean features. The algorithm begins with a choice set of size <!--l. 1604--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>F</mi> <mo 
class="MathClass-bin">+</mo> <mn>1</mn></mrow></math>
(one per feature, plus halt) and chooses the feature that covers the most positive
examples while covering no negative ones. It then moves to a choice space of size <!--l. 1607--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>F</mi></mrow></math>
(one per remaining feature, plus halt) and chooses the best remaining
feature, and so on until it halts. If the number of features chosen is <!--l. 1608--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>k</mi></mrow></math> then
the microchoice bound is:
</p><!--l. 1612--><p class="indent">   <!--l. 1612--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">        <mrow 
>
             <mi 
>&#x03B5;</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">=</mo>  <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>m</mi></mrow></mfrac> <mfenced separators="" 
open="("  close=")" ><mrow><mo 
>ln</mo><!--nolimits--> <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>&#x03B4;</mi></mrow></mfrac> <mo 
class="MathClass-bin">+</mo><msubsup><mrow 
> <mo 
class="MathClass-op">&#x2211;</mo>
  </mrow><mrow 
><mi 
>i</mi><mo 
class="MathClass-rel">=</mo><mn>1</mn></mrow><mrow 
><mi 
>k</mi></mrow></msubsup 
><mo 
> ln</mo><!--nolimits--><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>F</mi> <mo 
class="MathClass-bin">&#x2212;</mo> <mi 
>i</mi> <mo 
class="MathClass-bin">+</mo> <mn>2</mn></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfenced> <mo 
class="MathClass-rel">&#x2264;</mo> <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>m</mi></mrow></mfrac> <mfenced separators="" 
open="("  close=")" ><mrow><mo 
>ln</mo><!--nolimits--> <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>&#x03B4;</mi></mrow></mfrac> <mo 
class="MathClass-bin">+</mo> <mi 
>k</mi><mo 
>ln</mo><!--nolimits--><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>F</mi> <mo 
class="MathClass-bin">+</mo> <mn>1</mn></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfenced>
</mrow></math>
</p><!--l. 1616--><p class="indent">   The discrete hypothesis bound (<a 
href="thesisse16.xml#x23-32001r1">4.2.1<!--tex4ht:ref: th-DHSCP --></a>) is:
</p><!--l. 1619--><p class="indent">   <!--l. 1619--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">        <mrow 
>
                                        <mi 
>&#x03B5;</mi> <mo 
class="MathClass-rel">=</mo>  <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>m</mi></mrow></mfrac> <mfenced separators="" 
open="("  close=")" ><mrow><mo 
>ln</mo><!--nolimits--> <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>&#x03B4;</mi></mrow></mfrac> <mo 
class="MathClass-bin">+</mo> <mi 
>F</mi><mo 
>ln</mo><!--nolimits--><mn>2</mn></mrow></mfenced><mo 
class="MathClass-punc">.</mo>
</mrow></math>
</p><!--l. 1623--><p class="indent">   If <!--l. 1623--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>k</mi></mrow></math>
is small, then the microchoice bound is a lot better, but if <!--l. 1623--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>k</mi> <mo 
class="MathClass-rel">=</mo> <mi 
>O</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>F</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> then the
microchoice bound is slightly worse than the discrete hypothesis bound. Notice that in this case
the microchoice bound is essentially the same as the standard Occam&#x2019;s razor analysis when one
uses <!--l. 1626--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>O</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mo 
>ln</mo><!--nolimits--><mi 
>F</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
bits per feature to describe the hypothesis.
</p>
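<p class="indent">   As a rough numerical illustration of the comparison above, the following Python sketch evaluates both expressions for the greedy set cover example. The function names and the values of F, m and delta used in the demonstration are illustrative assumptions and do not come from the text.
</p>
<pre class="verbatim">
```python
import math


def microchoice_epsilon(F, k, m, delta):
    """Microchoice bound for greedy set cover.

    The i-th choice (i = 1..k) is made from a choice space of size F - i + 2.
    """
    log_choices = sum(math.log(F - i + 2) for i in range(1, k + 1))
    return (math.log(1.0 / delta) + log_choices) / m


def discrete_hypothesis_epsilon(F, m, delta):
    """Discrete hypothesis bound over the 2**F OR functions on F features."""
    return (math.log(1.0 / delta) + F * math.log(2)) / m


if __name__ == "__main__":
    # Illustrative values only (assumptions, not from the text).
    F, m, delta = 100, 10000, 0.05
    baseline = discrete_hypothesis_epsilon(F, m, delta)
    for k in (3, 50, 100):
        print(k, microchoice_epsilon(F, k, m, delta), baseline)
```
</pre>
<p class="indent">   With F = 100, the microchoice value is the smaller of the two for k = 3 and the larger for k = 100, in line with the discussion above.
</p>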
   <h5 class="subsubsectionHead"><span class="titlemark">5.2.1.2. </span> <a 
  name="x32-430005.2.1.2"></a>Decision Trees</h5>
<!--l. 1632--><p class="noindent">Decision trees over discrete sets (say, <!--l. 1632--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msup><mrow 
><mrow><mo 
class="MathClass-open">{</mo><mrow><mn>0</mn><mo 
class="MathClass-punc">,</mo><mn>1</mn></mrow><mo 
class="MathClass-close">}</mo></mrow></mrow><mrow 
><mi 
>F</mi></mrow></msup 
></mrow></math>) are
another natural setting for application of the microchoice bound.
</p><!--l. 1635--><p class="indent">   A decision tree differs from a decision list in that the
available choice set is larger, because there are multiple nodes
where a new test may be applied. In particular, for a decision tree with <!--l. 1637--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>K</mi></mrow></math> leaves at an average
depth of <!--l. 1638--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>d</mi></mrow></math>, the
choice set size is <!--l. 1638--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>K</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>F</mi> <mo 
class="MathClass-bin">&#x2212;</mo> <mi 
>d</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>,
giving a bound noticeably worse than the bound for the decision list. This motivates a slightly
different decision algorithm which considers only one leaf node at a time. The algorithm
adds a new test or decides to never add a new test at this node. In this case, there are <!--l. 1642--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>F</mi> <mo 
class="MathClass-bin">&#x2212;</mo> <mi 
>d</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>v</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-bin">+</mo> <mn>1</mn></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math> choices for a
node <!--l. 1642--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>v</mi></mrow></math> at
depth <!--l. 1643--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>d</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>v</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>,
implying the bound: </p><table class="equation"><tr><td> <a 
  name="x32-43001r1"></a>
<!--l. 1644--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">     
                     <!--mstyle 
class="text"--><mtext class="textrm">KL</mtext><!--/mstyle--><mrow><mo 
class="MathClass-open">(</mo><mrow><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mo 
class="MathClass-rel">&#x2223;</mo><mo 
class="MathClass-rel">&#x2223;</mo><mi 
>e</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-rel">&#x2264;</mo> <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>m</mi></mrow></mfrac> <mfenced separators="" 
open="("  close=")" ><mrow><mo 
>ln</mo><!--nolimits--> <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>&#x03B4;</mi></mrow></mfrac> <mo 
class="MathClass-bin">+</mo><msub><mrow 
> <mo 
class="MathClass-op">&#x2211;</mo>
   </mrow><mrow 
><mi 
>v</mi></mrow></msub 
><mo 
> ln</mo><!--nolimits--><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>F</mi> <mo 
class="MathClass-bin">&#x2212;</mo> <mi 
>d</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>v</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-bin">+</mo> <mn>1</mn></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></mfenced>
</math>
<!--l. 1647--><p class="nopar"></p></td><td class="eq-no">(5.2.1)</td></tr></table>
where <!--l. 1648--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>v</mi></mrow></math>
ranges over the nodes of the decision tree. Once again, this is very similar to what might
be produced by an Occam&#x2019;s razor bound with an appropriate choice of prior. This result
is again sometimes much better than the discrete hypothesis bound and sometimes
slightly worse.
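<p class="indent">   To make the bound (5.2.1) concrete, the sketch below computes the sum over nodes and then inverts the KL divergence numerically to obtain an upper bound on the true error. Representing the tree by a list of node depths, the bisection-based inversion, and all function names and numeric values are assumptions made for illustration only.
</p>
<pre class="verbatim">
```python
import math


def tree_complexity(depths, F):
    """Sum over nodes v of ln(F - d(v) + 1), given the depth d(v) of each node."""
    return sum(math.log(F - d + 1) for d in depths)


def kl_bernoulli(q, p):
    """KL divergence (in nats) between Bernoulli(q) and Bernoulli(p)."""
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if 1 > q:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total


def true_error_upper_bound(emp_err, depths, F, m, delta, tol=1e-10):
    """Approximate the largest e with KL(emp_err || e) within the budget of (5.2.1)."""
    budget = (math.log(1.0 / delta) + tree_complexity(depths, F)) / m
    lo, hi = emp_err, 1.0
    # Bisection: KL(emp_err || e) is increasing in e on [emp_err, 1).
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(emp_err, mid) > budget:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Illustrative tree: a root at depth 0 with two leaves at depth 1.
    print(true_error_upper_bound(0.1, [0, 1, 1], F=10, m=1000, delta=0.05))
```
</pre>
<p class="indent">   Returning the upper end of the final bisection interval keeps the result on the conservative side of the numerical tolerance.
</p>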
   <h4 class="subsectionHead"><span class="titlemark">5.2.2. </span> <a 
  name="x32-440005.2.2"></a>Pruning </h4>
<!--l. 1656--><p class="noindent">Decision tree algorithms for real-world learning problems often have some form of
&#x201C;pruning&#x201D; as in <span class="cite">[<a 
href="thesisli2.xml#XQuinlan"><span 
class="ecbx-1000">44</span></a>]</span> and <span class="cite">[<a 
href="thesisli2.xml#XMingers"><span 
class="ecbx-1000">41</span></a>]</span>. The tree is first grown to full size, producing a
hypothesis with minimum empirical error. Then the tree is &#x201C;pruned&#x201D;, starting at the
leaves and progressing up through the tree toward the root node, using some
test for the significance of an internal node. An internal node is not significant
if the reduction in total error is small in comparison to the complexity of its
children. Insignificant internal nodes are replaced with a leaf, resulting in a smaller
tree.
</p><!--l. 1665--><p class="indent">   Microchoice bounds have the property that they incidentally prove a
bound for every decision tree which can be found by pruning internal
nodes. In particular, one of the choices available when constructing a
node is to make the node a leaf. Therefore, if we begin with the tree <!--l. 1668--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>T</mi></mrow></math> and then prune to the smaller
tree <!--l. 1669--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><msup><mrow 
><mi 
>T</mi></mrow><mrow 
><mi 
>&#x2032;</mi></mrow></msup 
></mrow></math>, we can apply the
bound (<a 
href="#x32-43001r1">5.2.1<!--tex4ht:ref: eqn:dectreemc --></a>) to <!--l. 1669--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><msup><mrow 
><mi 
>T</mi></mrow><mrow 
><mi 
>&#x2032;</mi></mrow></msup 
></mrow></math> <span 
class="ecti-1000">as if </span>the
algorithm had constructed <!--l. 1670--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><msup><mrow 
><mi 
>T</mi></mrow><mrow 
><mi 
>&#x2032;</mi></mrow></msup 
></mrow></math>
directly rather than having gone first through the tree <!--l. 1671--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>T</mi></mrow></math>. This suggests
another possible pruning criterion: prune a node if the pruning would result in an improved
microchoice bound. That is, prune if the increase in empirical error is less than the decrease
in <!--l. 1673--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>&#x03B5;</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>.
This pruning criterion is a &#x201C;pessimistic criterion&#x201D; <span class="cite">[<a 
href="thesisli2.xml#XYishay"><span 
class="ecbx-1000">38</span></a>]</span>.
</p><!--l. 1676--><p class="indent">   The similarities to structural risk minimization (SRM) are discussed next.
</p>
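<p class="indent">   The pruning criterion of this subsection can be sketched in Python as follows. The sketch assumes the per-node choice count F - d(v) + 1 from the decision tree example, a hypothetical Node class that records the number of training errors a node would make as a leaf, and a bottom-up traversal; none of these details are fixed by the text.
</p>
<pre class="verbatim">
```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    leaf_errors: int  # training errors if this node were turned into a leaf
    children: list = field(default_factory=list)  # empty for a leaf


def prune(node, F, depth=0):
    """Prune bottom-up; return (training errors, complexity in nats) of the result."""
    own = math.log(F - depth + 1)  # cost of the single choice made at this node
    if not node.children:
        return node.leaf_errors, own
    errors, complexity = 0, own
    for child in node.children:
        child_errors, child_complexity = prune(child, F, depth + 1)
        errors += child_errors
        complexity += child_complexity
    # Empirical error and epsilon(h) both carry a factor 1/m, so it cancels:
    # prune when the drop in complexity exceeds the rise in the error count.
    if complexity - own > node.leaf_errors - errors:
        node.children = []
        return node.leaf_errors, own
    return errors, complexity


if __name__ == "__main__":
    # Illustrative tree: the split saves only one training error, so it is pruned.
    tree = Node(5, [Node(2), Node(2)])
    print(prune(tree, F=10), len(tree.children))
```
</pre>
<p class="indent">   Because the term involving the confidence parameter is the same before and after pruning, only the sum over nodes and the error counts enter the comparison.
</p>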
   <h4 class="subsectionHead"><span class="titlemark">5.2.3. </span> <a 
  name="x32-450005.2.3"></a>Microchoice and Structural Risk Minimization</h4>
<!--l. 1681--><p class="noindent">The microchoice bound is essentially a compelling application of the Disjoint SRM bound (<a 
href="thesisse19.xml#x26-35001r1">4.5.1<!--tex4ht:ref: th-SRM --></a>),
where the description language for a hypothesis is the sequence of data-dependent choices
which the algorithm makes in the process of deciding upon the hypothesis. The hypothesis
set <!--l. 1684--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><msub><mrow 
><mi 
>H</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
></mrow></math>
is all hypotheses with the same description length in this language.
</p><!--l. 1687--><p class="indent">   As an example, consider a binary decision tree with <!--l. 1687--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>F</mi></mrow></math>
Boolean features and a Boolean label. The first hypothesis set, <!--l. 1688--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>H</mi></mrow><mrow 
><mn>1</mn></mrow></msub 
></mrow></math> will consist
of <!--l. 1689--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mn>2</mn></mrow></math>
hypotheses: always false and always true. In general, we will have one hypothesis set for every
legal configuration of internal nodes. The size of a hypothesis set where every tree contains <!--l. 1691--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>k</mi></mrow></math> internal nodes will be <!--l. 1692--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msup><mrow 
><mn>2</mn></mrow><mrow 
><mi 
>k</mi><mo 
class="MathClass-bin">+</mo><mn>1</mn></mrow></msup 
></mrow></math> because there are <!--l. 1692--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>k</mi> <mo 
class="MathClass-bin">+</mo> <mn>1</mn></mrow></math> leaves each of which
can take <!--l. 1692--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mn>2</mn></mrow></math> values.
The weighting <!--l. 1693--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>p</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>i</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
across the different hypothesis sets is defined by the microchoice allocation of
confidence.
</p><!--l. 1696--><p class="indent">   The principal disadvantage of the microchoice bound is that the sequence
of data-dependent choices may contain redundancy. A different SRM bound
with a different set of disjoint hypothesis sets might be able to better avoid
redundancy. As an example, assume that we are working with a decision tree on <!--l. 1699--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>F</mi></mrow></math> binary features.
There are <!--l. 1700--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>F</mi> <mo 
class="MathClass-bin">+</mo> <mn>2</mn></mrow></math>
choices (any of <!--l. 1700--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>F</mi></mrow></math>
features or <!--l. 1700--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mn>2</mn></mrow></math>
labels) at the top node. At the next node down there will be <!--l. 1701--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>F</mi> <mo 
class="MathClass-bin">+</mo> <mn>1</mn></mrow></math> choices at
each of the left and right children. This repeats until a maximal decision tree is constructed. There will
be <!--l. 1703--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><msubsup><mrow 
><mo 
class="MathClass-op">&#x220F;</mo>
 </mrow><mrow 
><mi 
>i</mi><mo 
class="MathClass-rel">=</mo><mn>0</mn></mrow><mrow 
><mi 
>F</mi></mrow></msubsup 
><msup><mrow 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>F</mi> <mo 
class="MathClass-bin">&#x2212;</mo> <mi 
>i</mi> <mo 
class="MathClass-bin">+</mo> <mn>2</mn></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow><mrow 
><msup><mrow 
><mn>2</mn></mrow><mrow 
><mi 
>i</mi></mrow></msup 
>
   </mrow></msup 
></mrow></math>
possible trees. This number is somewhat larger than the number of Boolean functions on <!--l. 1704--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>F</mi></mrow></math> features: <!--l. 1705--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msup><mrow 
><mn>2</mn></mrow><mrow 
><msup><mrow 
><mn>2</mn></mrow><mrow 
><mi 
>F</mi></mrow></msup 
>
   </mrow></msup 
></mrow></math>.
</p>
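<p class="indent">   The redundancy described above can be checked numerically for small F. The following Python sketch evaluates the product over depths i = 0, ..., F and compares it with the number of Boolean functions on F features; the function names and the range of F are illustrative assumptions.
</p>
<pre class="verbatim">
```python
def microchoice_tree_count(F):
    """Product over depths i = 0..F of (F - i + 2) ** (2 ** i)."""
    count = 1
    for i in range(F + 1):
        # 2 ** i nodes at depth i, each with F - i + 2 available choices.
        count *= (F - i + 2) ** (2 ** i)
    return count


def boolean_function_count(F):
    """Number of distinct Boolean functions on F binary features."""
    return 2 ** (2 ** F)


if __name__ == "__main__":
    for F in range(1, 5):
        print(F, microchoice_tree_count(F), boolean_function_count(F))
```
</pre>
<p class="indent">   For F = 2, for instance, the product is 4 times 9 times 16, that is 576 sequences of choices, against 16 Boolean functions.
</p>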
   <div class="crosslinks"><p class="noindent">[<a 
href="thesisse23.xml" >next</a>] [<a 
href="thesisse21.xml" >prev</a>] [<a 
href="thesisse21.xml#tailthesisse21.xml" >prev-tail</a>] [<a 
href="thesisse22.xml" >front</a>] [<a 
href="thesisch5.xml#thesisse22.xml" >up</a>] </p></div><a 
  name="tailthesisse22.xml"></a>   
</body> 
</html> 
