<?xml version="1.0"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "mathml.dtd"> 
<?xml-stylesheet type="text/css" href="thesis.css"?> 
<html  
xmlns="http://www.w3.org/1999/xhtml"  
><head><title>12.1 The Decision Tree Learning Algorithm</title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> 
<meta name="generator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<meta name="originator" content="TeX4ht (http://www.cis.ohio-state.edu/~gurari/TeX4ht/mn.html)" /> 
<!-- 3,early_,early^,xhtml,mozilla --> 
<meta name="src" content="thesis.tex" /> 
<meta name="date" content="2002-08-28 13:56:00" /> 
<link rel="stylesheet" type="text/css" href="thesis.css" /> 
</head><body 
>
   <h3 class="sectionHead"><span class="titlemark">12.1. </span> <a 
  name="x74-10000012.1"></a>The Decision Tree Learning Algorithm</h3>
                                                                     

                                                                     
   <hr class="figure" /><div align="center" class="figure" 
><table class="figure"><tr class="figure"><td class="figure" 
>
                                                                     

                                                                     
<a 
  name="x74-1000011"></a>
<!--l. 4466--><p class="indent">
                                                                     

                                                                     
</p><!--l. 4466--><p class="noindent"><img 
src="thesis21x.gif" alt="Plot of learning problem sizes in number of examples" class="graphics" width="252.945pt" height="361.34999pt"  /><!--tex4ht:graphics  
name="thesis21x.gif" src="count.ps"  
-->
<br /> </p><div align="center" class="caption"><table class="caption" 
><tr valign="baseline" class="caption"><td class="id">Figure&#x00A0;12.1.1: </td><td  
class="content"><a 
  name="x74-1000011"></a> The sizes of the learning problems, measured in
number of examples. The problems range over three orders of magnitude
in size.</td></tr></table></div><!--tex4ht:label?: x74-1000011 -->
                                                                     

                                                                     
   </td></tr></table></div><hr class="endfigure" />
<!--l. 4475--><p class="indent">   The bounds are applied to the results of decision trees learned on UCI database
problems (see figure  <a 
href="#x74-1000011">12.1.1<!--tex4ht:ref: fig-problem-size --></a>). The decision tree algorithm is a variant of ID3
that works in two phases: a tree of minimum empirical error is found and
then pruned with a criterion that arises naturally from the Microchoice bound.
The first phase works in a recursive fashion by maximizing the information
gain across all splitting criteria. It is assumed that each example consists of <!--l. 4481--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>n</mi></mrow></math>
features each of which takes a small number of discrete values.
The splitting criteria at each internal node are of the form &#x201C;feature <!--l. 4483--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>f</mi></mrow></math> has value
<!--l. 4483--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">     <mrow 
><mi 
>v</mi></mrow></math>&#x201D;
which implies that a binary tree is produced. Each leaf of the tree is labeled with the
label most common among the training examples that reach the
leaf. The pruning pass minimizes the upper
bound on true error implied by the Microchoice bound. Pruning starts with
the internal nodes closest to the leaves and proceeds up the tree toward the
root node. This pruning criterion falls in the category of pessimistic criteria
<span class="cite">[<a 
href="thesisli2.xml#XYishay"><span 
class="ecbx-1000">38</span></a>]</span>. The implementation has been computationally optimized by
careful caching of information. The examples are tokenized into integers for fast
comparison, and the analysis of splitting criteria at each node loops over the
examples only once. Sub-calculations of the Microchoice bound are cached for use in
pruning.
</p>
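<p class="indent">   The split selection used in the first phase can be sketched as follows. This is a minimal, unoptimized illustration and not the thesis implementation: the function names <code>entropy</code>, <code>information_gain</code> and <code>best_split</code>, the tuple encoding of examples, and the toy data are assumptions made for the example, and the code rescans the examples for every candidate test rather than looping over them once.
</p>

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(examples, labels, feature, value):
    """Gain of the binary test 'feature has value'."""
    yes = [y for x, y in zip(examples, labels) if x[feature] == value]
    no = [y for x, y in zip(examples, labels) if x[feature] != value]
    total = len(labels)
    remainder = ((len(yes) / total) * entropy(yes)
                 + (len(no) / total) * entropy(no))
    return entropy(labels) - remainder


def best_split(examples, labels):
    """Return the (feature, value) test with maximal information gain."""
    tests = {(f, x[f]) for x in examples for f in range(len(x))}
    # sorted() makes tie-breaking deterministic: the smallest test wins.
    return max(sorted(tests),
               key=lambda t: information_gain(examples, labels, *t))


# Toy data (hypothetical): feature 0 determines the label, feature 1 is noise.
examples = [(0, 1), (0, 0), (1, 1), (1, 0)]
labels = ["a", "a", "b", "b"]
split = best_split(examples, labels)
```

<p class="indent">   On the toy data the chosen test is on feature 0, which separates the two labels perfectly, while any test on feature 1 has zero gain.
</p>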
   <h4 class="subsectionHead"><span class="titlemark">12.1.1. </span> <a 
  name="x74-10100012.1.1"></a>Pruning</h4>
<!--l. 4498--><p class="noindent">The proof of the Microchoice bound works by assigning a uniform weight to each choice
in a choice space. Since the choice space of every node includes the choice of
making a leaf, the Microchoice bound incidentally also proves a tighter bound for
every pruning of the output tree. We choose to prune when pruning reduces the
upper bound on the generalization error. We prune starting with internal nodes
nearest to the leaves and working toward the root node. As will be shown by the
experiments, other bounds are sometimes tighter than the Microchoice bound.
This implies that the Microchoice bound is not always the optimal pruning
criterion. However, the Microchoice bound is fast to calculate, so it is used here.
In practice, it may be desirable to prune according to the criterion &#x201C;minimize <!--l. 4508--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>&#x03B1;</mi><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover></mrow><mrow 
><mi 
>S</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow> <mo 
class="MathClass-bin">+</mo> <mrow><mo 
class="MathClass-open">(</mo><mrow><mn>1</mn> <mo 
class="MathClass-bin">&#x2212;</mo> <mi 
>&#x03B1;</mi></mrow><mo 
class="MathClass-close">)</mo></mrow><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0304;</mo></mover><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>&#x201D; where <!--l. 4508--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0304;</mo></mover><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>
is some high probability upper bound on the true error <!--l. 4509--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>e</mi></mrow><mrow 
><mi 
>D</mi></mrow></msub 
><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>h</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>. We use <!--l. 4510--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>&#x03B1;</mi> <mo 
class="MathClass-rel">=</mo> <mn>0</mn></mrow></math> with
the upper bound supplied by the Microchoice bound, so the criterion reduces to minimizing that bound.
                                                                     

                                                                     
</p>
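<p class="indent">   The weighted pruning criterion of this section can be illustrated with a small sketch. The function names and the candidate prunings with their error and bound values are invented for the example; in particular the bound values are placeholders and are not computed from the Microchoice bound.
</p>

```python
def pruning_objective(emp_err, upper_bound, alpha=0.0):
    """alpha * empirical error + (1 - alpha) * upper bound on true error."""
    return alpha * emp_err + (1.0 - alpha) * upper_bound


def select_pruning(candidates, alpha=0.0):
    """Pick the candidate (name, emp_err, upper_bound) minimizing the objective."""
    return min(candidates,
               key=lambda c: pruning_objective(c[1], c[2], alpha))


# Hypothetical candidates: (name, empirical error, upper bound on true error).
candidates = [
    ("full tree", 0.02, 0.40),
    ("pruned once", 0.05, 0.31),
    ("single leaf", 0.30, 0.36),
]

by_bound = select_pruning(candidates, alpha=0.0)   # bound only
by_error = select_pruning(candidates, alpha=1.0)   # empirical error only
```

<p class="indent">   With alpha set to 0 the selection depends on the bound alone, which is the setting used here; with alpha set to 1 it ignores the bound and keeps the tree of lowest empirical error.
</p>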
   <h4 class="subsectionHead"><span class="titlemark">12.1.2. </span> <a 
  name="x74-10200012.1.2"></a>Uniform Sampling from Decision Trees</h4>
<!--l. 4515--><p class="noindent">To use the Sampling Shell bound with Structural Risk Minimization, we need to sample uniformly
from the set of all decision trees of a particular size. The number of binary tree structures of size <!--l. 4517--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>i</mi></mrow></math> is the <!--l. 4517--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>i</mi></mrow></math>th Catalan number, <!--l. 4518--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
> <mo 
class="MathClass-rel">=</mo>   <mfrac><mrow 
><mn>1</mn></mrow> 
<mrow 
><mi 
>i</mi><mo 
class="MathClass-bin">+</mo><mn>1</mn></mrow></mfrac><mfenced separators="" 
open="(" close=")"><mfrac linethickness="0.0pt"><mrow> <mn>2</mn><mi 
>i</mi></mrow> 
 <mrow><mi 
>i</mi></mrow></mfrac></mfenced> </mrow></math>. This implies there are <!--l. 4518--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
></mrow></math> different tree
structures with <!--l. 4519--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>i</mi></mrow></math>
internal nodes. Sampling uniformly from the set of all tree structures with <!--l. 4520--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>i</mi></mrow></math> internal
nodes can then be done with a recursive process. For a particular node in a tree, assume that <!--l. 4521--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>j</mi></mrow></math> internal nodes need to
be created. If <!--l. 4522--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>j</mi> <mo 
class="MathClass-rel">=</mo> <mn>0</mn></mrow></math>, we
have a leaf. For <!--l. 4522--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>j</mi> <mo 
class="MathClass-rel">&#x2265;</mo> <mn>1</mn></mrow></math>,
we can construct a distribution over the number of internal nodes that fall in the left sub-tree by
calculating <!--l. 4524--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mn>0</mn></mrow></msub 
><mo 
class="MathClass-punc">,</mo><mo 
class="MathClass-op">&#x2026;</mo><mo 
class="MathClass-punc">,</mo><msub><mrow 
><mi 
>C</mi></mrow><mrow 
><mi 
>j</mi><mo 
class="MathClass-bin">&#x2212;</mo><mn>1</mn></mrow></msub 
></mrow></math>
and normalizing. Draw from this distribution to get <!--l. 4525--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>j</mi></mrow><mrow 
><mi 
>l</mi></mrow></msub 
></mrow></math>:
the number of internal nodes in the left sub-tree. Let <!--l. 4526--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>j</mi></mrow><mrow 
><mi 
>r</mi></mrow></msub 
> <mo 
class="MathClass-rel">=</mo> <mi 
>j</mi> <mo 
class="MathClass-bin">&#x2212;</mo> <msub><mrow 
><mi 
>j</mi></mrow><mrow 
><mi 
>l</mi></mrow></msub 
> <mo 
class="MathClass-bin">&#x2212;</mo> <mn>1</mn></mrow></math> be the
number of internal nodes in the right sub-tree and recurse. This construction is not yet a
decision tree because no tests have been placed in the nodes. At each internal node we make
a further uniform pick from the set of available tests, allowing only tests on features
with at least two different feature values remaining after the tests made by parent nodes. This
approach is not quite uniform over the set of decision trees because when there are <!--l. 4532--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>i</mi></mrow></math> internal nodes
and <!--l. 4532--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>F</mi> <mo 
class="MathClass-rel">&#x003C;</mo> <mi 
>i</mi></mrow></math>
Boolean features, some trees have a depth greater than <!--l. 4533--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>F</mi></mrow></math> which
is not possible in a binary decision tree. In practice this makes no difference
because the decision trees with too large a depth are an exponentially
small fraction of the total number of trees. When running the algorithm we
never encountered a sampled binary tree structure with depth greater than <!--l. 4537--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>F</mi></mrow></math>.
Each leaf is given a label picked uniformly from the set of labels. A particular label implies
some empirical error count among the examples that reach the leaf. By adding the leaf
error counts together we find the empirical error of a random hypothesis from the
hypothesis set.
                                                                     

                                                                     
</p>
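<p class="indent">   The recursive sampling of tree structures can be sketched as below. The names <code>catalan</code>, <code>sample_shape</code> and <code>internal_nodes</code> and the nested-pair representation are assumptions for the example. One further assumption deserves emphasis: the sketch weights a split with k internal nodes on the left by the product C_k * C_(j-1-k), which is the number of shapes with that split and therefore gives an exactly uniform draw over shapes. The text only states that the Catalan numbers are calculated and normalized, so this weighting is a reading of the text rather than a confirmed detail. Placing tests in nodes and labels on leaves is omitted.
</p>

```python
import math
import random


def catalan(i):
    """i-th Catalan number: binom(2i, i) / (i + 1)."""
    return math.comb(2 * i, i) // (i + 1)


def sample_shape(j, rng):
    """Sample a binary tree shape with j internal nodes.

    A leaf is None; an internal node is a (left, right) pair.
    """
    if j == 0:
        return None
    # Assumed weighting: C_k * C_(j-1-k) shapes put k internal nodes on the left.
    weights = [catalan(k) * catalan(j - 1 - k) for k in range(j)]
    j_left = rng.choices(range(j), weights=weights)[0]
    j_right = j - j_left - 1
    return (sample_shape(j_left, rng), sample_shape(j_right, rng))


def internal_nodes(shape):
    """Count the internal nodes of a sampled shape."""
    if shape is None:
        return 0
    return 1 + internal_nodes(shape[0]) + internal_nodes(shape[1])


rng = random.Random(0)          # fixed seed only for reproducibility
shape = sample_shape(5, rng)
```

<p class="indent">   The weights for j internal nodes sum to C_j, which is the usual Catalan recurrence and is why normalizing them yields a proper distribution over the size of the left sub-tree.
</p>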
   <h4 class="subsectionHead"><span class="titlemark">12.1.3. </span> <a 
  name="x74-10300012.1.3"></a>Fast Sampling</h4>
<!--l. 4548--><p class="noindent">In order for the Sampling Shell bound to be a significant improvement we need the number of
samples <!--l. 4549--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>l</mi></mrow></math>
to satisfy <!--l. 4549--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>l</mi> <mo 
class="MathClass-rel">=</mo> <mi 
>O</mi><mrow><mo 
class="MathClass-open">(</mo><mrow><mi 
>m</mi></mrow><mo 
class="MathClass-close">)</mo></mrow></mrow></math>.
This is not tractable using the above sampling technique, so we &#x201C;cheated&#x201D;: at some critical sub-tree
size <!--l. 4551--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><mi 
>s</mi></mrow></math>
we switch from sampling to exact enumeration. (The exact value of <!--l. 4552--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>s</mi></mrow></math> is discussed
later.) Exact enumeration produces a multiset with elements corresponding to an error rate <!--l. 4553--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover></mrow><mrow 
><mi 
>i</mi></mrow></msub 
></mrow></math> in the sub-tree
and a count <!--l. 4554--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">        <mrow 
><msub><mrow 
><mi 
>c</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
></mrow></math>
associated with each element. We recurse up the tree with the entire error multiset
rather than just one error. At an internal node we have a multiset of errors from
the left child and from the right child. The multiset of errors at an internal
node is the cross product of all errors in the left child and right child because
the choice of left sub-tree is independent of the choice of right sub-tree. Let <!--l. 4559--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><msub><mrow 
><mi 
>S</mi></mrow><mrow 
><mi 
>l</mi></mrow></msub 
></mrow></math> be the multiset of errors
in the left child and <!--l. 4560--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><msub><mrow 
><mi 
>S</mi></mrow><mrow 
><mi 
>r</mi></mrow></msub 
></mrow></math>
be the multiset of errors in the right child. The multiset of errors for the internal node is then
given by: <!--l. 4562--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="display">        <mrow 
>
                             <mfenced separators="" 
open="{"  close="}" ><mrow><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover></mrow><mrow 
><mi 
>i</mi></mrow></msub 
> <mo 
class="MathClass-bin">+</mo><msub><mrow 
> <mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover></mrow><mrow 
><mi 
>j</mi></mrow></msub 
><mo 
class="MathClass-punc">,</mo><msub><mrow 
><mi 
>c</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
> <mo 
class="MathClass-bin">&#x2217;</mo> <msub><mrow 
><mi 
>c</mi></mrow><mrow 
><mi 
>j</mi></mrow></msub 
> <mo 
class="MathClass-punc">:</mo><msub><mrow 
> <mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover></mrow><mrow 
><mi 
>i</mi></mrow></msub 
><mo 
class="MathClass-punc">,</mo><msub><mrow 
><mi 
>c</mi></mrow><mrow 
><mi 
>i</mi></mrow></msub 
> <mo 
class="MathClass-rel">&#x2208;</mo> <msub><mrow 
><mi 
>S</mi></mrow><mrow 
><mi 
>l</mi></mrow></msub 
><mo 
class="MathClass-punc">,</mo><msub><mrow 
><mover 
accent="true"><mrow 
><mi 
>e</mi></mrow><mo 
class="MathClass-op">&#x0302;</mo></mover></mrow><mrow 
><mi 
>j</mi></mrow></msub 
><mo 
class="MathClass-punc">,</mo><msub><mrow 
><mi 
>c</mi></mrow><mrow 
><mi 
>j</mi></mrow></msub 
> <mo 
class="MathClass-rel">&#x2208;</mo> <msub><mrow 
><mi 
>S</mi></mrow><mrow 
><mi 
>r</mi></mrow></msub 
></mrow></mfenced>
</mrow></math>
The multiset produced is passed recursively to the parent and each element of the set of
errors at the root node is considered an &#x201C;independent&#x201D; sample for the purposes of
applying the Sampling Shell bound.
</p><!--l. 4568--><p class="indent">   The power of this technique comes from the fact that the number of hypotheses sampled
is exponential in the number of leaves. However, we never need to represent
the exponentially many different hypotheses explicitly because there are only <!--l. 4571--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>m</mi> <mo 
class="MathClass-bin">+</mo> <mn>1</mn></mrow></math>
possible empirical errors. Using multisets, we achieve sampling time similar to the simple
uniform sampling approach of the previous section but with exponentially more
(dependent) samples. The drawback of this approach is that the samples are no
longer independent, so it is not &#x201C;fair&#x201D; in a theoretical sense to use them as independent
samples. We pretend each fast sample is independent anyway and apply the
Sampling Shell bound. It is worth noting that the samples returned by this
technique are not biased and sometimes have lower variance than independent
samples.
</p><!--l. 4580--><p class="indent">   The choice of the critical sub-tree size <!--l. 4580--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>s</mi></mrow></math> is made in an anytime
fashion. The time <!--l. 4581--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">       <mrow 
><mi 
>t</mi></mrow></math>
required for learning the decision tree is first measured. Then, starting with a size of <!--l. 4582--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>s</mi> <mo 
class="MathClass-rel">=</mo> <mn>0</mn></mrow></math>, we
sample from the decision tree, increasing the size by one after each sample. When more than
<!--l. 4583--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">     <mrow 
><mi 
>t</mi><mo 
class="MathClass-bin">/</mo><mn>1</mn><mn>0</mn></mrow></math>
time has been spent on sampling, we cease incrementing <!--l. 4584--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>s</mi></mrow></math>. If <!--l. 4584--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>s</mi></mrow></math> is the
size of the tree, then we stop and apply the exact Shell bound. Otherwise, we decrement <!--l. 4586--><math 
xmlns="http://www.w3.org/1998/Math/MathML" 
mode="inline">
<mrow 
><mi 
>s</mi></mrow></math> and
sample repeatedly until as much time has been spent on sampling as was spent on
learning the decision tree. The union of all the multisets is returned and the Sampling
Shell bound is used.
</p><!--l. 4591--><p class="indent">
                                                                     

                                                                     
</p>
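<p class="indent">   The rule for combining the error multisets of the left and right children at an internal node can be sketched as follows. The function name <code>combine</code>, the use of a dictionary from error count to multiplicity, and the example multisets are assumptions for illustration. Entries with equal summed error are merged by adding their counts, which is one way to exploit the fact that only m + 1 distinct empirical errors are possible.
</p>

```python
from collections import Counter


def combine(left, right):
    """Cross product of two error multisets stored as {error: count}.

    Every pair of a left entry and a right entry contributes the summed
    error with the product of the two counts.
    """
    out = Counter()
    for e_i, c_i in left.items():
        for e_j, c_j in right.items():
            out[e_i + e_j] += c_i * c_j
    return out


# Hypothetical child multisets: error count mapped to number of sub-trees.
left = Counter({0: 1, 2: 3})
right = Counter({1: 2, 3: 1})
root = combine(left, right)
```

<p class="indent">   The total count at the parent is the product of the total counts of the children, reflecting that the choice of left sub-tree is independent of the choice of right sub-tree.
</p>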
   <a 
  name="tailthesisse53.xml"></a>   
</body> 
</html> 
