(a) Does animal have feathers?
    Yes: Lays eggs (raven, albatross, ostrich)
    No:  Is animal warm blooded?
         Yes: Does not lay eggs (koala, dolphin)
         No:  Lays eggs (crocodile)

(b) Does animal have fur?
    Yes: Does not lay eggs (koala)
    No:  Does animal have feathers?
         Yes: Lays eggs (raven, albatross, ostrich)
         No:  Is animal warm blooded?
              Yes: Does not lay eggs (dolphin)
              No:  Lays eggs (crocodile)
Animal      Warm blooded   Feathers   Fur   Swims   Lays eggs
Ostrich     Yes            Yes        No    No      Yes
Crocodile   No             No         No    Yes     Yes
Raven       Yes            Yes        No    No      Yes
Albatross   Yes            Yes        No    No      Yes
Dolphin     Yes            No         No    Yes     No
Koala       Yes            No         Yes   No      No
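
For use in the short sketches that follow, the sample data can be transcribed into Python. This is only a convenience for the reader; the structure and names are my own, not from the article's listings.

# The animal samples from the table above, one dict per animal.
# Attribute values are booleans; "lays_eggs" is the outcome to be predicted.
ANIMALS = [
    {"name": "Ostrich",   "warm_blooded": True,  "feathers": True,  "fur": False, "swims": False, "lays_eggs": True},
    {"name": "Crocodile", "warm_blooded": False, "feathers": False, "fur": False, "swims": True,  "lays_eggs": True},
    {"name": "Raven",     "warm_blooded": True,  "feathers": True,  "fur": False, "swims": False, "lays_eggs": True},
    {"name": "Albatross", "warm_blooded": True,  "feathers": True,  "fur": False, "swims": False, "lays_eggs": True},
    {"name": "Dolphin",   "warm_blooded": True,  "feathers": False, "fur": False, "swims": True,  "lays_eggs": False},
    {"name": "Koala",     "warm_blooded": True,  "feathers": False, "fur": True,  "swims": False, "lays_eggs": False},
]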
The best attribute is the one with the lowest "negentropy," as defined in Figure 2.
The ID3 algorithm picks the attribute with the lowest total negentropy.
Figure 2: Definition of negentropy. p(ON) and p(OFF) are the measured probabilities of the two outcomes within a partition.

If p(ON) and p(OFF) are both nonzero:
    NE = -p(ON) * log2 p(ON) - p(OFF) * log2 p(OFF)
Otherwise:
    NE = 0
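
A direct transcription of Figure 2 into Python might look like the following minimal sketch (the function name is mine, not from the article's listings):

import math

def negentropy(p_on: float, p_off: float) -> float:
    """Negentropy as defined in Figure 2, with the convention that the
    value is 0 when either probability is 0."""
    if p_on == 0.0 or p_off == 0.0:
        return 0.0
    return -p_on * math.log2(p_on) - p_off * math.log2(p_off)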
For example, there are five animals in the sample that are warm blooded, of which three lay eggs.
The negentropy of the "on" (true) partition is:

    NE_on = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.970951.
There is one animal in the sample that is not warm blooded but does lay eggs.
The negentropy of the "off" (false) partition is:

    NE_off = -(1/1) log2(1) - 0 = 0.
The combined negentropy is the weighted sum:

    (5/(5+1)) * NE_on + (1/(5+1)) * NE_off = 0.809125.
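
The same arithmetic, checked in Python from the counts quoted above (a small sketch):

import math

# Warm-blooded partition: 5 animals, 3 of which lay eggs.
ne_on = -(3/5) * math.log2(3/5) - (2/5) * math.log2(2/5)   # 0.970951
# Not warm-blooded partition: 1 animal, which lays eggs, so it is pure
# and Figure 2 gives 0.
ne_off = 0.0
# Weight each partition by its share of the six samples.
combined = (5/6) * ne_on + (1/6) * ne_off
print(round(ne_on, 6), ne_off, round(combined, 6))          # 0.970951 0.0 0.809125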
Table 2 shows the negentropies for all of the attributes:
Attribute       On_ctr   Hits   Off_ctr   Hits   Negentropy
Warm blooded       5       3        1       1     0.809125
Feathers           3       3        3       2     0.459148
Fur                1       0        5       4     0.601607
Swims              2       1        4       3     0.874185
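
The figures in Table 2 can be reproduced with a short loop over the attributes. The following is a self-contained sketch (it repeats the sample data and the Figure 2 formula so that it runs on its own; the variable names are mine):

import math

# (warm_blooded, feathers, fur, swims, lays_eggs) for each animal, as above.
ROWS = [
    (True,  True,  False, False, True),   # ostrich
    (False, False, False, True,  True),   # crocodile
    (True,  True,  False, False, True),   # raven
    (True,  True,  False, False, True),   # albatross
    (True,  False, False, True,  False),  # dolphin
    (True,  False, True,  False, False),  # koala
]
ATTRS = ["warm_blooded", "feathers", "fur", "swims"]

def negentropy(p_on, p_off):
    if p_on == 0 or p_off == 0:
        return 0.0
    return -p_on * math.log2(p_on) - p_off * math.log2(p_off)

for i, name in enumerate(ATTRS):
    on  = [r for r in ROWS if r[i]]       # samples where the attribute is true
    off = [r for r in ROWS if not r[i]]   # samples where it is false
    on_hits  = sum(r[-1] for r in on)     # how many of those lay eggs
    off_hits = sum(r[-1] for r in off)
    ne = 0.0
    if on:
        ne += len(on)  / len(ROWS) * negentropy(on_hits  / len(on),  1 - on_hits  / len(on))
    if off:
        ne += len(off) / len(ROWS) * negentropy(off_hits / len(off), 1 - off_hits / len(off))
    print(name, len(on), on_hits, len(off), off_hits, round(ne, 6))   # matches the Table 2 rows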
None of the attributes has a negentropy of zero, so none classifies all cases correctly.
However, the "feathers" attribute has the lowest negentropy and hence conveys the most information.
This is what we use as the root attribute for our decision tree.
All animals that have feathers lay eggs, so the negentropy for this subset is zero and there are no further questions to ask.
When the animal does not have feathers, you compute the negentropies using only those animals to produce Table 3.
Attribute       On_ctr   Hits   Off_ctr   Hits   Negentropy
Warm blooded       2       0        1       1     0.0
Feathers           0       0        3       1     0.918296
Fur                1       0        2       1     0.666666
Swims              2       1        1       0     0.666666
The attribute to use for the subtree is therefore "warm blooded," which has the lowest negentropy in this sample.
The zero value indicates that all samples are now classified correctly.
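
The same loop, restricted to the three animals without feathers (crocodile, dolphin, and koala), reproduces Table 3. Again a self-contained sketch, with names of my own choosing:

import math

# (warm_blooded, feathers, fur, swims, lays_eggs) for the featherless animals.
SUBSET = [
    (False, False, False, True,  True),   # crocodile
    (True,  False, False, True,  False),  # dolphin
    (True,  False, True,  False, False),  # koala
]
ATTRS = ["warm_blooded", "feathers", "fur", "swims"]

def negentropy(p_on, p_off):
    return 0.0 if p_on == 0 or p_off == 0 else -p_on * math.log2(p_on) - p_off * math.log2(p_off)

for i, name in enumerate(ATTRS):
    on  = [r for r in SUBSET if r[i]]
    off = [r for r in SUBSET if not r[i]]
    ne = sum(len(part) / len(SUBSET) *
             negentropy(sum(r[-1] for r in part) / len(part),
                        1 - sum(r[-1] for r in part) / len(part))
             for part in (on, off) if part)
    print(name, round(ne, 6))   # 0.0, 0.918296, 0.666667, 0.666667 (Table 3, up to rounding)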
At the outset of processing, you scan all of the variables to select an attribute i and a value r so that the condition Attribute(i) >= r yields the lowest negentropy possible.
The "yes/no" question is now of the form "Is attribute greater than or equal to r?"
If the negentropy arising from this partition is zero, then the data was classified perfectly, and the calculation ends. Otherwise, the dataset is partitioned into two sets--one for which the first condition is true, and the other for false.
The process is repeated recursively over the two subsets until all negentropies are zero (perfect classification) or until no more splitting is possible and the negentropy is nonzero.
In this case, classification is not possible and you must solicit more data from the user. Such a situation could occur when all attributes in two records have the same values, but the outcomes differ. More data must then be supplied to the program to allow it to discriminate between the two cases.
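
For real-valued attributes, the scan over candidate attribute/threshold pairs described above might be sketched as follows. The helper names and the small numeric dataset are made up for illustration; they are not the article's code.

import math

def negentropy(p_on, p_off):
    return 0.0 if p_on == 0 or p_off == 0 else -p_on * math.log2(p_on) - p_off * math.log2(p_off)

def split_negentropy(rows, outcomes, attr, r):
    """Weighted negentropy of the partition induced by Attribute(attr) >= r."""
    on  = [o for row, o in zip(rows, outcomes) if row[attr] >= r]
    off = [o for row, o in zip(rows, outcomes) if row[attr] <  r]
    total = 0.0
    for part in (on, off):
        if part:
            p = sum(part) / len(part)
            total += len(part) / len(rows) * negentropy(p, 1 - p)
    return total

def best_split(rows, outcomes):
    """Try every attribute index i and every observed value r as a threshold."""
    best = None
    for i in range(len(rows[0])):
        for r in {row[i] for row in rows}:
            ne = split_negentropy(rows, outcomes, i, r)
            if best is None or ne < best[0]:
                best = (ne, i, r)
    return best

# Hypothetical example: one real-valued attribute, outcome true when value >= 2.5.
rows     = [(1.0,), (2.0,), (3.0,), (4.0,)]
outcomes = [False, False, True, True]
print(best_split(rows, outcomes))   # (0.0, 0, 3.0): split on "attribute 0 >= 3.0"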
Tree Pruning
Most real-world datasets do not have the convenient properties shown here. Instead, noisy data with measurement errors or incorrect classification for some examples can lead to very bushy trees in which the rule tree has many special cases to classify small numbers of uninteresting samples.
One way to address this problem is to use "rule-tree pruning."
Instead of stopping when the negentropy reaches zero, you stop when it reaches some sufficiently small value, indicating that you are near the end of a branch.
This pruning leaves a small number of examples incorrectly classified, but the overall structure of the decision tree will be preserved.
Finding the exact, nonzero cutoff value will be a matter of experimentation.
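
In code, the change is only to the stopping test. A sketch, with an arbitrary cutoff value that would in practice be tuned by experiment:

def should_stop(neg_entropy: float, cutoff: float = 0.05) -> bool:
    """Stop growing a branch when its negentropy is small enough, rather than
    exactly zero.  The cutoff here is arbitrary, not a value from the article."""
    return neg_entropy <= cutoff

print(should_stop(0.0), should_stop(0.03), should_stop(0.4))   # True True False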
In implementing the decision tree described here, I've represented it within the program (see Listing One) as a binary tree, constructed of NODE structures pointing to other NODEs or to NULLs for terminal nodes.
Rather than copying the data table for each partition, I pass the partially formed data tree to the routine that calculates negentropy, allowing the program to exclude records that are not relevant for that part of the tree. Negentropy of a partition is calculated in routine negentropy (see Listing Two), which is called for all attribute/threshold combinations by routine ID3 (Listing Three).
The ability to use real-valued as well as binary-valued attributes comes at a price. To ensure the correct value of r, we scan through all attribute values in the dataset--a process that can be quite computationally intensive for large datasets.
No claims are made for the efficiency of this implementation. For cases where many sample attribute values are the same, or where a mixture of real-valued and binary-valued attributes is to be considered, the user is probably better advised to sort the attributes into a list and to eliminate repeated values. I've also not considered the case where a question can have more than two outcomes.
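
The listings themselves are not reproduced here. As a rough Python analogue of the NODE-based tree described above (the class and field names are mine, not the article's):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal nodes record the question "attribute[attr_index] >= threshold";
    # leaves record the predicted outcome and have no children.
    attr_index: Optional[int] = None
    threshold: Optional[float] = None
    on_child: Optional["Node"] = None    # branch taken when the condition holds
    off_child: Optional["Node"] = None   # branch taken when it does not
    outcome: Optional[bool] = None       # set only on leaf nodes

    def classify(self, sample) -> bool:
        if self.outcome is not None:
            return self.outcome
        child = self.on_child if sample[self.attr_index] >= self.threshold else self.off_child
        return child.classify(sample)

# Hand-built version of tree (a) for the animal example:
# feathers -> lays eggs; otherwise warm blooded -> does not lay eggs; otherwise lays eggs.
# Attribute order in a sample: (warm_blooded, feathers), booleans compared against 1.
tree = Node(attr_index=1, threshold=1,
            on_child=Node(outcome=True),
            off_child=Node(attr_index=0, threshold=1,
                           on_child=Node(outcome=False),
                           off_child=Node(outcome=True)))
print(tree.classify((1, 1)))   # raven: feathers -> True (lays eggs)
print(tree.classify((1, 0)))   # koala or dolphin -> False
print(tree.classify((0, 0)))   # crocodile -> True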
http://www.drdobbs.com/database/algorithm-alley/184409907?queryText=ID3%2Bclassification
Example:
http://cs.nyu.edu/courses/spring04/G22.2560-001/id3-ex.txt
Example of ID3 algorithm
This example shows the construction of a decision tree where P, Q,
and C are the predictive attributes and R is the classification attribute.
I've added line numbers for convenient reference; the line number is _not_
an attribute.
Line number   P   Q   R   C   Number of instances
     1        Y   Y   1   Y          1
     2        Y   Y   2   N         10
     3        Y   Y   3   Y          3
     4        Y   N   1   Y          2
     5        Y   N   2   Y         11
     6        Y   N   3   Y          0
     7        N   Y   1   Y          2
     8        N   Y   2   N         20
     9        N   Y   3   Y          3
    10        N   N   1   Y          1
    11        N   N   2   Y         15
    12        N   N   3   Y          3

Total number of instances           71
The above table is T0.
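
For the cross-checks further down, the table transcribes directly into Python. The tuple layout and names are my own, not the course's:

# One tuple per table line: (P, Q, R, C, number_of_instances), Y/N as True/False.
TABLE = [
    (True,  True,  1, True,   1),   # line 1
    (True,  True,  2, False, 10),   # line 2
    (True,  True,  3, True,   3),   # line 3
    (True,  False, 1, True,   2),   # line 4
    (True,  False, 2, True,  11),   # line 5
    (True,  False, 3, True,   0),   # line 6
    (False, True,  1, True,   2),   # line 7
    (False, True,  2, False, 20),   # line 8
    (False, True,  3, True,   3),   # line 9
    (False, False, 1, True,   1),   # line 10
    (False, False, 2, True,  15),   # line 11
    (False, False, 3, True,   3),   # line 12
]
assert sum(row[-1] for row in TABLE) == 71   # total number of instances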
Create root node N1.
C1: ID3(T,R)
C2: AVG_ENTROPY(P,R,T)
C3: FREQUENCY(P,Y,T) = (1+10+3+2+11+0)/71 = 27/71 = 0.3803
C4: SUBTABLE(P,Y,T) = lines 1,2,3,4,5,6. Call this T1. Size(T1)= 27.
C5: ENTROPY(R,T1)
C6: FREQUENCY(R,1,T1) = (1+2)/27 = 0.1111
C7: FREQUENCY(R,2,T1) = (10+11)/27 = 0.7778
C8: FREQUENCY(R,3,T1) = (3+0)/27 = 0.1111
C5: Return -(0.1111 log(0.1111) + 0.7778 log(0.7778) + 0.1111 log(0.1111))
= 0.9864
C8.1: FREQUENCY(P,N,T) = (2+20+3+1+15+3)/71 = 44/71 = 0.6197
C9: SUBTABLE(P,N,T) = lines 7,8,9,10,11,12. Call this T2. Size(T2) = 44.
C10: ENTROPY(R,T2)
C11: FREQUENCY(R,1,T2) = (2+1)/44 = 0.0682
C12: FREQUENCY(R,2,T2) = (20+15)/44 = 0.7955
C13: FREQUENCY(R,3,T2) = (3+3)/44 = 0.1364
C10: Return -(0.0682 log(0.0682) + 0.7955 log(0.7955) + 0.1364 log(0.1364))
= 0.9188
C2: Return (27/71) * 0.9864 + (44/71) * 0.9188 = 0.9445
C14: AVG_ENTROPY(Q,R,T)
C15: FREQUENCY(Q,Y,T) = (1+10+3+2+20+3)/71 = 39/71 = 0.5493
C16: SUBTABLE(Q,Y,T) = lines 1,2,3,7,8,9. Call this T3. Size(T3)= 39.
C17: ENTROPY(R,T3)
C18: FREQUENCY(R,1,T3) = (1+2)/39 = 0.0769
C19: FREQUENCY(R,2,T3) = (10+20)/39 = 0.7692
C20: FREQUENCY(R,3,T3) = (3+3)/39 = 0.1538
C17: Return -(0.0769 log(0.0769) + 0.7692 log(0.7692) + 0.1538 log(0.1538))
= 0.9914
C21: FREQUENCY(Q,N,T) = (2+11+0+1+15+3)/71 = 32/71 = 0.4507
C21: SUBTABLE(Q,N,T) = lines 4,5,6,10,11,12. Call this T4. Size(T4) = 32.
C22: ENTROPY(R,T4)
C23: FREQUENCY(R,1,T4) = (2+1)/32 = 0.0938
C24: FREQUENCY(R,2,T4) = (11+15)/32 = 0.8125
C25: FREQUENCY(R,3,T4) = (0+3)/32 = 0.0938
C22: Return -(0.0938 log(0.0938) + 0.8125 log(0.8125) + 0.0938 log(0.0938))
= 0.8838
C14: Return (39/71) * 0.9914 + (32/71) * 0.8836 = 0.9428
From here on down, I'm abbreviating.
C26: AVG_ENTROPY(C,R,T)
C27: FREQUENCY(C,Y,T) = 41/71 = 0.5775
C28: SUBTABLE(C,Y,T) = all lines but 2 and 8. Call this T5.
C29: ENTROPY(R,T5) = -((6/41) log(6/41) + (26/41) log(26/41) +
      (9/41) log(9/41)) = 1.3028
C30: FREQUENCY(C,N,T) = 30/71 = 0.4225
C31: SUBTABLE(C,N,T) = lines 2 and 8. Call this T6.
C32: ENTROPY(R,T6) = 0 log 0 + (30/30) log(30/30) + 0 log 0 = 0
C26 returns (41/71) * 1.3028 = 0.7523.
ENTROPY(R,T) = -((6/71) log(6/71) + (56/71) log(56/71) + (9/71) log(9/71)) =
0.9492
Choose AS=C
Mark N1 as split on attribute C.
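
As a cross-check of the root-level comparison above, here is a self-contained Python sketch of ENTROPY and AVG_ENTROPY over the weighted table (the function names mirror the trace, but the code is mine; it repeats the table literal so it runs on its own, and the exact figures differ from the trace in the last decimal place because the trace rounds intermediate values):

import math

# (P, Q, R, C, count) for lines 1-12 of the table; Y/N encoded as True/False.
TABLE = [
    (True,  True,  1, True,   1), (True,  True,  2, False, 10),
    (True,  True,  3, True,   3), (True,  False, 1, True,   2),
    (True,  False, 2, True,  11), (True,  False, 3, True,   0),
    (False, True,  1, True,   2), (False, True,  2, False, 20),
    (False, True,  3, True,   3), (False, False, 1, True,   1),
    (False, False, 2, True,  15), (False, False, 3, True,   3),
]
P, Q, R, C, COUNT = range(5)

def size(rows):
    return sum(r[COUNT] for r in rows)

def entropy(rows):
    """ENTROPY(R, T): entropy of the class attribute R over the weighted rows."""
    n = size(rows)
    h = 0.0
    for v in (1, 2, 3):
        f = sum(r[COUNT] for r in rows if r[R] == v) / n
        if f > 0:
            h -= f * math.log2(f)
    return h

def avg_entropy(rows, attr):
    """AVG_ENTROPY(attr, R, T): size-weighted entropy after splitting on attr."""
    n = size(rows)
    total = 0.0
    for v in (True, False):
        sub = [r for r in rows if r[attr] == v]
        if size(sub) > 0:
            total += size(sub) / n * entropy(sub)
    return total

print(round(avg_entropy(TABLE, P), 4))   # 0.9445 (C2)
print(round(avg_entropy(TABLE, Q), 4))   # 0.9428 (C14)
print(round(avg_entropy(TABLE, C), 4))   # 0.7522 (C26; the trace rounds to 0.7523)
print(round(entropy(TABLE), 4))          # 0.9490 (the trace rounds to 0.9492)
# C gives the lowest average entropy, so C is chosen for the root split.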
C33: SUBTABLE(C,N,T) is T6 as before.
C34: ID3(T6,R)
Make a new node N2
In all instances in T6 (lines 2 and 8), X.R = 2. Therefore, this
is base case 2.
Label node N2 as "X.R=2"
C34 returns N2 to C1
In C1: Make an arc labelled "N" from N1 to N2.
************ This is as far as you should take the solution to problem 1 ***
C35: SUBTABLE(C,Y,T) is T5 as above.
C36: ID3(T5,R)
Create new node N3;
C37: AVG_ENTROPY(P,R,T5)
C38: SUBTABLE(P,Y,T5) is lines 1,3,4,5,6. Call this T7.
C39: ENTROPY(R,T7) = -((3/17) log(3/17) + (11/17)log(11/17) +
(3/17) log(3/17)) = 1.2898
C40: SUBTABLE(P,N,T5) is lines 7,9,10,11,12. Call this T8.
C41: ENTROPY(R,T8) = -((3/24) log(3/24) + (15/24) log(15/24) +
      (6/24) log(6/24)) = 1.2988
C37: AVG_ENTROPY(P,R,T5) = (17/41) 1.2898 + (24/41) 1.2988 = 1.2951
C42: AVG_ENTROPY(Q,R,T5)
C43: SUBTABLE(Q,Y,T5) is lines 1,3,7,9. Call this T9.
C44: ENTROPY(R,T9) = -((3/9) log(3/9) + 0 log 0 + (6/9) log(6/9)) = 0.9184
C45: SUBTABLE(Q,N,T5) is lines 4,5,6,10,11,12. This is table T4, above.
(except that the C column has been deleted)
C46: ENTROPY(R,T4) = 0.8836 (see C22 above)
C42: AVG_ENTROPY(Q,R,T5) = (9/41) 0.9184 + (32/41) 0.8836 = 0.8912
So we choose AS = Q.
C47: ENTROPY(R,T5) was calculated in C29 above to be 1.3028
Mark N3 as split on attribute Q.
C48: SUBTABLE(T5,Q,N) is T4 above: Lines 4,5,6,10,11,12
(minus columns C and Q)
C49: ID3(T4,R)
Create new node N4
C50: AVG_ENTROPY(P,R,T4)
C51: SUBTABLE(P,Y,T4) = lines 4,5,6. Call this T10.
C52: ENTROPY(R,T10) = -((2/13) log(2/13) + (11/13) log(11/13) + 0 log 0) =
      0.6194
C53: SUBTABLE(P,N,T4) = lines 10,11,12. Call this T11.
C54: ENTROPY(R,T11) = -((1/19) log(1/19) + (15/19) log(15/19) +
      (3/19) log(3/19)) = 0.9133
C50: AVG_ENTROPY(P,R,T4) = (13/32) * 0.6194 + (19/32) * 0.9133 = 0.7939
Choose AS = P (no other choices)
C55: ENTROPY(R,T4) was calculated in C22 to be 0.8836.
C49 continuing: Mark node N4 as split on P.
C56: SUBTABLE(T4,P,N) is T11 above (lines 10,11,12)
C57: ID3(T11,R)
Make new node N5
No predictive attributes remain
Label N5: "Prob(X.R=1) = 1/19.
Prob(X.R=2) = 15/19
Prob(X.R=3) = 3/19"
Return N5 to C49
C49 continuing: Make an arc labelled "N" from N4 to N5.
C58: SUBTABLE(T4,P,Y) is T10 above (lines 4,5,6)
C59: ID3(T10,R)
Make new node N6
No predictive attributes remain in T10
Label N6: "Prob(X.R=1) = 2/13.
Prob(X.R=2) = 11/13
Prob(X.R=3) = 0"
Return N6 to C49
C49 continuing: Make an arc labelled "Y" from N4 to N6
C49 returns N4 to C36.
C36 continuing: Make an arc labelled "N" from N3 to N4.
C60: SUBTABLE(T5,Q,Y) is T9 above (lines 1,3,7,9)
C61: ID3(T9,R)
Make a new node N7
C62: AVG_ENTROPY(P,R,T9)
C63: SUBTABLE(P,Y,T9) is lines 1 and 3. Call this T12.
C64: ENTROPY(R,T12) = -((1/4) log(1/4) + (3/4) log(3/4)) = 0.8113
C65: SUBTABLE(P,N,T9) is lines 7 and 9. Call this T13.
C67: ENTROPY(R,T13) = -((2/5) log(2/5) + (3/5) log(3/5)) = 0.9710
C68: AVG_ENTROPY(P,R,T9) = (4/9) * 0.8113 + (5/9) * 0.9710 = 0.9000
AS is P
C69: ENTROPY(R,T9) is calculated in C44 as 0.9184
The result in C68 is not a substantial improvement over C69, particularly
considering the size of the table T9.
N7 is a leaf, labelled "Prob(X.R=1) = 3/9. Prob(X.R=3) = 6/9"
C61 returns N7 to C36.
C36 continuing: Make an arc labelled Y from N3 to N7
C36 returns N3 to C1
C1 continuing: Make an arc labelled Y from N1 to N3
C1 returns N1.
Final tree:

               ______________
              |      N1      |
              |  Split on C  |
               --------------
                     |
       ______N_______|_______Y_______
       |                            |
 ______________              ______________
|      N2      |            |      N3      |
| Prob(R=2)=1  |            |  Split on Q  |
 --------------              --------------
                                    |
                    ________N_______|_______Y________
                    |                               |
              ______________                 _______________
             |      N4      |               |      N7       |
             |  Split on P  |               | Prob(R=1)=3/9 |
              --------------                | Prob(R=3)=6/9 |
                    |                        ---------------
         _____N_____|_____Y_____
         |                     |
 _________________     _________________
|       N5        |   |       N6        |
| Prob(R=1)=1/19  |   | Prob(R=1)=2/13  |
| Prob(R=2)=15/19 |   | Prob(R=2)=11/13 |
| Prob(R=3)=3/19  |   | Prob(R=3)=0     |
 -----------------     -----------------
Note that, in a deterministic tree, there would be no point in the
split at N4, since both N5 and N6 predict R=2. This split would
be eliminated in post-processing.
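
To check the trace end to end, here is a compact, self-contained Python sketch of the recursion over the weighted table. It is not the course's pseudocode: the helper names, the handling of instance counts, and in particular the min_gain threshold (my stand-in for the informal "not a substantial improvement" judgment that made N7 a leaf) are assumptions of this sketch.

import math

# (P, Q, R, C, count) for lines 1-12 of the table; Y/N encoded as True/False.
TABLE = [
    (True,  True,  1, True,   1), (True,  True,  2, False, 10),
    (True,  True,  3, True,   3), (True,  False, 1, True,   2),
    (True,  False, 2, True,  11), (True,  False, 3, True,   0),
    (False, True,  1, True,   2), (False, True,  2, False, 20),
    (False, True,  3, True,   3), (False, False, 1, True,   1),
    (False, False, 2, True,  15), (False, False, 3, True,   3),
]
P, Q, R, C, COUNT = range(5)
NAMES = {P: "P", Q: "Q", C: "C"}

def size(rows):
    return sum(r[COUNT] for r in rows)

def entropy(rows):
    n = size(rows)
    freqs = [sum(r[COUNT] for r in rows if r[R] == v) / n for v in (1, 2, 3)]
    return -sum(f * math.log2(f) for f in freqs if f > 0)

def avg_entropy(rows, attr):
    n = size(rows)
    subs = [[r for r in rows if r[attr] == v] for v in (True, False)]
    return sum(size(s) / n * entropy(s) for s in subs if size(s) > 0)

def id3(rows, attrs, depth=0, min_gain=0.05):
    # min_gain approximates the informal stopping judgment applied at N7;
    # the value 0.05 is my own choice, not from the course notes.
    indent = "  " * depth
    n = size(rows)
    dist = {v: sum(r[COUNT] for r in rows if r[R] == v) / n for v in (1, 2, 3)}
    pure = any(f == 1.0 for f in dist.values())
    if not attrs or pure:
        print(indent + "leaf:", {v: round(f, 3) for v, f in dist.items() if f})
        return
    best = min(attrs, key=lambda a: avg_entropy(rows, a))
    if entropy(rows) - avg_entropy(rows, best) < min_gain:
        print(indent + "leaf:", {v: round(f, 3) for v, f in dist.items() if f})
        return
    print(indent + "split on " + NAMES[best])
    for v, label in ((False, "N"), (True, "Y")):
        sub = [r for r in rows if r[best] == v]
        print(indent + label + ":")
        if size(sub) == 0:
            print(indent + "  (no instances)")
            continue
        id3(sub, [a for a in attrs if a != best], depth + 1, min_gain)

id3(TABLE, [P, Q, C])

Run as-is, this prints a split on C at the root, a split on Q under the Y branch, a split on P under Q=N, and the same leaf distributions as N2, N5, N6, and N7 in the tree above.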