(a) Does animal have feathers?
    Yes: Lays eggs (raven, albatross, ostrich)
    No:  Is animal warm blooded?
         Yes: Does not lay eggs (koala, dolphin)
         No:  Lays eggs (crocodile)

(b) Does animal have fur?
    Yes: Does not lay eggs (koala)
    No:  Does animal have feathers?
         Yes: Lays eggs (raven, albatross, ostrich)
         No:  Is animal warm blooded?
              Yes: Does not lay eggs (dolphin)
              No:  Lays eggs (crocodile)
Animal      Warm blooded   Feathers   Fur   Swims   Lays eggs
Ostrich     Yes            Yes        No    No      Yes
Crocodile   No             No         No    Yes     Yes
Raven       Yes            Yes        No    No      Yes
Albatross   Yes            Yes        No    No      Yes
Dolphin     Yes            No         No    Yes     No
Koala       Yes            No         Yes   No      No
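
For use in the short sketches that follow, the sample data can be transcribed into Python. This is only a convenience for the reader; the structure and names are my own, not from the article's listings.

# The animal samples from the table above, one dict per animal.
# Attribute values are booleans; "lays_eggs" is the outcome to be predicted.
ANIMALS = [
    {"name": "Ostrich",   "warm_blooded": True,  "feathers": True,  "fur": False, "swims": False, "lays_eggs": True},
    {"name": "Crocodile", "warm_blooded": False, "feathers": False, "fur": False, "swims": True,  "lays_eggs": True},
    {"name": "Raven",     "warm_blooded": True,  "feathers": True,  "fur": False, "swims": False, "lays_eggs": True},
    {"name": "Albatross", "warm_blooded": True,  "feathers": True,  "fur": False, "swims": False, "lays_eggs": True},
    {"name": "Dolphin",   "warm_blooded": True,  "feathers": False, "fur": False, "swims": True,  "lays_eggs": False},
    {"name": "Koala",     "warm_blooded": True,  "feathers": False, "fur": True,  "swims": False, "lays_eggs": False},
]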
The best attribute is the one with the lowest "negentropy," as defined in Figure 2.
The ID3 algorithm picks the attribute with the lowest total negentropy.
Figure 2: Definition of negentropy. p(ON) and p(OFF) are the measured probabilities of the two outcomes within a partition.

If p(ON) and p(OFF) are both nonzero:
    NE = -p(ON) * log2 p(ON) - p(OFF) * log2 p(OFF)
Otherwise:
    NE = 0
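
A direct transcription of Figure 2 into Python might look like the following minimal sketch (the function name is mine, not from the article's listings):

import math

def negentropy(p_on: float, p_off: float) -> float:
    """Negentropy as defined in Figure 2, with the convention that the
    value is 0 when either probability is 0."""
    if p_on == 0.0 or p_off == 0.0:
        return 0.0
    return -p_on * math.log2(p_on) - p_off * math.log2(p_off)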
For example, there are five animals in the sample that are warm blooded, of which three lay eggs.
The negentropy of the "on" (true) partition is:

    NE_on = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.970951.
There is one animal in the sample that is not warm blooded but does lay eggs.
The negentropy of the "off" (false) partition is:

    NE_off = -(1/1) log2(1) - 0 = 0.
The combined negentropy is the weighted sum:

    (5/(5+1)) * NE_on + (1/(5+1)) * NE_off = 0.809125.
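
The same arithmetic, checked in Python from the counts quoted above (a small sketch):

import math

# Warm-blooded partition: 5 animals, 3 of which lay eggs.
ne_on = -(3/5) * math.log2(3/5) - (2/5) * math.log2(2/5)   # 0.970951
# Not warm-blooded partition: 1 animal, which lays eggs, so it is pure
# and Figure 2 gives 0.
ne_off = 0.0
# Weight each partition by its share of the six samples.
combined = (5/6) * ne_on + (1/6) * ne_off
print(round(ne_on, 6), ne_off, round(combined, 6))          # 0.970951 0.0 0.809125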
Table 2 shows the negentropies for all of the attributes:
Attribute       On_ctr   Hits   Off_ctr   Hits   Negentropy
Warm blooded       5       3        1       1     0.809125
Feathers           3       3        3       2     0.459148
Fur                1       0        5       4     0.601607
Swims              2       1        4       3     0.874185
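
The figures in Table 2 can be reproduced with a short loop over the attributes. The following is a self-contained sketch (it repeats the sample data and the Figure 2 formula so that it runs on its own; the variable names are mine):

import math

# (warm_blooded, feathers, fur, swims, lays_eggs) for each animal, as above.
ROWS = [
    (True,  True,  False, False, True),   # ostrich
    (False, False, False, True,  True),   # crocodile
    (True,  True,  False, False, True),   # raven
    (True,  True,  False, False, True),   # albatross
    (True,  False, False, True,  False),  # dolphin
    (True,  False, True,  False, False),  # koala
]
ATTRS = ["warm_blooded", "feathers", "fur", "swims"]

def negentropy(p_on, p_off):
    if p_on == 0 or p_off == 0:
        return 0.0
    return -p_on * math.log2(p_on) - p_off * math.log2(p_off)

for i, name in enumerate(ATTRS):
    on  = [r for r in ROWS if r[i]]       # samples where the attribute is true
    off = [r for r in ROWS if not r[i]]   # samples where it is false
    on_hits  = sum(r[-1] for r in on)     # how many of those lay eggs
    off_hits = sum(r[-1] for r in off)
    ne = 0.0
    if on:
        ne += len(on)  / len(ROWS) * negentropy(on_hits  / len(on),  1 - on_hits  / len(on))
    if off:
        ne += len(off) / len(ROWS) * negentropy(off_hits / len(off), 1 - off_hits / len(off))
    print(name, len(on), on_hits, len(off), off_hits, round(ne, 6))   # matches the Table 2 rows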
None of the attributes has a negentropy of zero, so none classifies all cases correctly.
However, the "feathers" attribute has the lowest negentropy and hence conveys the most information.
This is what we use as the root attribute for our decision tree.
All animals that have feathers lay eggs, so the negentropy for this subset is zero and there are no further questions to ask.
When the animal does not have feathers, you compute the negentropies using only those animals to produce Table 3.
Attribute       On_ctr   Hits   Off_ctr   Hits   Negentropy
Warm blooded       2       0        1       1     0.0
Feathers           0       0        3       1     0.918296
Fur                1       0        2       1     0.666666
Swims              2       1        1       0     0.666666
The attribute to use for the subtree is therefore "warm blooded," which has the lowest negentropy in this sample.
The zero value indicates that all samples are now classified correctly.
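
The same loop, restricted to the three animals without feathers (crocodile, dolphin, and koala), reproduces Table 3. Again a self-contained sketch, with names of my own choosing:

import math

# (warm_blooded, feathers, fur, swims, lays_eggs) for the featherless animals.
SUBSET = [
    (False, False, False, True,  True),   # crocodile
    (True,  False, False, True,  False),  # dolphin
    (True,  False, True,  False, False),  # koala
]
ATTRS = ["warm_blooded", "feathers", "fur", "swims"]

def negentropy(p_on, p_off):
    return 0.0 if p_on == 0 or p_off == 0 else -p_on * math.log2(p_on) - p_off * math.log2(p_off)

for i, name in enumerate(ATTRS):
    on  = [r for r in SUBSET if r[i]]
    off = [r for r in SUBSET if not r[i]]
    ne = sum(len(part) / len(SUBSET) *
             negentropy(sum(r[-1] for r in part) / len(part),
                        1 - sum(r[-1] for r in part) / len(part))
             for part in (on, off) if part)
    print(name, round(ne, 6))   # 0.0, 0.918296, 0.666667, 0.666667 (Table 3, up to rounding)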
At the outset of processing, you scan all of the variables to select an attribute i and a value r so that the condition Attribute(i) >= r yields the lowest negentropy possible.
The "yes/no" question is now of the form "Is attribute greater than or equal to r?"
If the negentropy arising from this partition is zero, then the data was classified perfectly, and the calculation ends. Otherwise, the dataset is partitioned into two sets--one for which the first condition is true, and the other for false.
The process is repeated recursively over the two subsets until all negentropies are zero (perfect classification) or until no more splitting is possible and the negentropy is nonzero.
In this case, classification is not possible and you must solicit more data from the user. Such a situation could occur when all attributes in two records have the same values, but the outcomes differ. More data must then be supplied to the program to allow it to discriminate between the two cases.
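
For real-valued attributes, the scan over candidate attribute/threshold pairs described above might be sketched as follows. The helper names and the small numeric dataset are made up for illustration; they are not the article's code.

import math

def negentropy(p_on, p_off):
    return 0.0 if p_on == 0 or p_off == 0 else -p_on * math.log2(p_on) - p_off * math.log2(p_off)

def split_negentropy(rows, outcomes, attr, r):
    """Weighted negentropy of the partition induced by Attribute(attr) >= r."""
    on  = [o for row, o in zip(rows, outcomes) if row[attr] >= r]
    off = [o for row, o in zip(rows, outcomes) if row[attr] <  r]
    total = 0.0
    for part in (on, off):
        if part:
            p = sum(part) / len(part)
            total += len(part) / len(rows) * negentropy(p, 1 - p)
    return total

def best_split(rows, outcomes):
    """Try every attribute index i and every observed value r as a threshold."""
    best = None
    for i in range(len(rows[0])):
        for r in {row[i] for row in rows}:
            ne = split_negentropy(rows, outcomes, i, r)
            if best is None or ne < best[0]:
                best = (ne, i, r)
    return best

# Hypothetical example: one real-valued attribute, outcome true when value >= 2.5.
rows     = [(1.0,), (2.0,), (3.0,), (4.0,)]
outcomes = [False, False, True, True]
print(best_split(rows, outcomes))   # (0.0, 0, 3.0): split on "attribute 0 >= 3.0"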
Tree Pruning
Most real-world datasets do not have the convenient properties shown here. Instead, noisy data with measurement errors or incorrect classification for some examples can lead to very bushy trees in which the rule tree has many special cases to classify small numbers of uninteresting samples.
One way to address this problem is to use "rule-tree pruning."
Instead of stopping when the negentropy reaches zero, you stop when it reaches some sufficiently small value, indicating that you are near the end of a branch.
This pruning leaves a small number of examples incorrectly classified, but the overall structure of the decision tree will be preserved.
Finding the exact, nonzero cutoff value will be a matter of experimentation.
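
In code, the change is only to the stopping test. A sketch, with an arbitrary cutoff value that would in practice be tuned by experiment:

def should_stop(neg_entropy: float, cutoff: float = 0.05) -> bool:
    """Stop growing a branch when its negentropy is small enough, rather than
    exactly zero.  The cutoff here is arbitrary, not a value from the article."""
    return neg_entropy <= cutoff

print(should_stop(0.0), should_stop(0.03), should_stop(0.4))   # True True False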
In implementing the decision tree described here, I've represented it within the program (see Listing One) as a binary tree, constructed of NODE structures pointing to other NODEs or to NULLs for terminal nodes.
Rather than copying the data table for each partition, I pass the partially formed data tree to the routine that calculates negentropy, allowing the program to exclude records that are not relevant for that part of the tree. Negentropy of a partition is calculated in routine negentropy (see Listing Two), which is called for all attribute/threshold combinations by routine ID3 (Listing Three).
The ability to use real-valued as well as binary-valued attributes comes at a price. To ensure the correct value of r, we scan through all attribute values in the dataset--a process that can be quite computationally intensive for large datasets.
No claims are made for the efficiency of this implementation. For cases where many sample attribute values are the same, or where a mixture of real-valued and binary-valued attributes is to be considered, the user is probably better advised to sort the attributes into a list and to eliminate repeated values. I've also not considered the case where a question can have more than two outcomes.
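
The listings themselves are not reproduced here. As a rough Python analogue of the NODE-based tree described above (the class and field names are mine, not the article's):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal nodes record the question "attribute[attr_index] >= threshold";
    # leaves record the predicted outcome and have no children.
    attr_index: Optional[int] = None
    threshold: Optional[float] = None
    on_child: Optional["Node"] = None    # branch taken when the condition holds
    off_child: Optional["Node"] = None   # branch taken when it does not
    outcome: Optional[bool] = None       # set only on leaf nodes

    def classify(self, sample) -> bool:
        if self.outcome is not None:
            return self.outcome
        child = self.on_child if sample[self.attr_index] >= self.threshold else self.off_child
        return child.classify(sample)

# Hand-built version of tree (a) for the animal example:
# feathers -> lays eggs; otherwise warm blooded -> does not lay eggs; otherwise lays eggs.
# Attribute order in a sample: (warm_blooded, feathers), booleans compared against 1.
tree = Node(attr_index=1, threshold=1,
            on_child=Node(outcome=True),
            off_child=Node(attr_index=0, threshold=1,
                           on_child=Node(outcome=False),
                           off_child=Node(outcome=True)))
print(tree.classify((1, 1)))   # raven: feathers -> True (lays eggs)
print(tree.classify((1, 0)))   # koala or dolphin -> False
print(tree.classify((0, 0)))   # crocodile -> True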
http://www.drdobbs.com/database/algorithm-alley/184409907?queryText=ID3%2Bclassification
Example:
http://cs.nyu.edu/courses/spring04/G22.2560-001/id3-ex.txt
Example of ID3 algorithm
This example shows the construction of a decision tree where P, Q,
and C are the predictive attributes and R is the classification attribute.
I've added line numbers for convenient reference; the line number is _not_
an attribute.
Line number   P   Q   R   C   Number of instances
     1        Y   Y   1   Y          1
     2        Y   Y   2   N         10
     3        Y   Y   3   Y          3
     4        Y   N   1   Y          2
     5        Y   N   2   Y         11
     6        Y   N   3   Y          0
     7        N   Y   1   Y          2
     8        N   Y   2   N         20
     9        N   Y   3   Y          3
    10        N   N   1   Y          1
    11        N   N   2   Y         15
    12        N   N   3   Y          3

Total number of instances           71
The above table is T0.
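
For the cross-checks further down, the table transcribes directly into Python. The tuple layout and names are my own, not the course's:

# One tuple per table line: (P, Q, R, C, number_of_instances), Y/N as True/False.
TABLE = [
    (True,  True,  1, True,   1),   # line 1
    (True,  True,  2, False, 10),   # line 2
    (True,  True,  3, True,   3),   # line 3
    (True,  False, 1, True,   2),   # line 4
    (True,  False, 2, True,  11),   # line 5
    (True,  False, 3, True,   0),   # line 6
    (False, True,  1, True,   2),   # line 7
    (False, True,  2, False, 20),   # line 8
    (False, True,  3, True,   3),   # line 9
    (False, False, 1, True,   1),   # line 10
    (False, False, 2, True,  15),   # line 11
    (False, False, 3, True,   3),   # line 12
]
assert sum(row[-1] for row in TABLE) == 71   # total number of instances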
Create root node N1.
C1: ID3(T,R)
C2: AVG_ENTROPY(P,R,T)
C3: FREQUENCY(P,Y,T) = (1+10+3+2+11+0)/71 = 27/71 = 0.3803
C4: SUBTABLE(P,Y,T) = lines 1,2,3,4,5,6. Call this T1. Size(T1)= 27.
C5: ENTROPY(R,T1)
C6: FREQUENCY(R,1,T1) = (1+2)/27 = 0.1111
C7: FREQUENCY(R,2,T1) = (10+11)/27 = 0.7778
C8: FREQUENCY(R,3,T1) = (3+0)/27 = 0.1111
C5: Return -(0.1111 log(0.1111) + 0.7778 log(0.7778) + 0.1111 log(0.1111))
= 0.9864
C8.1: FREQUENCY(P,N,T) = (2+20+3+1+15+3)/71 = 44/71 = 0.6197
C9: SUBTABLE(P,N,T) = lines 7,8,9,10,11,12. Call this T2. Size(T2) = 44.
C10: ENTROPY(R,T2)
C11: FREQUENCY(R,1,T2) = (2+1)/44 = 0.0682
C12: FREQUENCY(R,2,T2) = (20+15)/44 = 0.7955
C13: FREQUENCY(R,3,T2) = (3+3)/44 = 0.1364
C10: Return -(0.0682 log(0.0682) + 0.7955 log(0.7955) + 0.1364 log(0.1364))
= 0.9188
C2: Return (27/71) * 0.9864 + (44/71) * 0.9188 = 0.9445
C14: AVG_ENTROPY(Q,R,T)
C15: FREQUENCY(Q,Y,T) = (1+10+3+2+20+3)/71 = 39/71 = 0.5493
C16: SUBTABLE(Q,Y,T) = lines 1,2,3,7,8,9. Call this T3. Size(T3)= 39.
C17: ENTROPY(R,T3)
C18: FREQUENCY(R,1,T3) = (1+2)/39 = 0.0769
C19: FREQUENCY(R,2,T3) = (10+20)/39 = 0.7692
C20: FREQUENCY(R,3,T3) = (3+3)/39 = 0.1538
C17: Return -(0.0769 log(0.0769) + 0.7692 log(0.7692) + 0.1538 log(0.1538))
= 0.9914
C21: FREQUENCY(Q,N,T) = (2+11+0+1+15+3)/71 = 32/71 = 0.4507
C21: SUBTABLE(Q,N,T) = lines 4,5,6,10,11,12. Call this T4. Size(T4) = 32.
C22: ENTROPY(R,T4)
C23: FREQUENCY(R,1,T4) = (2+1)/32 = 0.0938
C24: FREQUENCY(R,2,T4) = (11+15)/32 = 0.8125
C25: FREQUENCY(R,3,T4) = (0+3)/32 = 0.0938
C22: Return -(0.0938 log(0.0938) + 0.8125 log(0.8125) + 0.0938 log(0.0938))
= 0.8838
C14: Return (39/71) * 0.9914 + (32/71) * 0.8836 = 0.9428
From here on down, I'm abbreviating.
C26: AVG_ENTROPY(C,R,T)
C27: FREQUENCY(C,Y,T) = 41/71 = 0.5775
C28: SUBTABLE(C,Y,T) = all lines but 2 and 8. Call this T5.
C29: ENTROPY(R,T5) = -((6/41) log(6/41) + (26/41) log(26/41) +
      (9/41) log(9/41)) = 1.3028
C30: FREQUENCY(C,N,T) = 30/71 = 0.4225
C31: SUBTABLE(C,N,T) = lines 2 and 8. Call this T6.
C32: ENTROPY(R,T6) = 0 log 0 + (30/30) log(30/30) + 0 log 0 = 0
C26 returns (41/71) * 1.3028 = 0.7523.
ENTROPY(R,T) = -((6/71) log(6/71) + (56/71) log(56/71) + (9/71) log(9/71)) =
0.9492
Choose AS=C
Mark N1 as split on attribute C.
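
As a cross-check of the root-level comparison above, here is a self-contained Python sketch of ENTROPY and AVG_ENTROPY over the weighted table (the function names mirror the trace, but the code is mine; it repeats the table literal so it runs on its own, and the exact figures differ from the trace in the last decimal place because the trace rounds intermediate values):

import math

# (P, Q, R, C, count) for lines 1-12 of the table; Y/N encoded as True/False.
TABLE = [
    (True,  True,  1, True,   1), (True,  True,  2, False, 10),
    (True,  True,  3, True,   3), (True,  False, 1, True,   2),
    (True,  False, 2, True,  11), (True,  False, 3, True,   0),
    (False, True,  1, True,   2), (False, True,  2, False, 20),
    (False, True,  3, True,   3), (False, False, 1, True,   1),
    (False, False, 2, True,  15), (False, False, 3, True,   3),
]
P, Q, R, C, COUNT = range(5)

def size(rows):
    return sum(r[COUNT] for r in rows)

def entropy(rows):
    """ENTROPY(R, T): entropy of the class attribute R over the weighted rows."""
    n = size(rows)
    h = 0.0
    for v in (1, 2, 3):
        f = sum(r[COUNT] for r in rows if r[R] == v) / n
        if f > 0:
            h -= f * math.log2(f)
    return h

def avg_entropy(rows, attr):
    """AVG_ENTROPY(attr, R, T): size-weighted entropy after splitting on attr."""
    n = size(rows)
    total = 0.0
    for v in (True, False):
        sub = [r for r in rows if r[attr] == v]
        if size(sub) > 0:
            total += size(sub) / n * entropy(sub)
    return total

print(round(avg_entropy(TABLE, P), 4))   # 0.9445 (C2)
print(round(avg_entropy(TABLE, Q), 4))   # 0.9428 (C14)
print(round(avg_entropy(TABLE, C), 4))   # 0.7522 (C26; the trace rounds to 0.7523)
print(round(entropy(TABLE), 4))          # 0.9490 (the trace rounds to 0.9492)
# C gives the lowest average entropy, so C is chosen for the root split.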
C33: SUBTABLE(C,N,T) is T6 as before.
C34: ID3(T6,R)
Make a new node N2
In all instances in T6 (lines 2 and 8), X.R = 2. Therefore, this
is base case 2.
Label node N2 as "X.R=2"
C34 returns N2 to C1
In C1: Make an arc labelled "N" from N1 to N2.
************ This is as far as you should take the solution to problem 1 ***
C35: SUBTABLE(C,Y,T) is T5 as above.
C36: ID3(T5,R)
Create new node N3;
C37: AVG_ENTROPY(P,R,T5)
C38: SUBTABLE(P,Y,T5) is lines 1,3,4,5,6. Call this T7.
C39: ENTROPY(R,T7) = -((3/17) log(3/17) + (11/17)log(11/17) +
(3/17) log(3/17)) = 1.2898
C40: SUBTABLE(P,N,T5) is lines 7,9,10,11,12. Call this T8.
C41: ENTROPY(R,T8) = -((3/24) log(3/24) + (15/24) log(15/24) +
      (6/24) log(6/24)) = 1.2988
C37: AVG_ENTROPY(P,R,T5) = (17/41) 1.2898 + (24/41) 1.2988 = 1.2951
C42: AVG_ENTROPY(Q,R,T5)
C43: SUBTABLE(Q,Y,T5) is lines 1,3,7,9. Call this T9.
C44: ENTROPY(R,T9) = -((3/9) log(3/9) + 0 log 0 + (6/9) log(6/9)) = 0.9184
C45: SUBTABLE(Q,N,T5) is lines 4,5,6,10,11,12. This is table T4, above.
(except that the C column has been deleted)
C46: ENTROPY(R,T4) = 0.8836 (see C22 above)
C42: AVG_ENTROPY(Q,R,T5) = (9/41) 0.9184 + (32/41) 0.8836 = 0.8912
So we choose AS = Q.
C47: ENTROPY(R,T5) was calculated in C29 above to be 1.3028
Mark N3 as split on attribute Q.
C48: SUBTABLE(T5,Q,N) is T4 above: Lines 4,5,6,10,11,12
(minus columns C and Q)
C49: ID3(T4,R)
Create new node N4
C50: AVG_ENTROPY(P,R,T4)
C51: SUBTABLE(P,Y,T4) = lines 4,5,6. Call this T10.
C52: ENTROPY(R,T10) = -((2/13) log(2/13) + (11/13) log(11/13) + 0 log 0) =
      0.6194
C53: SUBTABLE(P,N,T4) = lines 10,11,12. Call this T11.
C54: ENTROPY(R,T11) = -((1/19) log(1/19) + (15/19) log(15/19) +
      (3/19) log(3/19)) = 0.9133
C50: AVG_ENTROPY(P,R,T4) = (13/32) * 0.6194 + (19/32) * 0.9133 = 0.7939
Choose AS = P (no other choices)
C55: ENTROPY(R,T4) was calculated in C22 to be 0.8836.
C49 continuing: Mark node N4 as split on P.
C56: SUBTABLE(T4,P,N) is T11 above (lines 10,11,12)
C57: ID3(T11,R)
Make new node N5
No predictive attributes remain
Label N5: "Prob(X.R=1) = 1/19.
Prob(X.R=2) = 15/19
Prob(X.R=3) = 3/19"
Return N5 to C49
C49 continuing: Make an arc labelled "N" from N4 to N5.
C58: SUBTABLE(T4,P,Y) is T10 above (lines 4,5,6)
C59: ID3(T10,R)
Make new node N6
No predictive attributes remain in T10
Label N6: "Prob(X.R=1) = 2/13.
Prob(X.R=2) = 11/13
Prob(X.R=3) = 0"
Return N6 to C49
C49 continuing: Make an arc labelled "Y" from N4 to N6
C49 returns N4 to C36.
C36 continuing: Make an arc labelled "N" from N3 to N4.
C60: SUBTABLE(T5,Q,Y) is T9 above (lines 1,3,7,9)
C61: ID3(T9,R)
Make a new node N7
C62: AVG_ENTROPY(P,R,T9)
C63: SUBTABLE(P,Y,T9) is lines 1 and 3. Call this T12.
C64: ENTROPY(R,T12) = -((1/4) log(1/4) + (3/4) log(3/4)) = 0.8113
C65: SUBTABLE(P,N,T9) is lines 7 and 9. Call this T13.
C67: ENTROPY(R,T13) = -((2/5) log(2/5) + (3/5) log(3/5)) = 0.9710
C68: AVG_ENTROPY(P,R,T9) = (4/9) * 0.8113 + (5/9) * 0.9710 = 0.9000
AS is P
C69: ENTROPY(R,T9) is calculated in C44 as 0.9184
The result in C68 is not a substantial improvement over C69, particularly
considering the size of the table T9.
N7 is a leaf, labelled "Prob(X.R=1) = 3/9. Prob(X.R=3) = 6/9"
C61 returns N7 to C36.
C36 continuing: Make an arc labelled Y from N3 to N7
C36 returns N3 to C1
C1 continuing: Make an arc labelled Y from N1 to N3
C1 returns N1.
Final tree:

               ______________
              |      N1      |
              |  Split on C  |
               --------------
                     |
       ______N_______|_______Y_______
       |                            |
 ______________              ______________
|      N2      |            |      N3      |
| Prob(R=2)=1  |            |  Split on Q  |
 --------------              --------------
                                    |
                    ________N_______|_______Y________
                    |                               |
              ______________                 _______________
             |      N4      |               |      N7       |
             |  Split on P  |               | Prob(R=1)=3/9 |
              --------------                | Prob(R=3)=6/9 |
                    |                        ---------------
         _____N_____|_____Y_____
         |                     |
 _________________     _________________
|       N5        |   |       N6        |
| Prob(R=1)=1/19  |   | Prob(R=1)=2/13  |
| Prob(R=2)=15/19 |   | Prob(R=2)=11/13 |
| Prob(R=3)=3/19  |   | Prob(R=3)=0     |
 -----------------     -----------------
Note that, in a deterministic tree, there would be no point in the
split at N4, since both N5 and N6 predict R=2. This split would
be eliminated in post-processing.
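
To check the trace end to end, here is a compact, self-contained Python sketch of the recursion over the weighted table. It is not the course's pseudocode: the helper names, the handling of instance counts, and in particular the min_gain threshold (my stand-in for the informal "not a substantial improvement" judgment that made N7 a leaf) are assumptions of this sketch.

import math

# (P, Q, R, C, count) for lines 1-12 of the table; Y/N encoded as True/False.
TABLE = [
    (True,  True,  1, True,   1), (True,  True,  2, False, 10),
    (True,  True,  3, True,   3), (True,  False, 1, True,   2),
    (True,  False, 2, True,  11), (True,  False, 3, True,   0),
    (False, True,  1, True,   2), (False, True,  2, False, 20),
    (False, True,  3, True,   3), (False, False, 1, True,   1),
    (False, False, 2, True,  15), (False, False, 3, True,   3),
]
P, Q, R, C, COUNT = range(5)
NAMES = {P: "P", Q: "Q", C: "C"}

def size(rows):
    return sum(r[COUNT] for r in rows)

def entropy(rows):
    n = size(rows)
    freqs = [sum(r[COUNT] for r in rows if r[R] == v) / n for v in (1, 2, 3)]
    return -sum(f * math.log2(f) for f in freqs if f > 0)

def avg_entropy(rows, attr):
    n = size(rows)
    subs = [[r for r in rows if r[attr] == v] for v in (True, False)]
    return sum(size(s) / n * entropy(s) for s in subs if size(s) > 0)

def id3(rows, attrs, depth=0, min_gain=0.05):
    # min_gain approximates the informal stopping judgment applied at N7;
    # the value 0.05 is my own choice, not from the course notes.
    indent = "  " * depth
    n = size(rows)
    dist = {v: sum(r[COUNT] for r in rows if r[R] == v) / n for v in (1, 2, 3)}
    pure = any(f == 1.0 for f in dist.values())
    if not attrs or pure:
        print(indent + "leaf:", {v: round(f, 3) for v, f in dist.items() if f})
        return
    best = min(attrs, key=lambda a: avg_entropy(rows, a))
    if entropy(rows) - avg_entropy(rows, best) < min_gain:
        print(indent + "leaf:", {v: round(f, 3) for v, f in dist.items() if f})
        return
    print(indent + "split on " + NAMES[best])
    for v, label in ((False, "N"), (True, "Y")):
        sub = [r for r in rows if r[best] == v]
        print(indent + label + ":")
        if size(sub) == 0:
            print(indent + "  (no instances)")
            continue
        id3(sub, [a for a in attrs if a != best], depth + 1, min_gain)

id3(TABLE, [P, Q, C])

Run as-is, this prints a split on C at the root, a split on Q under the Y branch, a split on P under Q=N, and the same leaf distributions as N2, N5, N6, and N7 in the tree above.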