(a) Does animal have feathers?
    Yes: Lays eggs (raven, albatross, ostrich)
    No:  Is animal warm blooded?
         Yes: Does not lay eggs (koala, dolphin)
         No:  Lays eggs (crocodile)

(b) Does animal have fur?
    Yes: Does not lay eggs (koala)
    No:  Does animal have feathers?
         Yes: Lays eggs (raven, albatross, ostrich)
         No:  Is animal warm blooded?
              Yes: Does not lay eggs (dolphin)
              No:  Lays eggs (crocodile)
Animal      Warm blooded  Feathers  Fur  Swims  Lays eggs
Ostrich     Yes           Yes       No   No     Yes
Crocodile   No            No        No   Yes    Yes
Raven       Yes           Yes       No   No     Yes
Albatross   Yes           Yes       No   No     Yes
Dolphin     Yes           No        No   Yes    No
Koala       Yes           No        Yes  No     No
The best attribute is the one with the lowest "negentropy," as defined in Figure 2.
The ID3 algorithm picks the attribute with the lowest total negentropy.
Figure 2: Definition of negentropy. p(ON) and p(OFF) are the measured probabilities of the two outcomes (lays eggs, does not lay eggs) within the partition being evaluated.

    If p(ON) and p(OFF) are both nonzero:
        negentropy = -p(ON) * log2 p(ON) - p(OFF) * log2 p(OFF)
    Otherwise:
        negentropy = 0
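In C, the definition in Figure 2 amounts to a few lines. The following is a minimal sketch; the function name and the convention of passing outcome counts rather than probabilities are mine, not taken from the article's listings.

    #include <math.h>

    /* Negentropy of one partition, per Figure 2. pos and neg are the
       numbers of positive ("lays eggs") and negative outcomes observed
       in the partition.                                                */
    double partition_negentropy(int pos, int neg)
    {
        double total, p_on, p_off;

        if (pos == 0 || neg == 0)       /* the "Otherwise: 0" case */
            return 0.0;

        total = pos + neg;
        p_on  = pos / total;
        p_off = neg / total;
        return -p_on * log2(p_on) - p_off * log2(p_off);
    }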
For example, there are five animals in the sample that are warm blooded, of which three lay eggs.
The negentropy of the "on" (true) partition is:
NE_on = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.970951. |
There is one animal in the sample that is not warm blooded but does lay eggs.
The negentropy of the "off" (false) partition is:
NE_off = -(1/1) log2(1) - 0 = 0. |
The combined negentropy is the weighted sum:
    (5 / (5+1)) * NE_on + (1 / (5+1)) * NE_off = (5/6) * 0.970951 + (1/6) * 0 = 0.809125.
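The same weighted sum is easy to check in a few lines of C. This is a standalone sketch using the counts quoted in the text; it is not code from the listings.

    #include <stdio.h>
    #include <math.h>

    /* Per-partition negentropy, as in Figure 2. */
    static double ne(int pos, int neg)
    {
        double t, p, q;
        if (pos == 0 || neg == 0)
            return 0.0;
        t = pos + neg;
        p = pos / t;
        q = neg / t;
        return -p * log2(p) - q * log2(q);
    }

    int main(void)
    {
        /* Warm blooded: 5 animals on (3 lay eggs), 1 animal off (1 lays eggs). */
        double ne_on  = ne(3, 2);                       /* 0.970951 */
        double ne_off = ne(1, 0);                       /* 0.0      */
        double total  = (5.0 / 6.0) * ne_on + (1.0 / 6.0) * ne_off;

        printf("combined negentropy = %f\n", total);    /* 0.809125 */
        return 0;
    }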
Table 2 shows the negentropies for all of the attributes:
Attribute      on_ctr  Hits  off_ctr  Hits  Negentropy
Warm blooded      5      3      1       1    0.809125
Feathers          3      3      3       2    0.459148
Fur               1      0      5       4    0.601607
Swims             2      1      4       3    0.874185
None of the attributes has a negentropy of zero, so none classifies all cases correctly.
However, the "feathers" attribute has the lowest negentropy and hence conveys the most information.
This is what we use as the root attribute for our decision tree.
All animals that have feathers lay eggs, so the negentropy for this subset is zero and there are no further questions to ask.
When the animal does not have feathers, you compute the negentropies using only those animals to produce Table 3.
Attribute      on_ctr  Hits  off_ctr  Hits  Negentropy
Warm blooded      2      0      1       0    0.0
Feathers          0      0      3       1    0.918296
Fur               1      0      2       1    0.666666
Swims             2      1      1       0    0.666666
The attribute to use for the subtree is therefore "warm blooded," which has the lowest negentropy in this sample.
The zero value indicates that all samples are now classified correctly.
At the outset of processing, you scan all of the variables to select an attribute i and a value r so that the condition Attribute(i) >= r yields the lowest negentropy possible.
The "yes/no" question is now of the form "Is attribute greater than or equal to r?"
If the negentropy arising from this partition is zero, then the data has been classified perfectly and the calculation ends. Otherwise, the dataset is partitioned into two sets: one for which the condition is true and one for which it is false.
The process is repeated recursively over the two subsets until all negentropies are zero (perfect classification) or until no more splitting is possible and the negentropy is nonzero.
In this case, classification is not possible and you must solicit more data from the user. Such a situation could occur when all attributes in two records have the same values, but the outcomes differ. More data must then be supplied to the program to allow it to discriminate between the two cases.
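As a sketch of that search, assuming the records are held in a simple array: the RECORD layout, constants, and function names here are illustrative, not taken from Listings Two or Three.

    #include <math.h>

    #define NUM_ATTR 4

    typedef struct {
        double attr[NUM_ATTR];   /* real- or binary-valued attributes    */
        int    outcome;          /* 1 = positive class, 0 = negative     */
    } RECORD;

    static double ne(int pos, int neg)   /* per-partition negentropy */
    {
        double t, p, q;
        if (pos == 0 || neg == 0)
            return 0.0;
        t = pos + neg;
        p = pos / t;
        q = neg / t;
        return -p * log2(p) - q * log2(q);
    }

    /* Negentropy of splitting recs[0..n-1] on the test attr[i] >= r. */
    static double split_negentropy(const RECORD *recs, int n, int i, double r)
    {
        int on_pos = 0, on_neg = 0, off_pos = 0, off_neg = 0, k;
        for (k = 0; k < n; k++) {
            if (recs[k].attr[i] >= r) {
                if (recs[k].outcome) on_pos++;  else on_neg++;
            } else {
                if (recs[k].outcome) off_pos++; else off_neg++;
            }
        }
        return ((on_pos + on_neg)  * ne(on_pos, on_neg) +
                (off_pos + off_neg) * ne(off_pos, off_neg)) / n;
    }

    /* Scan every attribute and every observed value as a candidate
       threshold; assumes n > 0. The caller recurses on the two halves
       unless the returned negentropy is zero (or no split is possible). */
    static double best_split(const RECORD *recs, int n, int *best_i, double *best_r)
    {
        double best = 1e30;
        int i, k;
        for (i = 0; i < NUM_ATTR; i++)
            for (k = 0; k < n; k++) {
                double r = recs[k].attr[i];
                double s = split_negentropy(recs, n, i, r);
                if (s < best) { best = s; *best_i = i; *best_r = r; }
            }
        return best;
    }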
Tree Pruning
Most real-world datasets do not have the convenient properties shown here. Instead, noisy data with measurement errors or incorrect classifications for some examples can lead to very bushy trees with many special-case branches that classify small numbers of uninteresting samples.
One way to address this problem is to use "rule-tree pruning."
Instead of stopping when the negentropy reaches zero, you stop when it reaches some sufficiently small value, indicating that you are near the end of a branch.
This pruning leaves a small number of examples incorrectly classified, but the overall structure of the decision tree will be preserved.
The exact nonzero cutoff value to use is a matter of experimentation.
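In code, the only change is to the stopping test used by the recursive tree builder. The following is a sketch; the cutoff value and the names are illustrative, not from the listings.

    /* Stop splitting when the negentropy is "small enough" rather than
       exactly zero. PRUNE_CUTOFF is a tuning parameter to be found by
       experimentation; 0.0 restores the original behaviour.            */
    #define PRUNE_CUTOFF 0.15        /* illustrative value only */

    static int should_stop(double negentropy, int n_records)
    {
        return negentropy <= PRUNE_CUTOFF || n_records <= 1;
    }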
In implementing the decision tree described here, I've represented it within the program (see Listing One) as a binary tree constructed of NODE structures pointing to other NODEs, or to NULLs for terminal nodes.
Rather than copying the data table for each partition, I pass the partially formed data tree to the routine that calculates negentropy, allowing the program to exclude records that are not relevant for that part of the tree. Negentropy of a partition is calculated in routine negentropy (see Listing Two), which is called for all attribute/threshold combinations by routine ID3 (Listing Three).
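A plausible shape for that NODE structure is shown below; the field names are my guesses for illustration, and Listing One may differ.

    typedef struct node {
        int          attribute;  /* index i of the attribute tested at this node */
        double       threshold;  /* the value r in the test Attribute(i) >= r    */
        int          outcome;    /* classification reported when this is a leaf  */
        struct node *yes;        /* subtree taken when the test is true          */
        struct node *no;         /* subtree when false; both NULL at a terminal  */
    } NODE;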
The ability to use real-valued as well as binary-valued attributes comes at a price. To find the best value of r, the program scans through all attribute values in the dataset--a process that can be quite computationally intensive for large datasets.
No claims are made for the efficiency of this implementation. For cases where many sample attribute values are the same, or where a mixture of real-valued and binary-valued attributes is to be considered, the user is probably better advised to sort the attributes into a list and to eliminate repeated values. I've also not considered the case where a question can have more than two outcomes.
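One way to follow that advice is to sort each attribute's values and keep only the distinct ones as candidate thresholds. This is a qsort-based sketch of my own, not code from the listings.

    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Sort vals[0..n-1] in place and pack the distinct values at the
       front of the array; the return value is how many remain. These
       are the only thresholds r worth testing for this attribute.     */
    int unique_thresholds(double *vals, int n)
    {
        int i, m;
        if (n == 0)
            return 0;
        qsort(vals, n, sizeof(double), cmp_double);
        m = 1;
        for (i = 1; i < n; i++)
            if (vals[i] != vals[m - 1])
                vals[m++] = vals[i];
        return m;
    }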
http://www.drdobbs.com/database/algorithm-alley/184409907?queryText=ID3%2Bclassification
Example:
http://cs.nyu.edu/courses/spring04/G22.2560-001/id3-ex.txt

Example of ID3 algorithm

This example shows the construction of a decision tree where P, Q, and C are the predictive attributes and R is the classification attribute. I've added line numbers for convenient reference: this is _not_ an attribute.

Line number   P   Q   R   C   Number of instances
     1        Y   Y   1   Y            1
     2        Y   Y   2   N           10
     3        Y   Y   3   Y            3
     4        Y   N   1   Y            2
     5        Y   N   2   Y           11
     6        Y   N   3   Y            0
     7        N   Y   1   Y            2
     8        N   Y   2   N           20
     9        N   Y   3   Y            3
    10        N   N   1   Y            1
    11        N   N   2   Y           15
    12        N   N   3   Y            3
                   Total number of instances: 71

The above table is T0.

Create root node N1.
C1: ID3(T,R)
C2: AVG_ENTROPY(P,R,T)
C3: FREQUENCY(P,Y,T) = (1+10+3+2+11+0)/71 = 27/71 = 0.3803
C4: SUBTABLE(P,Y,T) = lines 1,2,3,4,5,6. Call this T1. Size(T1) = 27.
C5: ENTROPY(R,T1)
C6: FREQUENCY(R,1,T1) = (1+2)/27 = 0.1111
C7: FREQUENCY(R,2,T1) = (10+11)/27 = 0.7778
C8: FREQUENCY(R,3,T1) = (3+0)/27 = 0.1111
C5: Return -(0.1111 log(0.1111) + 0.7778 log(0.7778) + 0.1111 log(0.1111)) = 0.9864
C8.1: FREQUENCY(P,N,T) = (2+20+3+1+15+3)/71 = 44/71 = 0.6197
C9: SUBTABLE(P,N,T) = lines 7,8,9,10,11,12. Call this T2. Size(T2) = 44.
C10: ENTROPY(R,T2)
C11: FREQUENCY(R,1,T2) = (2+1)/44 = 0.0682
C12: FREQUENCY(R,2,T2) = (20+15)/44 = 0.7955
C13: FREQUENCY(R,3,T2) = (3+3)/44 = 0.1364
C10: Return -(0.0682 log(0.0682) + 0.7955 log(0.7955) + 0.1364 log(0.1364)) = 0.9188
C2: Return (27/71) * 0.9864 + (44/71) * 0.9188 = 0.9445

C14: AVG_ENTROPY(Q,R,T)
C15: FREQUENCY(Q,Y,T) = (1+10+3+2+20+3)/71 = 39/71 = 0.5493
C16: SUBTABLE(Q,Y,T) = lines 1,2,3,7,8,9. Call this T3. Size(T3) = 39.
C17: ENTROPY(R,T3)
C18: FREQUENCY(R,1,T3) = (1+2)/39 = 0.0769
C19: FREQUENCY(R,2,T3) = (10+20)/39 = 0.7692
C20: FREQUENCY(R,3,T3) = (3+3)/39 = 0.1538
C17: Return -(0.0769 log(0.0769) + 0.7692 log(0.7692) + 0.1538 log(0.1538)) = 0.9914
C21: FREQUENCY(Q,N,T) = (2+11+0+1+15+3)/71 = 32/71 = 0.4507
C21: SUBTABLE(Q,N,T) = lines 4,5,6,10,11,12. Call this T4. Size(T4) = 32.
C22: ENTROPY(R,T4)
C23: FREQUENCY(R,1,T4) = (2+1)/32 = 0.0938
C24: FREQUENCY(R,2,T4) = (11+15)/32 = 0.8125
C25: FREQUENCY(R,3,T4) = (0+3)/32 = 0.0938
C22: Return -(0.0938 log(0.0938) + 0.8125 log(0.8125) + 0.0938 log(0.0938)) = 0.8838
C14: Return (39/71) * 0.9914 + (32/71) * 0.8836 = 0.9394

From here on down, I'm abbreviating.

C26: AVG_ENTROPY(C,R,T)
C27: FREQUENCY(C,Y,T) = 41/71 = 0.5775
C28: SUBTABLE(C,Y,T) = all lines but 2 and 8. Call this T5.
C29: ENTROPY(R,T5) = -((6/41) log(6/41) + (26/41) log(26/41) + (9/41) log(9/41)) = 1.3028
C30: FREQUENCY(C,N,T) = 30/71 = 0.4225
C31: SUBTABLE(C,N,T) = lines 2 and 8. Call this T6.
C32: ENTROPY(R,T6) = -(0 log 0 + (30/30) log(30/30) + 0 log 0) = 0
C26 returns (41/71) * 1.3028 = 0.7523.

ENTROPY(R,T) = -((6/71) log(6/71) + (56/71) log(56/71) + (9/71) log(9/71)) = 0.9492

Choose AS = C.
Mark N1 as split on attribute C.

C33: SUBTABLE(C,N,T) is T6 as before.
C34: ID3(T6,R)
Make a new node N2.
In all instances in T6 (lines 2 and 8), X.R = 2. Therefore, this is base case 2.
Label node N2 as "X.R=2".
C34 returns N2 to C1.
In C1: Make an arc labelled "N" from N1 to N2.

************ This is as far as you should take the solution to problem 1 ***

C35: SUBTABLE(C,Y,T) is T5 as above.
C36: ID3(T5,R)
Create new node N3.
C37: AVG_ENTROPY(P,R,T5)
C38: SUBTABLE(P,Y,T5) is lines 1,3,4,5,6. Call this T7.
C39: ENTROPY(R,T7) = -((3/17) log(3/17) + (11/17) log(11/17) + (3/17) log(3/17)) = 1.2898
C40: SUBTABLE(P,N,T5) is lines 7,9,10,11,12. Call this T8.
C41: ENTROPY(R,T8) = -((3/24) log(3/24) + (15/24) log(15/24) + (6/24) log(6/24)) = 1.2988
C37: AVG_ENTROPY(P,R,T5) = (17/41) 1.2898 + (24/41) 1.2988 = 1.2951
C42: AVG_ENTROPY(Q,R,T5)
C43: SUBTABLE(Q,Y,T5) is lines 1,3,7,9. Call this T9.
C44: ENTROPY(R,T9) = -((3/9) log(3/9) + 0 log 0 + (6/9) log(6/9)) = 0.9184
C45: SUBTABLE(Q,N,T5) is lines 4,5,6,10,11,12. This is table T4, above (except that the C column has been deleted).
C46: ENTROPY(R,T4) = 0.8836 (see C22 above)
C42: AVG_ENTROPY(Q,R,T5) = (9/41) 0.9184 + (32/41) 0.8836 = 0.8912
So we choose AS = Q.
C47: ENTROPY(R,T5) was calculated in C29 above to be 1.3028.
Mark N3 as split on attribute Q.

C48: SUBTABLE(T5,Q,N) is T4 above: lines 4,5,6,10,11,12 (minus columns C and Q).
C49: ID3(T4,R)
Create new node N4.
C50: AVG_ENTROPY(P,R,T4)
C51: SUBTABLE(P,Y,T4) = lines 4,5,6. Call this T10.
C52: ENTROPY(R,T10) = -((2/13) log(2/13) + (11/13) log(11/13) + 0 log 0) = 0.6194
C53: SUBTABLE(P,N,T4) = lines 10,11,12. Call this T11.
C54: ENTROPY(R,T11) = -((1/19) log(1/19) + (15/19) log(15/19) + (3/19) log(3/19)) = 0.8264
C50: AVG_ENTROPY(P,R,T4) = (13/32) * 0.6194 + (19/32) * 0.8264 = 0.7423
Choose AS = P (no other choices).
C55: ENTROPY(R,T4) was calculated in C22 to be 0.8836.
C49 continuing: Mark node N4 as split on P.

C56: SUBTABLE(T4,P,N) is T11 above (lines 10,11,12).
C57: ID3(T11,R)
Make new node N5.
No predictive attributes remain.
Label N5: "Prob(X.R=1) = 1/19. Prob(X.R=2) = 15/19. Prob(X.R=3) = 3/19"
Return N5 to C49.
C49 continuing: Make an arc labelled "N" from N4 to N5.

C58: SUBTABLE(T4,P,Y) is T10 above (lines 4,5,6).
C59: ID3(T10,R)
Make new node N6.
No predictive attributes remain in T10.
Label N6: "Prob(X.R=1) = 2/13. Prob(X.R=2) = 11/13. Prob(X.R=3) = 0"
Return N6 to C49.
C49 continuing: Make an arc labelled "Y" from N4 to N6.
C49 returns N4 to C36.
C36 continuing: Make an arc labelled "N" from N3 to N4.

C60: SUBTABLE(T5,Q,Y) is T9 above (lines 1,3,7,9).
C61: ID3(T9,R)
Make a new node N7.
C62: AVG_ENTROPY(P,R,T9)
C63: SUBTABLE(P,Y,T9) is lines 1 and 3. Call this T12.
C64: ENTROPY(R,T12) = -((1/4) log(1/4) + (3/4) log(3/4)) = 0.8113
C65: SUBTABLE(P,N,T9) is lines 7 and 9. Call this T13.
C67: ENTROPY(R,T13) = -((2/5) log(2/5) + (3/5) log(3/5)) = 0.9710
C68: AVG_ENTROPY(P,R,T9) = (4/9) * 0.8113 + (5/9) * 0.9710 = 0.9000
AS is P.
C69: ENTROPY(R,T9) was calculated in C44 as 0.9184.
The result in C68 is not a substantial improvement over C69, particularly considering the size of the table T9.
N7 is a leaf, labelled "Prob(X.R=1) = 3/9. Prob(X.R=3) = 6/9".
C61 returns N7 to C36.
C36 continuing: Make an arc labelled "Y" from N3 to N7.
C36 returns N3 to C1.
C1 continuing: Make an arc labelled "Y" from N1 to N3.
C1 returns N1.

Final tree:

                      ______________
                      |     N1     |
                      | Split on C |
                      --------------
                             |
            ________N________|________Y_________
            |                                   |
     _______|______                      _______|______
     |     N2     |                      |     N3     |
     |Prob(R=2)=1.|                      | Split on Q |
     --------------                      --------------
                                                |
                                 ______N________|________Y______
                                 |                              |
                          _______|________             ________|_______
                          |      N4      |             |      N7      |
                          |  Split on P  |             |Prob(R=1)=3/9 |
                          ----------------             |Prob(R=3)=6/9 |
                                 |                     ----------------
                   ______N_______|________Y______
                   |                             |
           ________|________            ________|________
           |      N5       |            |      N6       |
           |Prob(R=1)=1/19 |            |Prob(R=1)=2/13 |
           |Prob(R=2)=15/19|            |Prob(R=2)=11/13|
           |Prob(R=3)=3/19 |            -----------------
           -----------------

Note that, in a deterministic tree, there would be no point in the split at N4, since both N5 and N6 predict R=2. This split would be eliminated in post-processing.
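For readers who want to check the arithmetic in the trace above, here is a small standalone C sketch of the ENTROPY and AVG_ENTROPY computations over this weighted table. The array encoding of the table and the function names are mine, chosen to mirror the trace; this is not the course's code.

    #include <stdio.h>
    #include <math.h>

    #define NROWS 12

    /* The example table: P, Q, C are 1 for Y and 0 for N; R is 1..3;
       count[] is the "Number of instances" column.                   */
    static const int P[NROWS]     = {1,1,1,1,1,1, 0,0,0,0,0,0};
    static const int Q[NROWS]     = {1,1,1,0,0,0, 1,1,1,0,0,0};
    static const int R[NROWS]     = {1,2,3,1,2,3, 1,2,3,1,2,3};
    static const int C[NROWS]     = {1,0,1,1,1,1, 1,0,1,1,1,1};
    static const int count[NROWS] = {1,10,3,2,11,0, 2,20,3,1,15,3};

    /* ENTROPY(R, subtable): entropy of R over rows where attr[i] == val;
       pass attr = NULL to use the whole table.                          */
    static double entropy_R(const int *attr, int val)
    {
        int n[4] = {0, 0, 0, 0}, total = 0, i;
        double h = 0.0;
        for (i = 0; i < NROWS; i++)
            if (attr == NULL || attr[i] == val) {
                n[R[i]] += count[i];
                total   += count[i];
            }
        for (i = 1; i <= 3; i++)
            if (n[i] > 0) {
                double p = (double)n[i] / total;
                h -= p * log2(p);
            }
        return h;
    }

    /* AVG_ENTROPY(attr, R, T): instance-weighted average of the entropy
       of R over the attr == Y and attr == N subtables.                  */
    static double avg_entropy(const int *attr)
    {
        int on = 0, off = 0, i;
        for (i = 0; i < NROWS; i++) {
            if (attr[i]) on  += count[i];
            else         off += count[i];
        }
        return ((double)on  / (on + off)) * entropy_R(attr, 1) +
               ((double)off / (on + off)) * entropy_R(attr, 0);
    }

    int main(void)
    {
        printf("AVG_ENTROPY(P,R,T) = %.4f\n", avg_entropy(P));    /* compare C2  */
        printf("AVG_ENTROPY(Q,R,T) = %.4f\n", avg_entropy(Q));    /* compare C14 */
        printf("AVG_ENTROPY(C,R,T) = %.4f\n", avg_entropy(C));    /* compare C26 */
        printf("ENTROPY(R,T)       = %.4f\n", entropy_R(NULL, 0)); /* whole table */
        return 0;
    }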