======================
      Homework #4
======================

========================================
PART A. Training and testing
========================================

The inputs are vectors with values for each of the four features:
outlook, temperature, humidity and wind.
The output is a label from the set {Play, Don't Play}.

The training data set contains 14 examples (each one has 4 features
+ 1 desired output). Using that data set and the C4.5 algorithm we
get the following decision tree:

outlook = overcast: Play (4.0)
outlook = sunny:
|   humidity <= 75 : Play (2.0)
|   humidity > 75 : Don't Play (3.0)
outlook = rain:
|   windy = true: Don't Play (2.0)
|   windy = false: Play (3.0)

The C4.5 software produces the following report:

Evaluation on training data (14 items):

     Before Pruning           After Pruning
    ----------------   ---------------------------
    Size      Errors   Size      Errors   Estimate

       8    0( 0.0%)      8    0( 0.0%)    (38.5%)   <<

The report says that the generated decision tree has a total of 8 nodes
(how many of them are leaves?) and that the tree does not commit any
errors when used with the training data. The pruning process does not
remove any node from the tree, so the error on the training data is
still zero.

However, since the training set does not cover all the possible
combinations of features, we cannot guarantee that the tree will have
the same performance when used with unknown inputs. The value 38.5%
is a (pessimistic) estimate of the error of the tree when used with
new data. That estimate usually depends on the amount of data used
for training, the number of features and the variability of the
training set.

========================================
PART B. Feature selection
========================================

There are 4 features in our example:
Wind, Outlook, Humidity and Temperature.

Based on the structure of the tree we can say that Outlook is the most
important feature and Temperature is the least important. Actually,
the tree does not use Temperature at all (!).

========================================
PART C.
========================================

There are 5 rules because the tree has 5 leaves:

IF Outlook = overcast                        THEN Play
IF (Outlook = sunny) AND (Humidity <= 75)    THEN Play
IF (Outlook = sunny) AND (Humidity > 75)     THEN Don't Play
IF (Outlook = rain)  AND (Windy = true)      THEN Don't Play
IF (Outlook = rain)  AND (Windy = false)     THEN Play
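The five rules can be read straight off the tree. As a small illustration,
here is a minimal Python sketch of the same rules written as a classifier
function. It is my own illustration, not C4.5 output; the dictionary keys
'outlook', 'humidity' and 'windy' are assumed names chosen to match the
features of golf.data.

    # Minimal sketch of the five rules extracted from the tree (Part C).
    def classify(example):
        """Return 'Play' or "Don't Play" for one example, following the tree."""
        if example['outlook'] == 'overcast':
            return 'Play'
        if example['outlook'] == 'sunny':
            return 'Play' if example['humidity'] <= 75 else "Don't Play"
        if example['outlook'] == 'rain':
            return "Don't Play" if example['windy'] else 'Play'
        raise ValueError("unknown outlook value: %r" % example['outlook'])

    # Example usage:
    # classify({'outlook': 'sunny', 'humidity': 70, 'windy': False})  ->  'Play'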
========================================
PART D. Computing Entropy
========================================

We have to compute the average entropy of each feature to measure the
effectiveness of that attribute as a classifier.

From here on we will use the following notation:

S : the training set (golf.data)
s : a training example (a row in the data)
W : wind feature
H : humidity feature
T : temperature feature
O : outlook feature
s(W), s(H), s(T), s(O) : the value of feature W, H, T or O in the example s
+ : Play
- : Don't Play

------
 Wind
------

The feature wind can take one of the values from {false, true}.
We divide the training set S into two subsets:

Wf = { s in S | s(W) = false }
Wt = { s in S | s(W) = true }

The subset Wf has 8 examples [2-,6+]
The subset Wt has 6 examples [3-,3+]

Now we compute the entropy for each subset:

E(Wf) = -(2/8)*log2(2/8) - (6/8)*log2(6/8) = 0.81128
E(Wt) = -(3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.0

And finally we can compute the average entropy for the feature "Wind":

avgEntropy(W) = (8/14)*E(Wf) + (6/14)*E(Wt)
              = (8/14)*(0.81128) + (6/14)*(1.0)
              = 0.89216

----------
 Outlook
----------

The feature outlook can take one of the values from {sunny, overcast, rain}.
Then we divide the training set S into three subsets:

Os = { s in S | s(O) = sunny }
Oo = { s in S | s(O) = overcast }
Or = { s in S | s(O) = rain }

The subset Os has 5 examples [3-,2+]
The subset Oo has 4 examples [0-,4+]
The subset Or has 5 examples [2-,3+]

Now we compute the entropy for each subset (taking 0*log2(0) = 0):

E(Os) = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.97095
E(Oo) =        0         - (4/4)*log2(4/4) = 0
E(Or) = -(2/5)*log2(2/5) - (3/5)*log2(3/5) = 0.97095

And finally we can compute the average entropy for the feature "Outlook":

avgEntropy(O) = (5/14)*E(Os) + (4/14)*E(Oo) + (5/14)*E(Or)
              = (5/14)*0.97095 + (4/14)*0 + (5/14)*0.97095
              = 0.69354

The remaining two features, Humidity and Temperature, are
continuous-valued, so computing the average entropy is a bit trickier.
We are going to find the threshold that divides our data with minimum
average entropy using the following procedure (a small code sketch of
the computation is given right after this list):

(1) Sort the training data by the feature F. For example:

     F   ...  class
    12   ...    +
    15   ...    -
    16   ...    +
    18   ...    +
    25   ...    -

(2) Each time the class labels of two consecutive examples si, sj are
    different, define a candidate threshold as the average
    (si(F) + sj(F)) / 2.

    For the previous example the candidate thresholds would be:

    F > (12+15)/2
    F > (15+16)/2
    F > (18+25)/2

(3) Choose the threshold that produces the lowest average entropy.
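Before applying this procedure to Humidity and Temperature by hand, here
is a small Python sketch of both computations: the average entropy of a
discrete feature (as done above for Wind and Outlook) and the threshold
search of steps (1)-(3). It is only a cross-check; the function names and
the representation of the data as (value, label) pairs are my own choices,
not anything produced by C4.5.

    from math import log2

    def entropy(labels):
        """Entropy of a list of class labels, e.g. ['+', '+', '-']."""
        n = len(labels)
        if n == 0:
            return 0.0
        return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                    for c in set(labels))

    def avg_entropy_discrete(pairs):
        """pairs: list of (feature_value, class_label).
        Weighted average entropy of the subsets induced by the feature."""
        n = len(pairs)
        return sum((len([c for v, c in pairs if v == value]) / n) *
                   entropy([c for v, c in pairs if v == value])
                   for value in set(v for v, _ in pairs))

    def best_threshold(pairs):
        """pairs: list of (numeric_value, class_label).
        Try a candidate threshold between every pair of consecutive examples
        with different labels (steps 1-2) and return the (threshold, avg_entropy)
        pair with the lowest average entropy (step 3)."""
        pairs = sorted(pairs)
        best_t, best_e = None, float('inf')
        for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
            if c1 == c2:
                continue
            t = (v1 + v2) / 2
            low = [c for v, c in pairs if v < t]
            high = [c for v, c in pairs if v >= t]
            avg = (len(low) * entropy(low) + len(high) * entropy(high)) / len(pairs)
            if avg < best_e:
                best_t, best_e = t, avg
        return best_t, best_e

    # Cross-check with the Wind computation above ([2-,6+] and [3-,3+]):
    # wind = ([('false', '+')] * 6 + [('false', '-')] * 2
    #         + [('true', '+')] * 3 + [('true', '-')] * 3)
    # avg_entropy_discrete(wind)  ->  approximately 0.89216
    # best_threshold(...) on the humidity column should reproduce the
    # h = 82.5 split found below.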
----------
 Humidity
----------

The feature Humidity is continuous-valued. Using each candidate
threshold h we can generate two subsets:

H0 = { s in S | s(H) <  h }
H1 = { s in S | s(H) >= h }

For the training set S the candidate thresholds are:

- h = 67.5
  H1 is composed of [5-,8+]  --->  E(H1) = 0.96124
  H0 is composed of [0-,1+]  --->  E(H0) = 0
  avgE(H) = 0.89258

- h = 72.5
  H1 is composed of [4-,6+]  --->  E(H1) = 0.97095
  H0 is composed of [1-,3+]  --->  E(H0) = 0.81128
  avgE(H) = 0.92533

- h = 79
  H1 is composed of [4-,4+]  --->  E(H1) = 1.0
  H0 is composed of [1-,5+]  --->  E(H0) = 0.65002
  avgE(H) = 0.85001

- h = 82.5
  H1 is composed of [3-,2+]  --->  E(H1) = 0.97095
  H0 is composed of [2-,7+]  --->  E(H0) = 0.76420
  avgE(H) = 0.83804   <----------------------- This is the best!

- h = 87.5
  H1 is composed of [2-,2+]  --->  E(H1) = 1.0
  H0 is composed of [3-,7+]  --->  E(H0) = 0.88129
  avgE(H) = 0.91521

- h = 92.5
  H1 is composed of [1-,1+]  --->  E(H1) = 1.0
  H0 is composed of [4-,8+]  --->  E(H0) = 0.91830
  avgE(H) = 0.92997

- h = 95.5
  H1 is composed of [0-,1+]  --->  E(H1) = 0.0
  H0 is composed of [5-,8+]  --->  E(H0) = 0.96124
  avgE(H) = 0.89258

Then we choose h = 82.5 as the threshold for H, which produces
avgE(H) = 0.83804.

-------------
 Temperature
-------------

The feature Temperature is also continuous-valued, so we use the same
procedure as for Humidity. Using each candidate threshold t we can
generate two subsets:

T0 = { s in S | s(T) <  t }
T1 = { s in S | s(T) >= t }

For the training set S the candidate thresholds are:

- t = 64.5
  T1 is composed of [5-,8+]  -->  E(T1) = 0.96124
  T0 is composed of [0-,1+]  -->  E(T0) = 0
  avgE(T) = 0.89258

- t = 66.5
  T1 is composed of [4-,8+]  -->  E(T1) = 0.91830
  T0 is composed of [1-,1+]  -->  E(T0) = 1.0
  avgE(T) = 0.92997

- t = 70.5
  T1 is composed of [4-,5+]  -->  E(T1) = 0.99108
  T0 is composed of [1-,4+]  -->  E(T0) = 0.72193
  avgE(T) = 0.89495

- t = 71.5
  T1 is composed of [3-,5+]  -->  E(T1) = 0.95443
  T0 is composed of [2-,4+]  -->  E(T0) = 0.91830
  avgE(T) = 0.93895

- t = 73.5
  T1 is composed of [2-,4+]  -->  E(T1) = 0.91830
  T0 is composed of [3-,5+]  -->  E(T0) = 0.95443
  avgE(T) = 0.93895

- t = 77.5
  T1 is composed of [2-,2+]  -->  E(T1) = 1.0
  T0 is composed of [3-,7+]  -->  E(T0) = 0.88129
  avgE(T) = 0.91521

- t = 80.5
  T1 is composed of [1-,2+]  -->  E(T1) = 0.91830
  T0 is composed of [4-,7+]  -->  E(T0) = 0.94566
  avgE(T) = 0.93980

- t = 84
  T1 is composed of [1-,0+]  -->  E(T1) = 0
  T0 is composed of [4-,9+]  -->  E(T0) = 0.89049
  avgE(T) = 0.82689   <---------------------- This is the best!

For Temperature the best threshold is t = 84, which produces an average
entropy avgE(T) = 0.82689.

Observe that with this threshold the subset T1 contains only one example
(and it is a "wrong" example!). Why does this happen? Is it correct to
choose t = 84 as the best threshold?

Finally we can compare the average entropies:

avgE(O) = 0.69354
avgE(W) = 0.89216
avgE(H) = 0.83804
avgE(T) = 0.82689

and choose Outlook as the most important feature.

(Observe that the average entropy of T is smaller than the average
entropy of W and H. Why does C4.5 still choose T as the least
important feature?)

========================================
PART E. Irrelevant attributes
========================================

You were asked to add an irrelevant attribute to the existing training
and testing sets. Before doing so, notice that an irrelevant feature is
a feature that has a high average entropy: the higher the average
entropy, the more irrelevant the feature is. The average entropy of the
new feature should be high relative to each of the other attributes and
for each of the partitions of the training set that the feature can
generate. Obviously, finding such a feature is not an easy task.

However, note that for small training sets a small variation in the
values of the "irrelevant" feature of some training examples can change
its apparent relevance dramatically. This behaviour can be even worse
when the new attribute can take many different values. (Why? How does
it affect the average entropy? The small experiment sketched below
illustrates the effect.)
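The following sketch is entirely my own illustration, not part of
golf.data or the assignment. It builds an "irrelevant" attribute at
random for 14 examples with the same class mix as our training set
(9 Play, 5 Don't Play), first with only two possible values and then
with fourteen distinct values (an ID-like attribute). With many
distinct values every subset contains a single example, so every subset
entropy is 0 and the average entropy collapses to 0, which makes the
attribute look like a perfect classifier even though it carries no
information.

    import random
    from math import log2

    def entropy(labels):
        """Entropy of a list of class labels (repeated here so the sketch is self-contained)."""
        n = len(labels)
        if n == 0:
            return 0.0
        return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                    for c in set(labels))

    def avg_entropy(pairs):
        """pairs: list of (attribute_value, class_label)."""
        n = len(pairs)
        return sum((len([c for v, c in pairs if v == value]) / n) *
                   entropy([c for v, c in pairs if v == value])
                   for value in set(v for v, _ in pairs))

    random.seed(1)                      # arbitrary seed, only for repeatability
    labels = ['+'] * 9 + ['-'] * 5      # same class mix as golf.data

    # Irrelevant attribute with only two possible values: its average entropy
    # typically stays close to the entropy of the whole set (about 0.94).
    two_valued = [(random.choice(['a', 'b']), c) for c in labels]

    # Irrelevant attribute with 14 distinct values: every subset holds a single
    # example, so its average entropy is exactly 0.
    many_valued = [(i, c) for i, c in enumerate(labels)]

    print(avg_entropy(two_valued))
    print(avg_entropy(many_valued))     # 0.0 -> looks "perfectly relevant"

This also suggests why a small change in the values of such an
attribute can swing its average entropy so much: when each value covers
only a few examples, moving a single example from one subset to another
changes the entropy of those subsets considerably.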