======================
      Homework #4
======================

========================================
PART A. Training and testing
========================================

The inputs are vectors with values for each of the four features:
outlook, temperature, humidity and wind.
The output is a label from the set {Play, Don't Play}.

The training data set contains 14 examples (each one has 4 features
+ 1 desired output). Using that data set and the C4.5 algorithm we
get the following decision tree:

outlook = overcast: Play (4.0)
outlook = sunny:
|   humidity <= 75 : Play (2.0)
|   humidity > 75 : Don't Play (3.0)
outlook = rain:
|   windy = true: Don't Play (2.0)
|   windy = false: Play (3.0)

The C4.5 software produces the following report:

Evaluation on training data (14 items):

     Before Pruning           After Pruning
    ----------------   ---------------------------
    Size      Errors   Size      Errors   Estimate

       8    0( 0.0%)      8    0( 0.0%)    (38.5%)   <<

The report says that the generated decision tree has a total of 8 nodes
(how many of them are leaves?) and that the tree does not commit any
errors when used with the training data. The pruning process does not
remove any node from the tree, so the error on the training data is
still zero.

However, since the training set does not cover all the possible
combinations of features, we cannot guarantee that the tree will have
the same performance when used with unknown inputs. The value 38.5%
is a (pessimistic) estimate of the error of the tree when used with
new data. That estimate usually depends on the amount of data used
for training, the number of features and the variability of the
training set.

========================================
PART B. Feature selection
========================================

There are 4 features in our example:
Wind, Outlook, Humidity and Temperature.

Based on the structure of the tree we can say that Outlook is the most
important feature and Temperature is the least important. Actually,
the tree does not use Temperature at all (!).

========================================
PART C.
========================================

There are 5 rules because the tree has 5 leaves:

IF Outlook = overcast                        THEN Play
IF (Outlook = sunny) AND (Humidity <= 75)    THEN Play
IF (Outlook = sunny) AND (Humidity > 75)     THEN Don't Play
IF (Outlook = rain)  AND (Windy = true)      THEN Don't Play
IF (Outlook = rain)  AND (Windy = false)     THEN Play
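The five rules can be read straight off the tree. As a small illustration,
here is a minimal Python sketch of the same rules written as a classifier
function. It is my own illustration, not C4.5 output; the dictionary keys
'outlook', 'humidity' and 'windy' are assumed names chosen to match the
features of golf.data.

    # Minimal sketch of the five rules extracted from the tree (Part C).
    def classify(example):
        """Return 'Play' or "Don't Play" for one example, following the tree."""
        if example['outlook'] == 'overcast':
            return 'Play'
        if example['outlook'] == 'sunny':
            return 'Play' if example['humidity'] <= 75 else "Don't Play"
        if example['outlook'] == 'rain':
            return "Don't Play" if example['windy'] else 'Play'
        raise ValueError("unknown outlook value: %r" % example['outlook'])

    # Example usage:
    # classify({'outlook': 'sunny', 'humidity': 70, 'windy': False})  ->  'Play'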
========================================
PART D. Computing Entropy
========================================

We have to compute the average entropy of each feature to measure the
effectiveness of that attribute as a classifier.

From here on we will use the following notation:

S : the training set (golf.data)
s : a training example (a row in the data)
W : wind feature
H : humidity feature
T : temperature feature
O : outlook feature
s(W), s(H), s(T), s(O) : the value of feature W, H, T or O in the example s
+ : Play
- : Don't Play

------
 Wind
------

The feature wind can take one of the values from {false, true}.
We divide the training set S into two subsets:

Wf = { s in S | s(W) = false }
Wt = { s in S | s(W) = true }

The subset Wf has 8 examples [2-,6+]
The subset Wt has 6 examples [3-,3+]

Now we compute the entropy for each subset:

E(Wf) = -(2/8)*log2(2/8) - (6/8)*log2(6/8) = 0.81128
E(Wt) = -(3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.0

And finally we can compute the average entropy for the feature "Wind":

avgEntropy(W) = (8/14)*E(Wf) + (6/14)*E(Wt)
              = (8/14)*(0.81128) + (6/14)*(1.0)
              = 0.89216

----------
 Outlook
----------

The feature outlook can take one of the values from {sunny, overcast, rain}.
Then we divide the training set S into three subsets:

Os = { s in S | s(O) = sunny }
Oo = { s in S | s(O) = overcast }
Or = { s in S | s(O) = rain }

The subset Os has 5 examples [3-,2+]
The subset Oo has 4 examples [0-,4+]
The subset Or has 5 examples [2-,3+]

Now we compute the entropy for each subset (taking 0*log2(0) = 0):

E(Os) = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.97095
E(Oo) =        0         - (4/4)*log2(4/4) = 0
E(Or) = -(2/5)*log2(2/5) - (3/5)*log2(3/5) = 0.97095

And finally we can compute the average entropy for the feature "Outlook":

avgEntropy(O) = (5/14)*E(Os) + (4/14)*E(Oo) + (5/14)*E(Or)
              = (5/14)*0.97095 + (4/14)*0 + (5/14)*0.97095
              = 0.69354

The remaining two features, Humidity and Temperature, are
continuous-valued, so computing the average entropy is a bit trickier.
We are going to find the threshold that divides our data with minimum
average entropy using the following procedure (a small code sketch of
the computation is given right after this list):

(1) Sort the training data by the feature F. For example:

     F   ...  class
    12   ...    +
    15   ...    -
    16   ...    +
    18   ...    +
    25   ...    -

(2) Each time the class labels of two consecutive examples si, sj are
    different, define a candidate threshold as the average
    (si(F) + sj(F)) / 2.

    For the previous example the candidate thresholds would be:

    F > (12+15)/2
    F > (15+16)/2
    F > (18+25)/2

(3) Choose the threshold that produces the lowest average entropy.
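Before applying this procedure to Humidity and Temperature by hand, here
is a small Python sketch of both computations: the average entropy of a
discrete feature (as done above for Wind and Outlook) and the threshold
search of steps (1)-(3). It is only a cross-check; the function names and
the representation of the data as (value, label) pairs are my own choices,
not anything produced by C4.5.

    from math import log2

    def entropy(labels):
        """Entropy of a list of class labels, e.g. ['+', '+', '-']."""
        n = len(labels)
        if n == 0:
            return 0.0
        return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                    for c in set(labels))

    def avg_entropy_discrete(pairs):
        """pairs: list of (feature_value, class_label).
        Weighted average entropy of the subsets induced by the feature."""
        n = len(pairs)
        return sum((len([c for v, c in pairs if v == value]) / n) *
                   entropy([c for v, c in pairs if v == value])
                   for value in set(v for v, _ in pairs))

    def best_threshold(pairs):
        """pairs: list of (numeric_value, class_label).
        Try a candidate threshold between every pair of consecutive examples
        with different labels (steps 1-2) and return the (threshold, avg_entropy)
        pair with the lowest average entropy (step 3)."""
        pairs = sorted(pairs)
        best_t, best_e = None, float('inf')
        for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
            if c1 == c2:
                continue
            t = (v1 + v2) / 2
            low = [c for v, c in pairs if v < t]
            high = [c for v, c in pairs if v >= t]
            avg = (len(low) * entropy(low) + len(high) * entropy(high)) / len(pairs)
            if avg < best_e:
                best_t, best_e = t, avg
        return best_t, best_e

    # Cross-check with the Wind computation above ([2-,6+] and [3-,3+]):
    # wind = ([('false', '+')] * 6 + [('false', '-')] * 2
    #         + [('true', '+')] * 3 + [('true', '-')] * 3)
    # avg_entropy_discrete(wind)  ->  approximately 0.89216
    # best_threshold(...) on the humidity column should reproduce the
    # h = 82.5 split found below.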
----------
 Humidity
----------

The feature Humidity is continuous-valued. Using each candidate
threshold h we can generate two subsets:

H0 = { s in S | s(H) <  h }
H1 = { s in S | s(H) >= h }

For the training set S the candidate thresholds are:

- h = 67.5
  H1 is composed of [5-,8+]  --->  E(H1) = 0.96124
  H0 is composed of [0-,1+]  --->  E(H0) = 0
  avgE(H) = 0.89258

- h = 72.5
  H1 is composed of [4-,6+]  --->  E(H1) = 0.97095
  H0 is composed of [1-,3+]  --->  E(H0) = 0.81128
  avgE(H) = 0.92533

- h = 79
  H1 is composed of [4-,4+]  --->  E(H1) = 1.0
  H0 is composed of [1-,5+]  --->  E(H0) = 0.65002
  avgE(H) = 0.85001

- h = 82.5
  H1 is composed of [3-,2+]  --->  E(H1) = 0.97095
  H0 is composed of [2-,7+]  --->  E(H0) = 0.76420
  avgE(H) = 0.83804   <----------------------- This is the best!

- h = 87.5
  H1 is composed of [2-,2+]  --->  E(H1) = 1.0
  H0 is composed of [3-,7+]  --->  E(H0) = 0.88129
  avgE(H) = 0.91521

- h = 92.5
  H1 is composed of [1-,1+]  --->  E(H1) = 1.0
  H0 is composed of [4-,8+]  --->  E(H0) = 0.91830
  avgE(H) = 0.92997

- h = 95.5
  H1 is composed of [0-,1+]  --->  E(H1) = 0.0
  H0 is composed of [5-,8+]  --->  E(H0) = 0.96124
  avgE(H) = 0.89258

Then we choose h = 82.5 as the threshold for H, which produces
avgE(H) = 0.83804.

-------------
 Temperature
-------------

The feature Temperature is also continuous-valued, so we use the same
procedure as for Humidity. Using each candidate threshold t we can
generate two subsets:

T0 = { s in S | s(T) <  t }
T1 = { s in S | s(T) >= t }

For the training set S the candidate thresholds are:

- t = 64.5
  T1 is composed of [5-,8+]  -->  E(T1) = 0.96124
  T0 is composed of [0-,1+]  -->  E(T0) = 0
  avgE(T) = 0.89258

- t = 66.5
  T1 is composed of [4-,8+]  -->  E(T1) = 0.91830
  T0 is composed of [1-,1+]  -->  E(T0) = 1.0
  avgE(T) = 0.92997

- t = 70.5
  T1 is composed of [4-,5+]  -->  E(T1) = 0.99108
  T0 is composed of [1-,4+]  -->  E(T0) = 0.72193
  avgE(T) = 0.89495

- t = 71.5
  T1 is composed of [3-,5+]  -->  E(T1) = 0.95443
  T0 is composed of [2-,4+]  -->  E(T0) = 0.91830
  avgE(T) = 0.93895

- t = 73.5
  T1 is composed of [2-,4+]  -->  E(T1) = 0.91830
  T0 is composed of [3-,5+]  -->  E(T0) = 0.95443
  avgE(T) = 0.93895

- t = 77.5
  T1 is composed of [2-,2+]  -->  E(T1) = 1.0
  T0 is composed of [3-,7+]  -->  E(T0) = 0.88129
  avgE(T) = 0.91521

- t = 80.5
  T1 is composed of [1-,2+]  -->  E(T1) = 0.91830
  T0 is composed of [4-,7+]  -->  E(T0) = 0.94566
  avgE(T) = 0.93980

- t = 84
  T1 is composed of [1-,0+]  -->  E(T1) = 0
  T0 is composed of [4-,9+]  -->  E(T0) = 0.89049
  avgE(T) = 0.82689   <---------------------- This is the best!

For Temperature the best threshold is t = 84, which produces an average
entropy avgE(T) = 0.82689.

Observe that with this threshold the subset T1 contains only one example
(and it is a "wrong" example!). Why does this happen? Is it correct to
choose t = 84 as the best threshold?

Finally we can compare the average entropies:

avgE(O) = 0.69354
avgE(W) = 0.89216
avgE(H) = 0.83804
avgE(T) = 0.82689

and choose Outlook as the most important feature.

(Observe that the average entropy of T is smaller than the average
entropy of W and H. Why does C4.5 still choose T as the least
important feature?)

========================================
PART E. Irrelevant attributes
========================================

You were asked to add an irrelevant attribute to the existing training
and testing sets. Before doing so, notice that an irrelevant feature is
a feature that has a high average entropy: the higher the average
entropy, the more irrelevant the feature is. The average entropy of the
new feature should be high relative to each of the other attributes and
for each of the partitions of the training set that the feature can
generate. Obviously, finding such a feature is not an easy task.

However, note that for small training sets a small variation in the
values of the "irrelevant" feature of some training examples can change
its apparent relevance dramatically. This behaviour can be even worse
when the new attribute can take many different values. (Why? How does
it affect the average entropy? The small experiment sketched below
illustrates the effect.)
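The following sketch is entirely my own illustration, not part of
golf.data or the assignment. It builds an "irrelevant" attribute at
random for 14 examples with the same class mix as our training set
(9 Play, 5 Don't Play), first with only two possible values and then
with fourteen distinct values (an ID-like attribute). With many
distinct values every subset contains a single example, so every subset
entropy is 0 and the average entropy collapses to 0, which makes the
attribute look like a perfect classifier even though it carries no
information.

    import random
    from math import log2

    def entropy(labels):
        """Entropy of a list of class labels (repeated here so the sketch is self-contained)."""
        n = len(labels)
        if n == 0:
            return 0.0
        return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                    for c in set(labels))

    def avg_entropy(pairs):
        """pairs: list of (attribute_value, class_label)."""
        n = len(pairs)
        return sum((len([c for v, c in pairs if v == value]) / n) *
                   entropy([c for v, c in pairs if v == value])
                   for value in set(v for v, _ in pairs))

    random.seed(1)                      # arbitrary seed, only for repeatability
    labels = ['+'] * 9 + ['-'] * 5      # same class mix as golf.data

    # Irrelevant attribute with only two possible values: its average entropy
    # typically stays close to the entropy of the whole set (about 0.94).
    two_valued = [(random.choice(['a', 'b']), c) for c in labels]

    # Irrelevant attribute with 14 distinct values: every subset holds a single
    # example, so its average entropy is exactly 0.
    many_valued = [(i, c) for i, c in enumerate(labels)]

    print(avg_entropy(two_valued))
    print(avg_entropy(many_valued))     # 0.0 -> looks "perfectly relevant"

This also suggests why a small change in the values of such an
attribute can swing its average entropy so much: when each value covers
only a few examples, moving a single example from one subset to another
changes the entropy of those subsets considerably.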