CSPP 56553 - Artificial Intelligence
Winter 2004
Homework #4: Due February 18, 2004
Goals
Through this assignment you will:
- Explore the use of decision tree learners.
- Understand the use of the information-theoretic entropy measure
for selecting feature tests.
- Relate decision trees to conditional rules.
Background
This homework continues our exploration of general techniques that
use the regularities in observed data to acquire functions from
input vectors to output values or classifications. In other words,
we will explore systems that can automatically learn from examples.
In this assignment we will examine decision tree classifiers.
Problem 1
Playing Golf
In this assignment, we apply a public-domain decision tree
classifier, c4.5, to the classification task. c4.5 is
available on departmental Debian 2 Linux machines (e.g. those
in the 4th floor labs). c4.5 fairly
directly implements the decision tree building and classification
process described in class. c4.5 assumes that all the information
for training and testing the decision tree classifier appears in
three user-defined files with a common base name. The three
files for this example are:
- golf.names: The first line of the file defines the output labels for
the classifier. The remaining lines specify the names and types of
the feature values, where the names are arbitrary, hopefully mnemonic,
strings and the types are either discrete, continuous, or ignore. The
last type instructs the learning algorithm to ignore the corresponding
feature in training and testing. (Illustrative snippets of the names
and data file formats appear just after this list.)
- golf.data: This file contains the training examples for the decision
tree. Each line of the file gives the feature vector for one
example, with the feature values separated by commas in the same
order as in the names file, followed by the classification label as
the last entry.
- golf.test: This file contains the held-out test examples for evaluation
of the decision tree on unseen data. It has the same format as the
training example file.
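To make the file formats concrete, here is a sketch of what the two
training files might contain. The feature names and types below follow
the standard golf example distributed with c4.5, but treat the specific
lines as illustrative rather than a verbatim copy of the actual files:

    golf.names:

        Play, Don't Play.

        outlook:     sunny, overcast, rain.
        temperature: continuous.
        humidity:    continuous.
        windy:       true, false.

    golf.data (one example per line, class label last):

        sunny, 85, 85, false, Don't Play
        overcast, 83, 78, false, Play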
The golf.data and golf.names files can be found here.
You will create your own golf.test. Training and test files for
additional example tasks can be found in /opt/c4.5/c4.5-r8/Data/.
The program is invoked as follows:
c4.5 -f golf
OR
c4.5 -u -f golf
where the -u flag indicates testing on the unseen test file and -f identifies
the root filename for the experiment.
Our task is the simple one of learning, from examples, what weather
conditions are suitable for playing an outdoor sport such as golf.
Part A: Training and Testing
Use c4.5 as described above to build the classifier. What is the
reported error rate on the training data? c4.5 also reports an
"estimated" error rate on unseen data. Why is it different from the
value on the training data? Why do you think it is so high for the
current training set?
After examining the training data and using your own sense of
suitable weather, create a small (~10 samples) test data file,
golf.test. How does the classifier fare on your test data? Please
submit your test file as well as the classification output.
Part B: Feature Selection
Based on the structure of the tree, which features are most
important? Which are least important?
Part C: Trees to Rules
We have claimed that one of the virtues of decision trees is that
it is easy to read a set of if-then rules directly off the structure
of the tree. Convert the decision tree generated by c4.5 to a set of
if-then rules, where the elements of the antecedent (if) are the
feature tests along a root-to-leaf path, and the consequent (then) is
the classification at that leaf.
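For example, a root-to-leaf path that tested outlook and then
humidity (the values here are hypothetical; your tree may differ)
would yield a rule such as:

    if outlook = sunny and humidity > 75 then Don't Play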
Part D: Computing Entropy
To check your understanding, use the average entropy measure introduced
in class to identify the first feature test for the tree yourself,
based on the selecting the test that yields the lowest average disorder.
If you do this by hand, please show ALL your work. You may alternatively
write a short program to do this. If you code the entropy computation,
please submit both your code and the output.
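As a reminder, the average disorder of a test is the size-weighted sum
of the branch entropies: sum over branches b of (|b|/|S|) * H(b), where
H(b) = -sum_i p_i log2 p_i over the class proportions p_i within the
branch. A minimal Python sketch of the computation follows. The
hard-coded rows are illustrative stand-ins for golf.data, and the
sketch handles only discrete features, so the continuous temperature
and humidity columns would first need to be thresholded, as c4.5 does:

    import math
    from collections import Counter, defaultdict

    def entropy(labels):
        # Shannon entropy (in bits) of a sequence of class labels.
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def average_disorder(examples, feature_index):
        # Size-weighted average entropy after splitting on one
        # discrete feature.
        branches = defaultdict(list)
        for features, label in examples:
            branches[features[feature_index]].append(label)
        total = sum(len(labels) for labels in branches.values())
        return sum(len(labels) / total * entropy(labels)
                   for labels in branches.values())

    # Illustrative rows only -- replace with the full training set.
    examples = [
        (("sunny", "high", "false"), "Don't Play"),
        (("sunny", "high", "true"), "Don't Play"),
        (("overcast", "high", "false"), "Play"),
        (("rain", "normal", "false"), "Play"),
        (("rain", "normal", "true"), "Don't Play"),
    ]

    for i, name in enumerate(["outlook", "humidity", "windy"]):
        print(name, round(average_disorder(examples, i), 3))

The feature whose split yields the lowest printed value is the one
the greedy tree builder would test first.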
Part E: Irrelevant Attributes
It is claimed that decision trees are relatively robust to irrelevant
attributes. Test this claim by adding an irrelevant feature, say
the day of the week, with randomly chosen values, to the existing
training and test samples. This will require that you modify the
golf.names, golf.data, and golf.test files appropriately; one way to
script the data-file edits is sketched below. Rerun the classifier.
Are the claims about feature selection supported?
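Here is a minimal Python sketch of the data-file edit, under the
assumption that you write the augmented copies to the placeholder
names golf_day.data and golf_day.test. The script inserts the new
feature just before the class label, so the matching golf_day.names
would need a line such as "day: mon, tue, wed, thu, fri, sat, sun."
added after the existing feature lines:

    import random

    DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

    def add_random_day(in_path, out_path):
        # Insert a random day-of-week value just before the label.
        with open(in_path) as src, open(out_path, "w") as dst:
            for line in src:
                fields = [f.strip() for f in line.strip().split(",")]
                if len(fields) < 2:  # skip blank or malformed lines
                    continue
                fields.insert(-1, random.choice(DAYS))
                dst.write(", ".join(fields) + "\n")

    add_random_day("golf.data", "golf_day.data")
    add_random_day("golf.test", "golf_day.test")

With the three golf_day files in place, rerun the experiment with
c4.5 -u -f golf_day and compare the resulting tree to the original.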