
CSPP 56553 - Artificial Intelligence
Winter 2004
Homework #4: Due February 18, 2004


Goals

Through this assignment you will:

Background

This homework continues our exploration of general techniques that can use the regularities in observed data to acquire functions from inputs vectors to output values or classifications. In other words, we will explore systems that can automatically learn from examples. In this assignment we will examine decision tree classifiers.

Problem 1

Playing Golf

In this assignment, we apply a public domain decision tree classifier, c4.5, to a classification task. c4.5 is available on departmental Debian 2 linux machines (e.g. those in the 4th floor labs). c4.5 fairly directly implements the decision tree building and classification process described in class. c4.5 assumes that all the information for training and testing the decision tree classifier appears in three user-defined files with a common base name. The three files for this example are:

- golf.names: the class labels, plus the name and possible values of each attribute
- golf.data: the training examples
- golf.test: the unseen test examples

The golf.data and golf.names files can be found here. You will create your own golf.test. Additional training and test example tasks can be found in /opt/c4.5/c4.5-r8/Data/. The program is invoked as follows:

c4.5 -f golf

OR

c4.5 -u -f golf

where the -u flag indicates testing on the unseen test file and -f identifies the root filename for the experiment. We consider the simple example of learning suitable weather conditions for playing an outdoor sport - such as golf - from examples.
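For orientation, a .names file in C4.5's format lists the classes on the first line, then one line per attribute giving its possible values (or the keyword continuous); the | character starts a comment, and .data lines hold one comma-separated example each with the class last. The sketch below shows the general shape only; the attribute names and example values are illustrative, so consult the actual golf.names and golf.data files for the real contents:

```
| golf.names: classes first, then one attribute per line
Play, Don't Play.                  | the class labels

outlook:     sunny, overcast, rain.
temperature: continuous.
humidity:    continuous.
windy:       true, false.

| golf.data: one example per line, class label last
sunny, 85, 85, false, Don't Play
overcast, 83, 78, false, Play
```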

Part A: Training and Testing

Use c4.5 as described above to build the classifier. What is the reported error rate on the training data? c4.5 also reports an "estimated" error on unseen data. Why is it different from the value on the training data? Why do you think it is so high for the current training set?

After examining the training data and using your own sense of suitable weather, create a small (~10 samples) test data file (golf.test). How does the classifier fare on your test data? Please submit your test file as well as the classification output.

Part B: Feature Selection

Based on the structure of the tree, what features are most important? Least important?

Part C: Trees to Rules

We have claimed that one of the virtues of decision trees is that it is easy to read a set of if-then rules directly from the structure of the tree. Convert the decision tree generated by c4.5 to a set of if-then rules, where the elements of the antecedent (if) are the feature tests, and the consequent (then) is the classification.
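To illustrate the target format, the sketch below encodes the rules for a purely hypothetical tree as Python conditionals; the particular tests shown are invented for illustration and are not the tree c4.5 will produce for you:

```python
def classify(outlook, humidity, windy):
    """If-then rules read off a hypothetical decision tree.

    Each rule's antecedent is the conjunction of feature tests along
    one root-to-leaf path; the consequent is the leaf's class label.
    (These particular tests are illustrative only.)
    """
    if outlook == "overcast":
        return "Play"
    if outlook == "sunny" and humidity > 75:
        return "Don't Play"
    if outlook == "sunny" and humidity <= 75:
        return "Play"
    if outlook == "rain" and windy:
        return "Don't Play"
    return "Play"  # remaining case: outlook == "rain" and not windy
```

Your written answer can use the same "if test1 and test2 then class" shape in plain English rather than code.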

Part D: Computing Entropy

To check your understanding, use the average entropy measure introduced in class to identify the first feature test for the tree yourself, by selecting the test that yields the lowest average disorder. If you do this by hand, please show ALL your work. You may alternatively write a short program to do this. If you code the entropy computation, please submit both your code and the output.
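If you choose the programming route, a minimal sketch of the computation looks like the following. It assumes each example is a dict mapping attribute names to values, with the class label under a "play" key; those representation choices are illustrative, not required:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def average_disorder(examples, feature, label_key="play"):
    """Weighted average entropy of the class labels after
    partitioning the examples on the given feature."""
    n = len(examples)
    groups = {}
    for ex in examples:
        groups.setdefault(ex[feature], []).append(ex[label_key])
    return sum(len(g) / n * entropy(g) for g in groups.values())
```

The best first test is then the feature with the smallest average_disorder over the full training set (ties broken arbitrarily).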

Part E: Irrelevant Attributes

It is claimed that decision trees are relatively robust to irrelevant attributes. Test this claim by adding an irrelevant feature - say the day of the week - randomly to the existing training and test samples. This will require that you modify the golf.names, golf.data, and golf.test files appropriately. Rerun the classifier. Are the claims about feature selection supported?
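A short helper can do the random insertion for you. The sketch below assumes C4.5's comma-separated line format with the class label as the last field; remember that golf.names must also gain a matching attribute line (e.g. "day: monday, tuesday, ... sunday."):

```python
import random

# Hypothetical irrelevant attribute: day of the week.
DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

def add_random_day(line, rng=random):
    """Insert a random day-of-week field just before the class label.

    Assumes a comma-separated C4.5 data line with the class last.
    """
    fields = [f.strip() for f in line.split(",")]
    fields.insert(-1, rng.choice(DAYS))
    return ", ".join(fields)

# Example on a single (illustrative) training line; in practice you
# would map this over every line of golf.data and golf.test.
print(add_random_day("sunny, 85, 85, false, Don't Play"))
```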