CSPP 56553 - Artificial Intelligence
Winter 2004
Homework #4: Due February 18, 2004
Goals
Through this assignment you will:
- Explore the use of decision tree learners.
- Understand the use of the information-theoretic entropy measure
for selecting feature tests.
- Relate decision trees to conditional rules.
Background
This homework continues our exploration of general techniques that
use the regularities in observed data to acquire functions from
input vectors to output values or classifications. In other words,
we will explore systems that can automatically learn from examples.
In this assignment we will examine decision tree classifiers.
Problem 1
Playing Golf
In this assignment, we apply a public-domain decision tree
classifier, c4.5, to the classification task. c4.5 is
available on departmental Debian 2 Linux machines (e.g. those
in the 4th floor labs). c4.5 fairly
directly implements the decision tree building and classification
process described in class. c4.5 assumes that all the information
for training and testing the decision tree classifier appears in
three user-defined files with a common base name. The three
files for this example are:
- golf.names: The first line of the file defines the output labels for
the classifier. The remaining lines specify the names and types of
the feature values, where the names are arbitrary, hopefully mnemonic,
strings and the types are either discrete, continuous, or ignore. The
last type instructs the learning algorithm to ignore the corresponding
feature in training and testing. (Illustrative snippets of the names
and data file formats appear just after this list.)
- golf.data: This file contains the training examples for the decision
tree. Each line of the file gives the feature vector for one
example, with the feature values separated by commas in the same
order as in the names file, followed by the classification label as
the last entry.
- golf.test: This file contains the held-out test examples for evaluation
of the decision tree on unseen data. It has the same format as the
training example file.
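To make the file formats concrete, here is a sketch of what the two
training files might contain. The feature names and types below follow
the standard golf example distributed with c4.5, but treat the specific
lines as illustrative rather than a verbatim copy of the actual files:

    golf.names:

        Play, Don't Play.

        outlook:     sunny, overcast, rain.
        temperature: continuous.
        humidity:    continuous.
        windy:       true, false.

    golf.data (one example per line, class label last):

        sunny, 85, 85, false, Don't Play
        overcast, 83, 78, false, Play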
The golf.data and golf.names files can be found here.
You will create your own golf.test. Training and test files for
additional example tasks can be found in /opt/c4.5/c4.5-r8/Data/.
The program is invoked as follows:
c4.5 -f golf
OR
c4.5 -u -f golf
where the -u flag indicates testing on the unseen test file and -f identifies
the root filename for the experiment.
Our task is the simple one of learning, from examples, what weather
conditions are suitable for playing an outdoor sport such as golf.
Part A: Training and Testing
Use c4.5 as described above to build the classifier. What is the
reported error rate on the training data? c4.5 also reports an
"estimated" error rate on unseen data. Why is it different from the
value on the training data? Why do you think it is so high for the
current training set?
After examining the training data and using your own sense of
suitable weather, create a small (~10 samples) test data file,
golf.test. How does the classifier fare on your test data? Please
submit your test file as well as the classification output.
Part B: Feature Selection
Based on the structure of the tree, which features are most
important? Which are least important?
Part C: Trees to Rules
We have claimed that one of the virtues of decision trees is that
it is easy to read a set of if-then rules directly off the structure
of the tree. Convert the decision tree generated by c4.5 to a set of
if-then rules, where the elements of the antecedent (if) are the
feature tests along a root-to-leaf path, and the consequent (then) is
the classification at that leaf.
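For example, a root-to-leaf path that tested outlook and then
humidity (the values here are hypothetical; your tree may differ)
would yield a rule such as:

    if outlook = sunny and humidity > 75 then Don't Play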
Part D: Computing Entropy
To check your understanding, use the average entropy measure introduced
in class to identify the first feature test for the tree yourself,
based on the selecting the test that yields the lowest average disorder.
If you do this by hand, please show ALL your work. You may alternatively
write a short program to do this. If you code the entropy computation,
please submit both your code and the output.
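As a reminder, the average disorder of a test is the size-weighted sum
of the branch entropies: sum over branches b of (|b|/|S|) * H(b), where
H(b) = -sum_i p_i log2 p_i over the class proportions p_i within the
branch. A minimal Python sketch of the computation follows. The
hard-coded rows are illustrative stand-ins for golf.data, and the
sketch handles only discrete features, so the continuous temperature
and humidity columns would first need to be thresholded, as c4.5 does:

    import math
    from collections import Counter, defaultdict

    def entropy(labels):
        # Shannon entropy (in bits) of a sequence of class labels.
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def average_disorder(examples, feature_index):
        # Size-weighted average entropy after splitting on one
        # discrete feature.
        branches = defaultdict(list)
        for features, label in examples:
            branches[features[feature_index]].append(label)
        total = sum(len(labels) for labels in branches.values())
        return sum(len(labels) / total * entropy(labels)
                   for labels in branches.values())

    # Illustrative rows only -- replace with the full training set.
    examples = [
        (("sunny", "high", "false"), "Don't Play"),
        (("sunny", "high", "true"), "Don't Play"),
        (("overcast", "high", "false"), "Play"),
        (("rain", "normal", "false"), "Play"),
        (("rain", "normal", "true"), "Don't Play"),
    ]

    for i, name in enumerate(["outlook", "humidity", "windy"]):
        print(name, round(average_disorder(examples, i), 3))

The feature whose split yields the lowest printed value is the one
the greedy tree builder would test first.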
Part E: Irrelevant Attributes
It is claimed that decision trees are relatively robust to irrelevant
attributes. Test this claim by adding an irrelevant feature, say
the day of the week, with randomly chosen values, to the existing
training and test samples. This will require that you modify the
golf.names, golf.data, and golf.test files appropriately; one way to
script the data-file edits is sketched below. Rerun the classifier.
Are the claims about feature selection supported?
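Here is a minimal Python sketch of the data-file edit, under the
assumption that you write the augmented copies to the placeholder
names golf_day.data and golf_day.test. The script inserts the new
feature just before the class label, so the matching golf_day.names
would need a line such as "day: mon, tue, wed, thu, fri, sat, sun."
added after the existing feature lines:

    import random

    DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

    def add_random_day(in_path, out_path):
        # Insert a random day-of-week value just before the label.
        with open(in_path) as src, open(out_path, "w") as dst:
            for line in src:
                fields = [f.strip() for f in line.strip().split(",")]
                if len(fields) < 2:  # skip blank or malformed lines
                    continue
                fields.insert(-1, random.choice(DAYS))
                dst.write(", ".join(fields) + "\n")

    add_random_day("golf.data", "golf_day.data")
    add_random_day("golf.test", "golf_day.test")

With the three golf_day files in place, rerun the experiment with
c4.5 -u -f golf_day and compare the resulting tree to the original.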