Inappropriate use of Chi Square:

 Consider the following question, which was given in a recent final exam in introductory statistics:

 

Professor Snarf gave a surprise test composed of 20 true-false questions (scored 1 if correct, 0 if incorrect) to his small seminar. The following results were found:

Student    Score
1          12
2          16
3          12
4          12

Looking at the results, the professor was neither pleased nor surprised with the overall performance of his class and snorted that a chimpanzee could have done just as well. Do everything necessary to test Professor Snarf's assertion.

Most students who attempted to answer the question answered it incorrectly. The problem states that four students took a 20-question true-false test and that each got a score. While it is true that each score represents the sum of a series of 0s and 1s (and hence represents repeated measures taken on the same subject), the only score we have to work with is the total correct of each student. That is, we have only a single score for each student.

The problem involves four scores, and they do not appear to be subdivided in any manner, so it seems that there is a single group of scores. Thus, the appropriate test is the z, t, or chi-square test. To determine which of these three tests is appropriate, we must look at those things that make the tests different from one another. The data are not independent frequencies, so chi square can be eliminated, leaving only z and t.

The difference between the two tests is immediately apparent. For z, one needs the population standard deviation; for t, one needs the sample standard deviation. We have no way of determining the population standard deviation from the information given in the problem, but we can compute the sample standard deviation from the four scores. Dividing that standard deviation by the square root of N, which is 2, gives the standard error of the mean.

To work through the problem: the sample mean is 13, the sample standard deviation is 2, and the standard error of the sample mean is 2 / 2 = 1.
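The three quantities above can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative and are not part of the original problem.

```python
import math
import statistics

# Number correct (out of 20) for each of the four students.
scores = [12, 16, 12, 12]

n = len(scores)
mean = statistics.mean(scores)

# statistics.stdev uses the N - 1 denominator (sample standard deviation).
sd = statistics.stdev(scores)

# Standard error of the mean: sample SD divided by the square root of N.
se = sd / math.sqrt(n)

print(f"N = {n}, mean = {mean}, SD = {sd}, SE = {se}")
```

Note that the N - 1 denominator matters here: dividing the sum of squared deviations (12) by N instead of N - 1 would give a different standard deviation.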

We have everything needed to compute t except the value of mu. To determine mu we must look at the problem again. Professor Snarf said that a chimpanzee could do as well on the examination as the students. From this, we may (reasonably, we hope) assume that the professor was not saying anything about chimpanzees, but rather that he was saying something about his students - namely, that taken as a sample representative of a population, they were only guessing. If they were guessing, they would have correctly answered about half of the true-false questions. Therefore, the professor was saying that the mean of the four examination scores did not differ from the mean that would be obtained by chance. That mean would be mu = 10, since this is half of the 20 true-false questions on the test, and it can be assumed that the chance is one of two that the student can guess each answer correctly.
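The value mu = 10 can also be obtained from the binomial distribution, assuming each of the 20 answers is an independent guess with probability 0.5 of being correct. The sketch below sums k times the probability of exactly k correct answers; the function name `binomial_pmf` is a hypothetical helper, not something from the original problem.

```python
from math import comb

N_ITEMS = 20    # true-false questions on the test
P_GUESS = 0.5   # assumed chance of guessing any one item correctly


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Expected number correct under pure guessing: sum of k * P(k correct).
mu = sum(k * binomial_pmf(k, N_ITEMS, P_GUESS) for k in range(N_ITEMS + 1))

print(f"chance mean mu = {mu}")
```

The result agrees with the shortcut N_ITEMS * P_GUESS, which is the reasoning used in the paragraph above.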

 

t = (M - mu) / (standard error of the mean) = (13 - 10) / 1 = 3.00

The number of df = N - 1 = 3, and a t of 3.18 or larger (two-tailed) is needed to reject, at the 0.05 level, the hypothesis that the observed mean comes from a population that has a true mean equal to mu. Since the obtained t of 3.00 falls short of 3.18, the hypothesis cannot be rejected. Thus, Professor Snarf's assertion that the students were guessing may be true; the data do not provide sufficient evidence that the students, as a randomly selected sample from some population, scored better than chance.
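A minimal sketch of the one-sample t test follows. The critical value 3.182 is taken from a standard t table for df = 3 rather than computed. The helper `t_cdf_df3` is a hypothetical name for the closed-form t distribution function that holds only when df = 3, and it is included just to show roughly how close the result is to the 0.05 cutoff.

```python
import math
import statistics

scores = [12, 16, 12, 12]
mu = 10          # mean expected under guessing
T_CRIT = 3.182   # two-tailed 0.05 critical value for df = 3 (from a t table)

n = len(scores)
df = n - 1
se = statistics.stdev(scores) / math.sqrt(n)
t = (statistics.mean(scores) - mu) / se


def t_cdf_df3(t_value):
    """Cumulative t distribution, valid for df = 3 only."""
    x = t_value / math.sqrt(3)
    return 0.5 + (x / (1 + x * x) + math.atan(x)) / math.pi


p_two_tailed = 2 * (1 - t_cdf_df3(abs(t)))
reject = abs(t) >= T_CRIT

print(f"t = {t:.2f}, df = {df}, two-tailed p = {p_two_tailed:.3f}")
print("reject H0" if reject else "fail to reject H0")
```

For other sample sizes the closed form does not apply, and a statistics package or a t table would be needed instead.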

It may be helpful to look at the way most of the students taking the final statistics exam attempted to solve the problem. Most of them tried to use chi square. Their apparent reasoning was that there were four groups of right-wrong, or 1-0, responses. This would yield an experimental design consisting of four rows and two columns such as:

Student    Correct    Incorrect
1          12         8
2          16         4
3          12         8
4          12         8

 

The expected values were then computed for each cell by multiplying the appropriate row sum by the appropriate column sum and dividing by the total, which was 80. Every row sum is 20 and the column sums are 52 and 28, so each row has expected values of 13 correct and 7 incorrect. This yields a chi square of about 2.64, which with 3 df is not significant (the 0.05 critical value is 7.81). Thus, through an incorrect technique, the proper conclusion was accidentally reached.
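For readers who want to retrace the students' computation, the sketch below applies the row-sum times column-sum rule to the 4 x 2 table. It reproduces the inappropriate analysis for illustration only. The function name is hypothetical, and the critical value 7.815 comes from a standard chi-square table for df = 3.

```python
def chi_square_independence(table):
    """Pearson chi square for a rows x columns table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count: row sum times column sum, divided by the total.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df


# Rows are students; columns are (correct, incorrect).
table = [[12, 8], [16, 4], [12, 8], [12, 8]]
CHI2_CRIT = 7.815  # 0.05 critical value for df = 3 (from a chi-square table)

chi2, df = chi_square_independence(table)
print(f"chi square = {chi2:.2f}, df = {df}, significant = {chi2 >= CHI2_CRIT}")
```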

The use of chi square here is improper for a number of reasons. The first is that, although the data are frequency data (each score represents the frequency of correct responses), they are not independent, since each score represents several 0 or 1 scores taken from the same person. That is, student 2 got 16 correct of 20, and we can perhaps assume that he had studied. The number of correct responses he got on the first ten items probably is correlated with the number he got correct on the second ten items: that is, if he knew enough to get several items correct on the first half of the test, he probably knew enough to get several items correct on the second half of the test. Chi square should not be used unless the data are independent. If we decide that the data taken from each student are correlated (not independent), we cannot use chi square appropriately.

On the other hand, even if we could assume the data were independent, chi square would still be inappropriate, because it would not answer the question originally posed: Does the mean of the correct responses differ from the mean of 10 expected by chance? It can easily be shown that the chi square obtained can be increased or decreased by changing the numbers of correct and incorrect responses while leaving the row and column totals constant. Suppose the scores for the four students were 10, 18, 8, and 16 correct. The mean is still 13, but the chi square would equal about 14.95, which with df = 3 is highly significant. Since the hypothesis deals with the mean of the four scores, and since the value of chi square can vary while the mean remains constant, it is apparent that chi square does not test the hypothesis. Chi square in this type of problem tests the interaction between the rows and columns. Since the original hypothesis implies nothing about this interaction, chi square is clearly inappropriate.
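The point that chi square can change while the mean stays fixed can be illustrated with a short comparison of the two sets of scores. This is a sketch only; `chi_square_from_scores` is a hypothetical helper that builds the correct/incorrect table from the scores and applies the usual row-by-column expected values.

```python
import statistics

N_ITEMS = 20  # questions on the test


def chi_square_from_scores(scores, n_items=N_ITEMS):
    """Pearson chi square for the students x (correct, incorrect) table."""
    table = [[s, n_items - s] for s in scores]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


original = [12, 16, 12, 12]
altered = [10, 18, 8, 16]   # same total correct, so same mean

mean_original = statistics.mean(original)
mean_altered = statistics.mean(altered)
chi2_original = chi_square_from_scores(original)
chi2_altered = chi_square_from_scores(altered)

print(f"means: {mean_original} vs {mean_altered}")
print(f"chi squares: {chi2_original:.2f} vs {chi2_altered:.2f}")
```

Both sets of scores have the same mean, yet the two chi squares fall on opposite sides of the 0.05 critical value for df = 3, which is why the statistic cannot be a test of the mean.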

In summary, a t test is the appropriate method to test Professor Snarf's hypothesis that the performance of his class on the true-false test was not different from chance. Most students inappropriately used chi square to solve the problem. The apparent reason for this was that they erroneously assumed that the score obtained for a student represented the sum of several independent correct responses. When several scores are obtained from the same individual, it is likely that a correlation exists among these scores, that is, that the scores are not independent. Chi square, then, is inappropriate because its use assumes independent frequency measures.