UW STUDENT RATINGS RESEARCH
Supplement to UW's December 4 Press Release
Colleges and universities rely on student ratings of instruction as the primary method for evaluating teaching effectiveness. University of Washington (UW) research revealing substantial problems in student ratings measures was reported by Anthony G. Greenwald and Gerald M. Gillmore in the December 1997 issue of the Journal of Educational Psychology, published by the American Psychological Association.
WHAT DID WE FIND? (A Disturbing Result)
It was no surprise for Greenwald and Gillmore to find that high-graded courses (ones in which students expected their highest grades) were also the courses to which students gave high ratings. This has also been reported by many others. It was surprising, however, to discover that the high-graded courses were also the ones for which students reported doing the least work.
As Greenwald and
Gillmore explain, this finding has disturbing implications regarding the
impact of student ratings
measures on the educational process.
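To make the pattern concrete, the sketch below computes the three pairwise correlations on made-up course-level numbers. The data are purely hypothetical, chosen only to mimic the direction of the relationships just described; they are not the UW data.

    import numpy as np

    # Hypothetical course-level averages, invented only to mimic the
    # direction of the reported pattern (not the actual UW data).
    expected_grade = np.array([3.8, 3.5, 3.3, 3.0, 2.8, 2.6])   # mean expected GPA
    course_rating  = np.array([4.5, 4.2, 3.9, 3.5, 3.1, 2.9])   # mean rating, 1-5 scale
    hours_per_week = np.array([4.0, 5.0, 6.5, 8.0, 9.5, 11.0])  # reported workload

    # Grades and ratings rise together; workload moves opposite to both.
    print(np.corrcoef(expected_grade, course_rating)[0, 1])    # strongly positive
    print(np.corrcoef(expected_grade, hours_per_week)[0, 1])   # strongly negative
    print(np.corrcoef(course_rating, hours_per_week)[0, 1])    # strongly negative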
Greenwald and Gillmore used novel measures and analyses in seeking to
understand why high
work demands and low course ratings go hand in hand with strict grading.
Their answer centers
on variations in professors' goals for their courses. On the one hand there
are professors whose
chief aim is for students to perform well; these professors are likely to set
the amount of course
material at a level that permits most students to achieve a high level of
mastery. On the other
hand there are professors who seek to maximize the content coverage in their
courses. Professors
of the first type can give relatively high grades and their students
(especially the brighter ones)
need not work very hard. Professors of the second type, by contrast, must
challenge students,
which means making it relatively difficult to get high grades -- if it is too
easy to get a high grade,
students may skip some assigned work and will not learn as much as the
professor hopes.
It is of course too simple to think that every professor is purely one of these two types. Nevertheless, the
UW data did reveal that professors vary considerably in the amount of work
they expect of
students. The professor's workload expectation, in turn, becomes a central
factor in shaping the
course's grading scheme -- the greater the goal for student work, the stricter
the grading must be
in order to ensure that students will do the work.
Data obtained recently at UW indicate that instructors resembling the
second type -- ones who
appear to subscribe to a 'no pain, no gain' theory -- are found especially in
math and science
courses. In many math and science courses, students report on their ratings surveys that they expect low grades and do a lot of work. Although
students may rate
these courses and instructors as 'good', ratings for these courses
nevertheless often fall well below
the average for the university as a whole.
WHAT DOES IT MEAN? (Some Undesirable Consequences)
When difficult courses get relatively low ratings, several undesirable things can happen:
1. New instructors with high standards may be discouraged.
New teachers may expect
students to work as hard as they themselves did as undergraduates. These new
instructors all too
soon discover that their students find the workload excessive and the grading
standards
unreasonable. The low ratings that are likely to result can be quite
discouraging.
2. Students may avoid math and science courses.
University-level math and science courses
have the reputation of being ones in which even good students can get low
grades. Is it surprising
that many students plan their undergraduate programs with few or no math and
science courses?
3. Higher education becomes education lite. When
instructors are discouraged from teaching
challenging courses and students gravitate toward less demanding, high-grading
courses, there
will necessarily be a reduction both in the number of demanding courses available and in the enrollments of those that survive. These trends can be described as a 'dumbing down' of higher education or as the evolution of higher education 'lite'. (The 'lite' label is borrowed from Mark Edmundson, writing in the September 1997 issue of Harper's Magazine, pp. 39-49.)
4. Grades creep up. Although grade inflation appears to
go hand in hand with the lite-ening of
higher education, grade inflation may not itself be a cause for concern. If
gradually increasing
grades went together with gradually increasing educational content, the upward
movement of
grades might be taken as a positive sign. However, the actuality may be just the
reverse -- gradually
increasing grades appear to be associated with gradually decreasing
educational content.
WHAT CAN BE DONE? (Repairing the Student Ratings System)
Despite having some problems, student ratings have two very attractive
features. First, they are
easy to administer. Second, they provide a simple numerical index. (No
matter that the index is
questionable as a measure of quality of instruction.) With these two very
desirable properties, it
makes more sense to repair the student ratings system than to abandon it. Two types of changes will improve the use of ratings. First, ratings should be corrected by statistical adjustment. Second, the users of ratings should become better educated about what student ratings can and do measure.
Statistical adjustment. Several well-known numerical indexes incorporate statistical adjustments that correct for extraneous influences. Some examples: (a) The US
monthly
unemployment index is statistically adjusted to correct for seasonal
employment fluctuations in
industries such as tourism, construction, and agriculture; without that
correction it would be
inappropriate to directly compare index values for winter and summer months.
(b) IQ measures
are statistically adjusted by correcting raw performance scores for the test
taker's chronological
age; without that correction it would be inappropriate to compare IQs of
people of different ages.
(c) Computerized rankings of college football teams in the US are
statistically adjusted by
correcting won-lost records for the difficulty of the opponents played;
without that correction it
would be inappropriate to compare teams in different conferences. In the same
fashion, student
ratings of instruction can be statistically adjusted in order to correct for
unwanted influences of
grading policy (higher grades produce higher ratings), class size (larger
classes get lower ratings),
and potentially other unwanted influences. Such adjustments are now beginning to be used at the University of Washington.
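To make the adjustment idea concrete, here is a minimal sketch of one regression-based approach: predict each course's rating from expected grade and class size, then report the campus mean plus the unexplained residual. The model, variable names, and sample data are assumptions for this illustration, not the actual procedure adopted at UW.

    import numpy as np

    # Hypothetical data: one row per course section (not UW data).
    raw_rating = np.array([4.6, 4.1, 3.2, 3.9, 2.8, 4.4])    # mean rating, 1-5 scale
    exp_grade  = np.array([3.8, 3.5, 2.9, 3.4, 2.7, 3.7])    # mean expected GPA
    class_size = np.array([25., 40., 180., 60., 220., 30.])  # enrollment

    # Predictors: intercept, expected grade, and log of class size
    # (assuming the size effect is plausibly diminishing).
    X = np.column_stack([np.ones_like(exp_grade), exp_grade, np.log(class_size)])

    # Ordinary least squares: the part of each rating that is predictable
    # from grading leniency and class size alone.
    beta, *_ = np.linalg.lstsq(X, raw_rating, rcond=None)
    predicted = X @ beta

    # Adjusted rating = campus mean + the residual, i.e. the part of
    # the rating NOT explained by grades or class size.
    adjusted = raw_rating.mean() + (raw_rating - predicted)

    for raw, adj in zip(raw_rating, adjusted):
        print(f"raw {raw:.2f} -> adjusted {adj:.2f}")

Under a scheme like this, a strictly graded large lecture whose raw rating sits below the campus mean can come out at or above the mean once the influences of grading policy and class size are removed.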
Interpreting ratings. Quality of instruction in a
course has two components: (a) productivity --
how much students learn from the course, and (b) satisfaction -- how
much students enjoy taking
the course. These two types of outcome (productivity and satisfaction) are
found in many
situations. In industry, managers are evaluated in terms of both how
productive and how happy
their subordinates are. In medicine, doctors are evaluated both for
effectiveness of treatment and
for bedside manner (which translates to patient satisfaction). In
professional sports, managers are
evaluated both for the team's won-and-lost record and for their players'
morale.
Productivity may appear to be the bottom line, but satisfaction is also
important -- in part because
satisfaction often affects productivity. For example, workers who don't enjoy
their jobs may quit,
patients who dislike the doctor's bedside manner may not show up for
appointments, and athletes
with low morale may perform below their peak levels. Students' enjoyment of a
course is
important in much the same fashion. Students who don't enjoy a course may
lose interest either in
the course's subject matter or, worse, in education generally.
It is obviously desirable for student ratings to measure both productivity and satisfaction.
However, present-day ratings methods do a much better job of measuring
satisfaction than
productivity. Student ratings surveys may try to measure productivity, for
example by including
questions such as, "How much did you learn from the course?" However, the use
of such
questions (a) presumes (dubiously) that students are competent to assess what
they have learned
and (b) overlooks much evidence that responses to such questions are distorted
(by influences
known technically as halo effects and self-serving attributions).
Greenwald and Gillmore participate in UW's Faculty Council on Instructional Quality and maintain a continuing research program directed at improving student ratings as measures of instructional quality.