The Grade Debate: Absolute and Relative Measures
Schools and universities in many parts of the world seek to measure students’ academic performance on an absolute basis and specify pre-determined thresholds that students must reach to pass a course or earn academic distinctions. When students from these institutions face the “forced curve” of relative grading, which MBA programs commonly employ around the world, their reactions range from bafflement to skepticism.
When the entire class does well, these students ask, isn’t relative grading unfair? Conversely, what happens when nobody does well, perhaps due to deliberate collusion among students? In either case, doesn’t the system pit student against student, and thus interfere with collaborative learning and community cohesion, two virtues that business schools often extol as attributes of their MBA student communities?
Faculty periodically debate the merits of different grading philosophies and occasionally act to change grading systems, but they tend to refer student questions about grading to program deans. Based on my MBA teaching experience at several universities in different parts of the world, and my experience as a program dean, I respond to students’ questions with a series of observations.
The Arguments for Relative Grading
I begin by pointing out that the difference between absolute and relative grading is the reference group. Though we tend to think of absolute performance as being independent of how others perform, in practice we professors measure performance against our own particular, finite experience with students. Absolute grading is predicated on the idea that we can assess a student’s performance relative to a much wider reference group (the entire human population), not just against other students taking the course.
But unless we statistically control the difficulty of our tests, all we can reliably assess is relative performance. Absolute grading is an elusive goal because it is difficult to know how the world’s population might perform in our courses. Much of my education in India was predicated on the premise that it is possible to measure absolute performance through a comprehensive system of exams. Exams were set to a scale of 100, with a priori thresholds for passing and for placing in different “divisions” of performance. It was just our good or bad luck if the exam we took was easy or hard. (Naturally, we always felt that our exams were harder than past exams.)
Designers of standardized tests such as the GMAT rely on statistical techniques to control the difficulty level of the tests so that a score of, say, 740, reflects the same ability/aptitude on each administration of the test. The principal disadvantage of these statistical calibrations is that they require test questions to have objective answers, ruling out qualitative, nuanced questions. This drawback renders standardized tests unsuitable for most MBA courses.
In large courses, the best way to do absolute grading is via relative grading. This sounds paradoxical and warrants an explanation. In all courses, the determinants of a student’s performance include ability, aptitude, motivation, and effort. In courses with large populations of students, which is typical of MBA core courses, these factors exhibit much greater stability over time—stationarity in statistical terms—than the measuring instruments we use, such as tests and quizzes. Put another way, if the mean score on an exam in a large class this year is substantially greater than it was last year, it is most likely not because this year’s class is more motivated or capable than last year’s; it is because the exam was easier.
This stationarity of the student population, coupled with the Law of Large Numbers, implies that the professor teaching a course with large enrollments merely needs assessments that elicit dispersion or spread. Once dispersion is achieved, the absolute difficulty of the tests does not matter. This is because a particular relative performance (such as a score that is 1.8 standard deviations above, or 0.3 standard deviations below, the mean) is likely to reflect the same absolute performance from one year to the next.
This is the virtue of relative grading: It frees the professor from having to make careful statistical calibration of tests to ensure consistency from year to year; all that is required is dispersion! As someone who has written exams for over three decades, I can say that it is vastly more challenging to create tests of consistent difficulty than to achieve dispersion. I am not suggesting that faculty abandon writing exams of consistent difficulty. My point is that standardized tests, which offer the only way to create exams of reliably consistent difficulty, are too restrictive for measuring student learning.
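The invariance argument above can be illustrated with a small sketch. The scores are hypothetical, and the easier exam is assumed to lift every score by the same 20 points, which is an idealization: a real change in difficulty need not shift or rescale all scores uniformly. Under that assumption, each student's distance from the mean in standard deviations is unchanged even though the raw scores differ.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Return each score's distance from the class mean, in standard deviations."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # Without dispersion there is nothing to rank students on.
        raise ValueError("no dispersion: all scores are identical")
    return [(s - mu) / sigma for s in scores]


# Hypothetical raw scores for the same five students on a hard exam.
hard_exam = [42, 50, 55, 61, 72]

# Assumption: an easier exam lifts every score by exactly 20 points.
easy_exam = [s + 20 for s in hard_exam]

z_hard = z_scores(hard_exam)
z_easy = z_scores(easy_exam)

for raw_hard, raw_easy, z in zip(hard_exam, easy_exam, z_hard):
    print(f"hard={raw_hard:3d}  easy={raw_easy:3d}  z={z:+.2f}")
```

The check for zero spread reflects the point made above: dispersion is the one property the exam must deliver, because identical scores give the professor nothing to rank.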
Faculty tend to have divergent views about the relevant reference group; hence, enforcing a consistent reference group is useful. Even when they agree on relative grading as the right grading approach, faculty implicitly or explicitly have different reference groups in mind. Some of us might think, for instance, that students at our institution are better than those elsewhere! We might therefore worry that a low GPA unfairly disadvantages the employment prospects of our students. Others may wish to compare current students with past students they have taught.
A grading system with a specified curve explicitly signals that the reference group is only the current class and that grades reflect nothing more than the students’ relative performance within the class. And despite its limited validity, a relative grade is also the best proxy for absolute performance in large courses because of the stationarity mentioned above. Specifying the grade distribution offers twin virtues: It makes student grades consistent and comparable across courses and faculty within the institution, and it curbs grade inflation.
Both problems—inconsistency across courses/professors and grade inflation—have afflicted universities in many parts of the world. Concerned about rampant grade inflation, many universities have begun to provide contextual information about students’ grades. (These concerns have been highlighted in Robert Zaretsky’s 2013 article in The Chronicle of Higher Education, a 2014 article in The Economist, and Aina Katsikas’ 2015 piece in The Atlantic.) Some Canadian universities, for instance, note the average grade received in the course next to the student’s grade. As Lauren Sieben notes in another article in The Chronicle, since 2011, University of North Carolina transcripts have included the “schedule point average” for each term to indicate the median GPA earned in the specific courses the student took that term.
This contextual information helps the reader of the transcript translate the student’s letter grade into a measure of relative performance. It also allows the reader of the transcript to ascribe a value judgment to letter grades, mitigating the bias that might otherwise be brought to their meaning.
Letter grades often do not help. The use of letter grades signals that student performance cannot be meaningfully differentiated any more finely than what is allowed by the small number of letter grades used in the grading system. Letter grades are also an acknowledgement that our assessment instruments are prone to measurement error. Most faculty, however, first measure student performance on a numerical scale more fine-grained than the letter scale, and then lump the numerical scores into a small number of letter grade bins. This compounds measurement error with aggregation error.
A simple way to avoid these problems is to report just the percentile rank of each student relative to others in the class. In small classes, it might be sufficient to report the percentile rank in quartiles or thirds. The percentile rank is also free of distributional assumptions that underlie other measures of relative standing.
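As a rough illustration of percentile reporting, the sketch below computes a percentile rank for each student and then collapses it into quartiles for a small class. The scores are hypothetical, and the handling of ties (counting half of the students tied at a score) is one common convention among several, not a prescribed method.

```python
def percentile_ranks(scores):
    """Percentile rank of each score within the class.

    Convention assumed here: the share of the class scoring strictly
    below, plus half the share tied at that score.
    """
    n = len(scores)
    ranks = []
    for s in scores:
        below = sum(1 for t in scores if t < s)
        tied = sum(1 for t in scores if t == s)
        ranks.append(100.0 * (below + 0.5 * tied) / n)
    return ranks


def quartile(rank):
    """Map a percentile rank to a quartile, from 1 (bottom) to 4 (top)."""
    return min(4, int(rank // 25) + 1)


# Hypothetical scores for a class of eight.
scores = [55, 61, 61, 70, 78, 84, 90, 93]
ranks = percentile_ranks(scores)

for s, r in zip(scores, ranks):
    print(f"score={s}  percentile={r:5.2f}  quartile={quartile(r)}")
```

Because the calculation uses only the ordering of scores, any rescaling of the exam that preserves the order leaves every percentile rank unchanged, which is the sense in which the measure is free of distributional assumptions.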
The Law of Large Numbers does not hold for small sections, nor for advanced electives, where students self-select. This is why many schools, including my own, accord greater flexibility to the grading of students in electives and small classes. Though large departures from the recommended curve can lead to grade inflation and distort incentives for students, a strict imposition of the curve is not justifiable, at least on statistical grounds.
Competition and collaboration are not mutually exclusive, and both can contribute to learning. Students often believe that relative grading engenders unhealthy competition among them. In many MBA courses, students collaborate within teams and compete with other teams. Indeed, the better they collaborate within the team, the better their team is able to compete against other teams.
Moreover, absolute grading systems do not eliminate competition. Though my high school and undergraduate grading systems were supposedly absolute, all important consequences of academic performance, such as access to graduate schools or jobs, depended on a student’s rank relative to other students. Across cultures, relative performance seems to matter in competitive situations, whether or not the grading systems are explicitly relative.
No Perfect System
Grading philosophies are perennial topics of discussion at universities. My comments here do not address the broader debate about the purpose of grades and their relationship to learning outcomes, the causes and consequences of grade inflation, and the effect of grades on intrinsic motivation.
There is no perfect grading system; each has limitations and unintended consequences. The exams, quizzes, and term papers we use to test students’ knowledge do an incomplete job of assessing how well they have synthesized what we teach them and how likely they are to apply that knowledge in the future. Yet grades, and the private feedback they provide to students, are a useful part of the learning process. The instruments with which we arrive at grades signal to students what we regard as important ideas, and they enable us to give students more substantive feedback than just a set of numbers or letters.
Though the philosophies that underpin them shift over time, grades and grading systems are not likely to go away. Many students and academics consider absolute grading the Platonic ideal; to them the forced curve of relative grading appears as social Darwinism. Yet, in many MBA courses, relative grading has virtues that make it not only less problematic than absolute grading, but also the best approximation to absolute grading—bringing it closer to the ideal we all seek.