Scoring Students’ Critical Thinking at Scale
Sponsored Content
- Assessing critical thinking is time-intensive and difficult to do consistently at scale.
- An interrater reliability study across five disciplines found that AI’s scoring aligned with faculty scoring as consistently as one instructor’s scoring aligned with another’s.
- The findings reveal what AI can do and what rubric design must get right for human or AI scoring to be reliable.
Educators now face a daunting problem: How can they best assess students’ critical thinking amid the rise of generative AI? GenAI is being positioned as a solution to nearly every problem in higher education, and open-ended assessment scoring is no exception. For program directors and assessment coordinators already managing competing accreditation demands and limited staff, that proposition is worth taking seriously.
AACSB’s accreditation standards require programs to demonstrate whether students can analyze, evaluate, and apply what they have learned in complex situations. Employers are asking for the same thing: According to a recent report from the National Association of Colleges and Employers, critical thinking and problem-solving consistently rank among the most sought-after competencies in new graduates.
The challenge is that measuring critical thinking rigorously requires open-ended responses, and scoring open-ended responses requires human judgment, which consumes substantial faculty time. Moreover, well-intentioned faculty raters applying the same rubric will not always agree. Even a single reviewer may score items differently from one day to the next.
In fact, research on interrater reliability in open-ended assessment consistently shows that some degree of scoring variation among humans is normal. But “normal” does not mean inconsequential, particularly when programs are trying to use assessment data to drive continuous improvement.
The result is a genuine tension: The most meaningful form of critical thinking assessment is also the least scalable. For programs with large enrollments, limited assessment staff, or faculty who are already stretched thin across multiple responsibilities, the administrative burden of open-ended scoring can undermine the entire effort. Assessments get delayed, sample sizes shrink, and the data that reaches program review isn’t what it should be.
AI scoring has emerged as a potential solution. But for the institutions we work with, the question is whether AI is trustworthy. At Peregrine Global Services, we did not want to assume AI scoring worked, so we evaluated it.
Designing a Meaningful Test
We set out to answer the question: Do faculty raters tend to agree with AI about as much as they agree with other faculty raters?
The benchmark mattered because we were not asking whether AI is perfect. Instead, we were asking whether it is comparable in reliability to a human peer doing the same work. If human raters do not always agree with each other, then it would be ideal for AI to perform within that same range of variability.
To explore this, we designed an interrater reliability study using 50 open-ended learner responses from each of five disciplines: business integration and strategic management, communications, ethics, global dimensions of business, and leadership. Each set of responses was scored independently by 10 to 11 faculty subject-matter raters using Peregrine’s five-criteria rubric, which is scored on an ordinal 0–4 scale (0=Not Attempted, 1=Novice, 2=Developing, 3=Proficient, and 4=Exemplary). AI scored the same responses using only the rubric as guidance.
To analyze agreement, we used three complementary metrics, with Quadratic Weighted Kappa (QWK) serving as the primary metric because the scores were based on ordered rubric levels. QWK accounts for the magnitude of disagreement, treating a one-level difference (3 versus 4) as less severe than a larger difference (1 versus 4).
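To make the weighting concrete, here is a minimal Python sketch of QWK for two raters on the 0–4 rubric scale. It is an illustration only: the function name and the sample scores are hypothetical, and it is not the code used in the study.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score=0, max_score=4):
    """Quadratic Weighted Kappa for two lists of integer rubric scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same, non-empty set of responses.")

    k = max_score - min_score + 1   # number of rubric levels (5 for a 0-4 scale)
    n = len(rater_a)

    # Observed counts: observed[i][j] = responses scored i by rater A and j by rater B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal totals for each rater, used to build chance-expected counts.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic weight: a 3-level gap costs 9 times as much as a 1-level gap.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        # Both raters gave one identical score to every response; kappa is
        # undefined here, and this sketch treats it as perfect agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores: each comparison contains exactly one disagreement.
faculty   = [1, 2, 3, 4, 1, 2, 3, 4]
near_miss = [1, 2, 3, 3, 1, 2, 3, 4]   # one response scored 3 instead of 4
far_miss  = [1, 2, 3, 1, 1, 2, 3, 4]   # the same response scored 1 instead of 4

near = quadratic_weighted_kappa(faculty, near_miss)
far = quadratic_weighted_kappa(faculty, far_miss)
print(f"one-level miss: {near:.2f}, three-level miss: {far:.2f}")
```

With these made-up scores, the one-level miss yields a kappa of roughly 0.94, while the three-level miss drops it to 0.55, even though each comparison contains a single disagreement.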
We also used intraclass correlation to examine scoring reliability across raters and scoring sources, and Lin’s concordance correlation to evaluate how closely paired scores aligned. Together, these measures provided a more complete view of agreement than any single statistic could provide.
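As a rough illustration of what Lin’s concordance correlation adds, the sketch below computes it for paired scores. It assumes the standard formulation with population (1/n) variances; the function name and the sample scores are hypothetical, and the study’s actual computation may differ in detail.

```python
def lins_concordance(x, y):
    """Lin's concordance correlation coefficient for paired scores."""
    if len(x) != len(y) or not x:
        raise ValueError("Inputs must be paired and non-empty.")

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Population (1/n) variances and covariance.
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("Concordance is undefined when all scores are identical.")

    # The (mean_x - mean_y)^2 term penalizes a systematic offset between
    # the two scoring sources, which plain correlation ignores.
    return 2 * cov_xy / denominator


# Hypothetical scores: the second source ranks responses identically
# but scores every one of them a full level lower.
faculty_scores = [2, 3, 4]
shifted_scores = [1, 2, 3]

ccc = lins_concordance(faculty_scores, shifted_scores)
print(f"concordance with a one-level offset: {ccc:.3f}")
```

In this toy case an ordinary correlation would be a perfect 1.0, because the ordering is identical, but the concordance is 4/7 (about 0.571) because one source scores consistently lower. That sensitivity to offset is why a concordance measure is useful alongside agreement and correlation statistics.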
What We Found
Across all five disciplines, AI agreement with human raters was either comparable to or higher than human-to-human agreement. When raters scored a response as strong, AI tended to score it as strong. When raters found a response weak, AI did, too.
That finding was encouraging, but the study also surfaced a consistent friction point: Criterion 2 of our five-criteria rubric. This criterion asks scorers to evaluate whether learners have identified assumptions, recognized biases, explored alternate perspectives, and addressed ethical considerations. Here, we saw weaker agreement across multiple disciplines, for both human-to-human and human-to-AI comparisons.
Criterion 2 is what researchers call “double-barreled” because it asks scorers to evaluate more than one construct at the same time. Different raters, entirely in good faith, weigh those constructs differently. The data made the rubric’s ambiguity visible.
We were also able to validate the study findings against production data from live scoring. Peregrine’s assessment used AI scoring for more than two years while in the beta phase. Across 7,406 scored responses per criterion and 13 institutions, faculty left AI scores unchanged approximately 90 percent to 92 percent of the time. Criterion 2 had the lowest acceptance rate at 90.29 percent, with the highest volume of faculty overrides at 719. Of those overrides, 80.7 percent were upward adjustments: Faculty scored higher than AI.
The most common transition was AI scoring a response at 2 (Developing) while faculty scored it 3 (Proficient). This is exactly what we would expect given a criterion where it is easy for a human to credit learners for implied or partial evidence of perspective-taking, but harder for AI to infer the same without explicit instruction.
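The Criterion 2 figures reported above fit together arithmetically, which the short check below makes explicit. The variable names are ours, and the count of upward overrides is derived from the reported percentage rather than stated in the production data.

```python
# Figures reported for Criterion 2 in the production data.
scored_responses = 7406
faculty_overrides = 719
upward_share = 0.807  # share of overrides where faculty scored higher than AI

# Acceptance rate: responses where faculty left the AI score unchanged.
acceptance_rate = (scored_responses - faculty_overrides) / scored_responses

# Derived (not reported): approximate number of upward overrides.
upward_overrides = round(faculty_overrides * upward_share)

print(f"acceptance rate: {acceptance_rate:.2%}")
print(f"approximate upward overrides: {upward_overrides}")
```

The acceptance rate works out to 90.29 percent, matching the reported figure, and the upward share implies about 580 of the 719 overrides raised the AI score.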
The regression analysis also added an important nuance. When overall agreement was moderate, AI did not score on quite the same “ruler” as humans: It tended to evaluate more conservatively on lower-scoring responses and became more aligned on stronger responses. Understanding that pattern allows programs to design smarter and more sustainable workflows, where AI handles the scale problem while faculty still retain authority.
As an administrator at the University of Alaska Fairbanks put it, “The AI grading seemed to do a really good job providing assessment of students’ short-answer responses. It was easy and did not take much time to review the scores and make changes, if necessary.”
What This Finding Means
The Criterion 2 finding illustrates what happens when rubric ambiguity meets human or automated scoring. In a purely human-scored system, that ambiguity can hide in plain sight: Raters develop informal shared understandings through calibration and experience, and variability gets absorbed without being diagnosed. With AI in the loop, ambiguity becomes more measurable and, therefore, more visible.
For program directors and assessment coordinators, this is a useful frame: Rubrics that produce inconsistent AI scores are often rubrics that produce inconsistent human scores, too. Double-barreled criteria, threshold-sensitive language, and constructs that require inferring unstated student reasoning create friction.
One of the most consistent questions we field from institutions is, “Can we trust the AI scoring enough to let it do its job?” Peregrine’s Critical Thinking Assessment gives programs full visibility into, and access to, AI scores. It is easy for faculty to review and override the scores, and the service does not prescribe how much review a program should do. Some programs review every response, while others engage selectively.
The production data supports both approaches. As the findings above show, faculty who engage with the review process find that AI scores hold up the vast majority of the time.
What Comes Next
This study is the beginning of our exploration of AI’s ability to assess student work consistently. The findings are already informing concrete improvements: We have revised Criterion 2 to eliminate double-barreled language, refined AI prompting to require evidence-first scoring, and built a faculty feedback loop so that override patterns continue to drive rubric and prompt development over time. The logic mirrors any serious assurance of learning process: Test, measure, find what isn’t working, and improve.
If you are considering AI scoring, the question worth asking is not whether AI is good enough; it is whether your program has clear rubrics, accessible data, and a review process that fits how faculty actually work. Those are the conditions that make AI scoring trustworthy. And as this study shows, when you build in the rigor to find out where it falls short, you learn something valuable either way.
About the Research
Peregrine’s Critical Thinking Assessment is currently used by institutions including Arkansas State University, Ohio University, East Central University, Guilford College, and Salem University in the U.S.; the University of Balamand in Lebanon; and many others. The interrater reliability study described in this article was conducted as an internal validation effort.
A full white paper with complete methodology, discipline-level findings, regression analysis, and recommended next steps is now available for download. Or connect with our team to learn more about how the Critical Thinking Assessment can work within your program’s accreditation and continuous improvement goals.