Scoring Students’ Critical Thinking at Scale

16 June 2026

Photo by iStock/gece33

What we learned from a side-by-side comparison of AI and faculty evaluations of the same student responses, and how well those assessments aligned.

Designing a Meaningful Test

We set out to answer the question: Do faculty raters tend to agree with AI about as much as they agree with other faculty raters?

The benchmark mattered because we were not asking whether AI is perfect. Instead, we were asking whether it is comparable in reliability to a human peer doing the same work. If human raters do not always agree with each other, then it would be ideal for AI to perform within that same range of variability.

To explore this, we designed an interrater reliability study using 50 open-ended learner responses from each of five disciplines: business integration and strategic management, communications, ethics, global dimensions of business, and leadership. Each set of responses was scored independently by 10 to 11 faculty subject-matter raters using Peregrine’s five-criteria rubric, which is built on an ordinal 0–4 scale (0 = Not Attempted, 1 = Novice, 2 = Developing, 3 = Proficient, 4 = Exemplary). AI scored the same responses using only the rubric as guidance.

To analyze agreement, we used three complementary metrics, with Quadratic Weighted Kappa (QWK) serving as the primary metric because the scores were based on ordered rubric levels. QWK accounts for the magnitude of disagreement, treating a one-level difference (3 versus 4) as less severe than a larger difference (1 versus 4).
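For readers who want to see the mechanics, the following is a minimal Python sketch of QWK on the 0–4 rubric scale, followed by the benchmark comparison the study was built around: faculty-to-faculty agreement versus faculty-to-AI agreement. The rater names and scores are hypothetical, and averaging the pairwise values is an assumed way to summarize the comparison, not necessarily the procedure used in the study.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int],
                             min_score: int = 0, max_score: int = 4) -> float:
    """Quadratic Weighted Kappa for two raters scoring the same responses."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must score the same non-empty set")
    if any(not (min_score <= s <= max_score) for s in list(a) + list(b)):
        raise ValueError("score outside the rubric scale")

    k = max_score - min_score + 1          # number of rubric levels
    n = len(a)

    # Observed counts: observed[i][j] = responses scored i by a and j by b.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1

    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic weight: a 3-level gap costs 9x a 1-level gap.
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = row_totals[i] * col_totals[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters used one identical level throughout; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


def benchmark(human_scores: Dict[str, List[int]],
              ai_scores: List[int]) -> Dict[str, float]:
    """Mean pairwise QWK among humans versus mean QWK of each human with AI."""
    human_pairs = [
        quadratic_weighted_kappa(human_scores[r1], human_scores[r2])
        for r1, r2 in combinations(human_scores, 2)
    ]
    ai_pairs = [
        quadratic_weighted_kappa(scores, ai_scores)
        for scores in human_scores.values()
    ]
    return {"human_human": mean(human_pairs), "human_ai": mean(ai_pairs)}


# Hypothetical scores for one criterion on eight responses.
faculty = {
    "rater_a": [3, 4, 2, 1, 3, 2, 4, 0],
    "rater_b": [3, 3, 2, 2, 3, 1, 4, 0],
    "rater_c": [4, 4, 2, 1, 2, 2, 3, 1],
}
ai = [3, 3, 2, 1, 3, 2, 4, 0]

result = benchmark(faculty, ai)
print(result)
```

If the "human_ai" value falls within or above the range of the "human_human" value, AI is behaving like another rater in the pool, which is the standard the study applied.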

We also used intraclass correlation to examine scoring reliability across raters and scoring sources, and Lin’s concordance correlation to evaluate how closely paired scores aligned. Together, these measures provided a more complete view of agreement than any single statistic could provide.
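As an illustration of why a concordance measure adds information, here is a short sketch of Lin’s concordance correlation coefficient using its standard definition with population (1/n) moments. The function name and the sample scores are hypothetical. Intraclass correlation is not sketched here because it comes in several forms and the article does not say which one was used.

```python
from statistics import fmean
from typing import Sequence


def lins_ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient for paired scores."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be paired and non-empty")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)

    # Population variances and covariance (divide by n).
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

    # The squared mean difference penalizes a systematic offset
    # between the two scoring sources.
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        return float("nan")  # no variation at all; coefficient undefined
    return 2 * cov_xy / denominator


# Hypothetical paired scores: the second source is always one level higher.
source_a = [0, 1, 2, 3]
source_b = [1, 2, 3, 4]

print(lins_ccc(source_a, source_a))  # identical scores
print(lins_ccc(source_a, source_b))  # perfectly correlated but shifted
```

The shifted pair would have a perfect Pearson correlation, yet its concordance is below 1 because the scores never actually match. That is the gap a concordance measure is meant to expose.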

What We Found

Across all five disciplines, AI agreement with human raters was either comparable to or higher than human-to-human agreement. When raters scored a response as strong, AI tended to score it as strong. When raters found a response weak, AI did, too.

That finding was encouraging, but the study also surfaced a consistent friction point: Criterion 2 of our five-criteria rubric. This criterion asks scorers to evaluate whether learners have identified assumptions, recognized biases, explored alternate perspectives, and addressed ethical considerations. Here, we saw weaker agreement across multiple disciplines, for both human-to-human and human-to-AI comparisons.

Criterion 2 is what researchers call “double-barreled” because it asks scorers to evaluate more than one construct at the same time. Different raters, entirely in good faith, weigh those constructs differently. The data made the rubric’s ambiguity visible.

We were also able to validate the study findings against production data from live scoring. Peregrine’s assessment used AI scoring for more than two years during its beta phase. Across 7,406 scored responses per criterion and 13 institutions, faculty left AI scores unchanged approximately 90 to 92 percent of the time. Criterion 2 had the lowest acceptance rate, at 90.29 percent, and the highest volume of faculty overrides, at 719. Of those overrides, 80.7 percent were upward adjustments: Faculty scored higher than AI.
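The Criterion 2 figures can be checked with simple arithmetic. The sketch below assumes the 7,406 responses per criterion are the denominator for the 719 overrides, which is consistent with the reported acceptance rate. The count of upward adjustments is an estimate derived here from the reported percentage, not a number given in the article.

```python
TOTAL_RESPONSES = 7406      # scored responses per criterion (from the article)
OVERRIDES_C2 = 719          # faculty overrides on Criterion 2 (from the article)
UPWARD_SHARE = 0.807        # share of overrides that raised the score


def acceptance_rate(total: int, overrides: int) -> float:
    """Fraction of AI scores that faculty left unchanged."""
    return (total - overrides) / total


rate = acceptance_rate(TOTAL_RESPONSES, OVERRIDES_C2)
upward_estimate = round(OVERRIDES_C2 * UPWARD_SHARE)  # derived, not reported

print(f"Criterion 2 acceptance rate: {rate:.2%}")      # 90.29%
print(f"Estimated upward overrides: {upward_estimate}")  # 580
```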

The most common transition was AI scoring a response at 2 (Developing) while faculty scored it 3 (Proficient). This is exactly what we would expect given a criterion where it is easy for a human to credit learners for implied or partial evidence of perspective-taking, but harder for AI to infer the same without explicit instruction.

The regression analysis added an important nuance. When overall agreement was moderate, AI did not score on quite the same “ruler” as humans: It tended to evaluate more conservatively on lower-scoring responses and became more aligned on stronger responses. Understanding that pattern allows programs to design smarter and more sustainable workflows, in which AI handles the scale problem while faculty retain authority.
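One way to picture the “ruler” pattern is an ordinary least-squares line that predicts the AI score from the faculty score. The article does not specify the form of its regression, so the model below is an assumption, and the scores are hypothetical data constructed to show the pattern. A slope above 1 combined with a negative intercept means AI scores sit below faculty scores at the low end and converge at the high end.

```python
from statistics import fmean
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((v - mean_x) ** 2 for v in x)
    if sxx == 0:
        raise ValueError("x has no variation; slope is undefined")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical paired scores (faculty score, AI score) on the 0-4 scale.
faculty_scores = [1, 1, 2, 2, 3, 3, 4, 4]
ai_scores = [0, 1, 1, 2, 3, 3, 4, 4]

slope, intercept = fit_line(faculty_scores, ai_scores)

# Gap between the predicted AI score and the faculty score at each level.
for level in (1, 2, 3, 4):
    predicted_ai = slope * level + intercept
    print(level, round(predicted_ai - level, 2))
```

In this constructed example the gap is largest at level 1 and close to zero at level 4. Responses at the low end are therefore the ones where faculty review is likely to change the most scores.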

As an administrator at the University of Alaska Fairbanks put it, “The AI grading seemed to do a really good job providing assessment of students’ short-answer responses. It was easy and did not take much time to review the scores and make changes, if necessary.”

What This Finding Means

The Criterion 2 finding illustrates what happens when rubric ambiguity meets human or automated scoring. In a purely human-scored system, that ambiguity can hide in plain sight: Raters develop informal shared understandings through calibration and experience, and variability gets absorbed without being diagnosed. With AI in the loop, ambiguity becomes more measurable and, therefore, more visible.

For program directors and assessment coordinators, this is a useful frame: Rubrics that produce inconsistent AI scores are often rubrics that produce inconsistent human scores, too. Double-barreled criteria, threshold-sensitive language, and constructs that require inferring unstated student reasoning create friction.

One of the most consistent questions we field from institutions is, “Can we trust the AI scoring enough to let it do its job?” Peregrine’s Critical Thinking Assessment gives programs full visibility into, and access to, AI scores. Faculty can easily review and override the scores, and the service does not prescribe how much review a program should do. Some programs review every response, while others engage selectively.

The production data supports both approaches. As the findings above show, faculty who engage with the review process find that AI scores hold up the vast majority of the time.

What Comes Next

This study is the beginning of our exploration of AI’s ability to assess student work consistently. The findings are already informing concrete improvements: We have revised Criterion 2 to eliminate double-barreled language, refined AI prompting to require evidence-first scoring, and built a faculty feedback loop so that override patterns continue to drive rubric and prompt development over time. The logic mirrors any serious assurance-of-learning process: Test, measure, find what isn’t working, and improve.

If you are considering AI scoring, the question worth asking is not whether AI is good enough; it is whether your program has clear rubrics, accessible data, and a review process that fits how faculty actually work. Those are the conditions that make it trustworthy. And as this study shows, when you build in the rigor to find out where it falls short, you learn something valuable either way.

About the Research

Peregrine’s Critical Thinking Assessment is currently used by institutions including Arkansas State University, Ohio University, East Central University, Guilford College, and Salem University in the U.S.; the University of Balamand in Lebanon; and many others. The interrater reliability study described in this article was conducted as an internal validation effort.

A full white paper with complete methodology, discipline-level findings, regression analysis, and recommended next steps is now available for download. Or connect with our team to learn more about how the Critical Thinking Assessment can work within your program’s accreditation and continuous improvement goals.

Authors

Desiree Moore

Director of Business Development, Peregrine Global Services

Michael Napolitano

Vice President of Technology, Peregrine Global Services

The views expressed by contributors to AACSB Insights do not represent an official position of AACSB, unless clearly stated.
