Scoring Students’ Critical Thinking at Scale

16 June 2026
What we learned from a side-by-side comparison of AI and faculty evaluations of the same student responses—how well did their assessments align?

Sponsored Content

  • Assessing critical thinking is time-intensive and difficult to do consistently at scale.
  • An interrater reliability study across five disciplines found that AI’s scoring aligned with faculty scoring as consistently as one instructor’s scoring aligned with another’s.
  • The findings reveal what AI can do and what rubric design must get right for human or AI scoring to be reliable.

 

Educators now face a daunting problem: How can they best assess students’ critical thinking amid the rise of generative AI? GenAI is being positioned as a solution to nearly every problem in higher education, and open-ended assessment scoring is no exception. For program directors and assessment coordinators already managing competing accreditation demands and limited staff, that proposition is worth taking seriously.

AACSB's accreditation standards require programs to demonstrate whether students can analyze, evaluate, and apply what they have learned in complex situations. Employers are asking for the same thing: According to a recent report from the National Association of Colleges and Employers, critical thinking and problem-solving consistently rank among the most sought-after competencies in new graduates.

The challenge is that measuring critical thinking rigorously requires open-ended responses, and open-ended responses require human judgment to score, which consumes substantial faculty time. In addition, well-intentioned faculty raters applying the same rubric will not always agree, and even a single reviewer may score the same item differently from one day to the next.

In fact, research on interrater reliability in open-ended assessment consistently shows that some degree of scoring variation among humans is normal. But “normal” does not mean inconsequential, particularly when programs are trying to use assessment data to drive continuous improvement.

The result is a genuine tension: The most meaningful form of critical thinking assessment is also the least scalable. For programs with large enrollments, limited assessment staff, or faculty who are already stretched thin across multiple responsibilities, the administrative burden of open-ended scoring can undermine the entire effort. Assessments get delayed, sample sizes shrink, and the data that reaches program review isn’t what it should be.

AI scoring has emerged as a potential solution. But for the institutions we work with, the question is whether AI is trustworthy. At Peregrine Global Services, we did not want to assume AI scoring worked, so we evaluated it.

Designing a Meaningful Test

We set out to answer the question: Do faculty raters tend to agree with AI about as much as they agree with other faculty raters?

The benchmark mattered because we were not asking whether AI is perfect. Instead, we were asking whether it is comparable in reliability to a human peer doing the same work. If human raters do not always agree with each other, then a reasonable standard is for AI to perform within that same range of variability.

To explore this, we designed an interrater reliability study using 50 open-ended learner responses from each of five disciplines: business integration and strategic management, communications, ethics, global dimensions of business, and leadership. Each set of responses was scored independently by 10 to 11 faculty subject-matter raters; they used Peregrine’s five-criteria rubric, which uses an ordinal 0–4 scale (0=Not Attempted, 1=Novice, 2=Developing, 3=Proficient, and 4=Exemplary). AI scored the same responses using only the rubric as guidance.

To analyze agreement, we used three complementary metrics, with Quadratic Weighted Kappa (QWK) serving as the primary metric because the scores were based on ordered rubric levels. QWK accounts for the magnitude of disagreement, treating a one-level difference (3 versus 4) as less severe than a larger difference (1 versus 4).
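To make the idea of magnitude-sensitive agreement concrete, here is a minimal sketch of the standard QWK formula for two raters using the 0–4 rubric scale. It is an illustration only, not Peregrine's analysis code; the function name and the sample scores are hypothetical.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels=5):
    """Quadratic Weighted Kappa for two raters on an ordinal 0..num_levels-1 scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of responses.")
    n = len(rater_a)

    # observed[i][j] counts responses scored i by rater A and j by rater B.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal score counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]

    max_distance = (num_levels - 1) ** 2
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Penalty grows with the square of the gap between the two scores.
            weight = (i - j) ** 2 / max_distance
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters scored independently.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        # Both raters used one identical level throughout; kappa is formally
        # undefined, and this sketch reports 1.0 because the scores match.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores: one response differs by one level, then by three levels.
faculty = [1, 2, 3, 4, 3, 2, 1, 4]
near_miss = [1, 2, 3, 3, 3, 2, 1, 4]  # 4 versus 3 on the fourth response
far_miss = [1, 2, 3, 1, 3, 2, 1, 4]   # 4 versus 1 on the fourth response
print(quadratic_weighted_kappa(faculty, near_miss))
print(quadratic_weighted_kappa(faculty, far_miss))
```

In this made-up example, the single three-level disagreement lowers the statistic far more than the single one-level disagreement does, which is the property that makes QWK suitable for ordered rubric levels.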

We also used intraclass correlation to examine scoring reliability across raters and scoring sources, and Lin’s concordance correlation to evaluate how closely paired scores aligned. Together, these measures provided a more complete view of agreement than any single statistic could provide.
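As a rough illustration of what Lin's concordance correlation adds, the sketch below implements the textbook formula for paired scores. It penalizes both scatter and any systematic offset between two scoring sources. The function name and the sample scores are hypothetical, and this is not the code used in the study.

```python
def lins_concordance(scores_x, scores_y):
    """Lin's concordance correlation coefficient for paired scores."""
    if len(scores_x) != len(scores_y) or len(scores_x) < 2:
        raise ValueError("Need at least two paired scores.")
    n = len(scores_x)
    mean_x = sum(scores_x) / n
    mean_y = sum(scores_y) / n

    # Population (divide-by-n) variances and covariance, as in Lin's definition.
    var_x = sum((x - mean_x) ** 2 for x in scores_x) / n
    var_y = sum((y - mean_y) ** 2 for y in scores_y) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores_x, scores_y)) / n

    # The squared mean difference penalizes one source scoring consistently
    # higher or lower than the other.
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("Concordance is undefined when all scores are identical constants.")
    return 2 * cov_xy / denominator


# Hypothetical example: the second source is always exactly one level higher.
source_a = [0, 1, 2, 3]
source_b = [1, 2, 3, 4]
print(lins_concordance(source_a, source_a))  # identical scores
print(lins_concordance(source_a, source_b))  # perfectly correlated, but offset
```

The two hypothetical sources in the second call rank every response identically, so an ordinary correlation would be perfect, yet the concordance value falls below 1 because the scores sit one level apart. That is why a concordance measure complements a rank-sensitive statistic.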

What We Found

Across all five disciplines, AI agreement with human raters was either comparable to or higher than human-to-human agreement. When raters scored a response as strong, AI tended to score it as strong. When raters found a response weak, AI did, too.

That finding was encouraging, but the study also surfaced a consistent friction point: Criterion 2 of our five-criteria rubric. This criterion asks scorers to evaluate whether learners have identified assumptions, recognized biases, explored alternate perspectives, and addressed ethical considerations. Here, we saw weaker agreement across multiple disciplines, for both human-to-human and human-to-AI comparisons.

Criterion 2 is what researchers call “double-barreled” because it asks scorers to evaluate more than one construct at the same time. Different raters, entirely in good faith, weigh those constructs differently. The data made the rubric’s ambiguity visible.

We were also able to validate the study findings against production data from live scoring. Peregrine's assessment used AI scoring for more than two years while in the beta phase. Across 7,406 scored responses per criterion and 13 institutions, faculty left AI scores unchanged approximately 90 percent to 92 percent of the time. Criterion 2 had the lowest acceptance rate at 90.29 percent, with the highest volume of faculty overrides at 719. Of those overrides, 80.7 percent were upward adjustments: Faculty scored higher than AI.

The most common transition was AI scoring a response at 2 (Developing) while faculty scored it 3 (Proficient). This is exactly what we would expect given a criterion where it is easy for a human to credit learners for implied or partial evidence of perspective-taking, but harder for AI to infer the same without explicit instruction.
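The Criterion 2 production figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers reported in this article; the variable and function names are hypothetical, and the upward and downward counts are estimates reconstructed from a rounded percentage rather than reported values.

```python
# Figures reported in the article for Criterion 2.
TOTAL_RESPONSES = 7406   # scored responses per criterion
OVERRIDES = 719          # faculty overrides on Criterion 2
UPWARD_SHARE = 0.807     # share of overrides where faculty scored higher than AI


def acceptance_rate(total, overrides):
    """Fraction of AI scores that faculty left unchanged."""
    return (total - overrides) / total


def split_overrides(overrides, upward_share):
    """Estimate upward and downward override counts from a rounded share."""
    upward = round(overrides * upward_share)
    return upward, overrides - upward


rate = acceptance_rate(TOTAL_RESPONSES, OVERRIDES)
upward, downward = split_overrides(OVERRIDES, UPWARD_SHARE)

print(f"Acceptance rate: {rate:.2%}")  # prints "Acceptance rate: 90.29%"
print(f"Estimated upward overrides: {upward}")
print(f"Estimated downward overrides: {downward}")
```

The computed acceptance rate matches the reported 90.29 percent, which confirms that the override count and the acceptance rate describe the same set of 7,406 responses.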

The regression analysis also added an important nuance. When overall agreement was moderate, AI did not score on quite the same “ruler” as humans: It tended to evaluate more conservatively on lower-scoring responses and became more aligned on stronger responses. Understanding that pattern allows programs to design smarter and more sustainable workflows, where AI handles the scale problem while faculty still retain authority.

As an administrator at the University of Alaska Fairbanks put it, “The AI grading seemed to do a really good job providing assessment of students’ short-answer responses. It was easy and did not take much time to review the scores and make changes, if necessary.”

What This Finding Means

The Criterion 2 finding illustrates what happens when rubric ambiguity meets human or automated scoring. In a purely human-scored system, that ambiguity can hide in plain sight: Raters develop informal shared understandings through calibration and experience, and variability gets absorbed without being diagnosed. With AI in the loop, ambiguity becomes more measurable and, therefore, more visible.

For program directors and assessment coordinators, this is a useful frame: Rubrics that produce inconsistent AI scores are often rubrics that produce inconsistent human scores, too. Double-barreled criteria, threshold-sensitive language, and constructs that require inferring unstated student reasoning create friction.

One of the most consistent questions we field from institutions is, “Can we trust the AI scoring enough to let it do its job?” Peregrine’s Critical Thinking Assessment gives programs full visibility into, and access to, AI scores. It is easy for faculty to review and override the scores, and the service does not prescribe how much review a program should do. Some programs review every response, while others engage selectively.

The production data supports both approaches. As the findings above show, faculty who engage with the review process find that AI scores hold up the vast majority of the time.

What Comes Next

This study is the beginning of our exploration of AI’s ability to assess student work consistently. The findings are already informing concrete improvements: We have revised Criterion 2 to eliminate double-barreled language, refined AI prompting to require evidence-first scoring, and built a faculty feedback loop so that override patterns continue to drive rubric and prompt development over time. The logic mirrors any serious assurance of learning process: Test, measure, find what isn't working, and improve.

If you are considering AI scoring, the question worth asking is not whether AI is good enough; it is whether your program has clear rubrics, accessible data, and a review process that fits how faculty actually work. Those are the conditions that make AI scoring trustworthy. And as this study shows, when you build in the rigor to find out where AI scoring falls short, you learn something valuable either way.

About the Research

Peregrine’s Critical Thinking Assessment is currently used by institutions including Arkansas State University, Ohio University, East Central University, Guilford College, and Salem University in the U.S.; the University of Balamand in Lebanon; and many others. The interrater reliability study described in this article was conducted as an internal validation effort.

A full white paper with complete methodology, discipline-level findings, regression analysis, and recommended next steps is now available for download. Or connect with our team to learn more about how the Critical Thinking Assessment can work within your program’s accreditation and continuous improvement goals.

Authors
Desiree Moore
Director of Business Development, Peregrine Global Services
Michael Napolitano
Vice President of Technology, Peregrine Global Services
The views expressed by contributors to AACSB Insights do not represent an official position of AACSB, unless clearly stated.