pith. machine review for the scientific record.

arxiv: 2604.12227 · v1 · submitted 2026-04-14 · 💻 cs.AI · cs.CL

Recognition: unknown

Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL
keywords LLM-assisted scoring · rubric design · physics education · constructed responses · AI reliability · handwritten exams · GPT-4o

The pith

Reliable AI-assisted scoring of physics exam responses depends primarily on clear, well-structured rubrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests GPT-4o for scoring twenty authentic handwritten undergraduate physics exam responses that mix symbols, calculations, and diagrams. It varies rubric detail, prompting formats, and temperature settings while comparing AI scores to those from four human instructors. The results show that AI-human agreement reaches levels similar to human inter-rater reliability, especially with fine-grained skill-based rubrics that break down conceptual and procedural elements. Prompting and temperature play smaller roles than rubric clarity. This matters for making partial-credit scoring in STEM faster and more consistent.
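
The page does not reproduce the paper's pipeline, but the three-factor design it describes can be sketched as a condition sweep. Everything below is illustrative: the rubric texts, prompt templates, temperature values, and file name are hypothetical, and the OpenAI chat-completions API is assumed as the GPT-4o interface.

```python
import base64
import itertools

from openai import OpenAI  # assumed interface to GPT-4o; the paper's tooling is unspecified

client = OpenAI()

RUBRICS = {"holistic": "Award 0-10 points for overall correctness and reasoning.",
           "checklist": "Score each skill 0-2: forces identified, law applied, ..."}
PROMPT_FORMATS = {"plain": "Score this exam response using the rubric below.\n{rubric}",
                  "structured": "Return JSON mapping each criterion to a score.\n{rubric}"}
TEMPERATURES = [0.0, 0.7]  # illustrative values; the paper's settings are not listed here


def score_response(image_path: str, rubric: str, template: str, temperature: float) -> str:
    """One GPT-4o call scoring a scanned handwritten response (hypothetical pipeline)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": template.format(rubric=rubric)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


# Full factorial sweep over the three varied factors for a single response.
for rubric, fmt, temp in itertools.product(RUBRICS, PROMPT_FORMATS, TEMPERATURES):
    raw_score = score_response("response_01.png", RUBRICS[rubric],
                               PROMPT_FORMATS[fmt], temp)
```

Holding each response fixed while sweeping conditions is what lets the study attribute reliability differences to rubric structure rather than to the sample.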

Core claim

The paper establishes that AI-assisted scoring with GPT-4o, guided by skill-based rubrics of differing analytic granularity, achieves human-AI agreement on total scores comparable to human inter-rater reliability. Alignment is strongest for high- and low-performing responses and for clearly defined conceptual skills, but weaker for mid-level responses with partial or ambiguous reasoning. A checklist-based rubric improves consistency over holistic scoring, while prompting format has a secondary effect and temperature has limited impact.

What carries the argument

Skill-based rubrics that decompose student responses into specific conceptual and procedural skills at varying levels of granularity.
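
To make "analytic granularity" concrete, one plausible encoding of the two rubric levels is sketched below. Every criterion name and point value is invented for illustration; the paper's actual rubrics are not reproduced on this page.

```python
# Hypothetical encodings of a holistic rubric and a fine-grained checklist rubric.
holistic_rubric = {
    "overall_solution": {"max_points": 10,
                         "description": "Judge the whole response for physical "
                                        "correctness and quality of reasoning."},
}

checklist_rubric = {
    "identifies_relevant_forces": {"max_points": 2, "skill": "conceptual"},
    "applies_newtons_second_law": {"max_points": 2, "skill": "conceptual"},
    "sets_up_equations":          {"max_points": 2, "skill": "procedural"},
    "algebraic_manipulation":     {"max_points": 2, "skill": "procedural"},
    "final_answer_with_units":    {"max_points": 2, "skill": "procedural"},
}


def render_rubric(rubric: dict) -> str:
    """Flatten a rubric into prompt text, one line per criterion."""
    return "\n".join(
        f"- {name} (0-{c['max_points']} pts): {c.get('description', c.get('skill'))}"
        for name, c in rubric.items()
    )
```

The checklist form hands the model one narrow decision per criterion, which is the mechanism the findings below credit for the consistency gain over holistic scoring.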

If this is right

  • Human-AI agreement on total scores is comparable to human inter-rater reliability.
  • Agreement is highest for high- and low-performing responses but declines for mid-level ones.
  • Stronger alignment occurs for clearly defined conceptual skills than for extended procedural judgments.
  • A more fine-grained checklist rubric improves consistency relative to holistic scoring.
  • Prompting format plays a secondary role and temperature has relatively limited impact on reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Rubric-focused design could allow AI tools to handle grading workloads in large-enrollment physics courses.
  • The approach may extend to other subjects involving constructed responses with symbolic and visual elements.
  • Further tests on broader response samples would help confirm if the pattern holds beyond the twenty cases studied.
  • Embedding these rubrics in LLM systems could support consistent student feedback at scale.

Load-bearing premise

That the twenty handwritten responses capture representative student variation and that the four instructors establish a reliable human baseline.

What would settle it

A study with a larger sample of mid-performing responses showing substantially lower human-AI agreement than human-human agreement would challenge the claim.

read the original abstract

Student responses in STEM assessments are often handwritten and combine symbolic expressions, calculations, and diagrams, creating substantial variation in format and interpretation. Despite their importance for evaluating students' reasoning, such responses are time-consuming to score and prone to rater inconsistency, particularly when partial credit is required. Recent advances in large language models (LLMs) have increased attention to AI-assisted scoring, yet evidence remains limited regarding how rubric design and LLM configurations influence reliability across performance levels. This study examined the reliability of AI-assisted scoring of undergraduate physics constructed responses using GPT-4o. Twenty authentic handwritten exam responses were scored across two rounds by four instructors and by the AI model using skill-based rubrics with differing levels of analytic granularity. Prompting format and temperature settings were systematically varied. Overall, human-AI agreement on total scores was comparable to human inter-rater reliability and was highest for high- and low-performing responses, but declined for mid-level responses involving partial or ambiguous reasoning. Criterion-level analyses showed stronger alignment for clearly defined conceptual skills than for extended procedural judgments. A more fine-grained, checklist-based rubric improved consistency relative to holistic scoring. These findings indicate that reliable AI-assisted scoring depends primarily on clear, well-structured rubrics, while prompting format plays a secondary role and temperature has relatively limited impact. More broadly, the study provides transferable design recommendations for implementing reliable LLM-assisted scoring in STEM contexts through skill-based rubrics and controlled LLM settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that AI-assisted scoring of handwritten physics exam responses using GPT-4o achieves reliability comparable to human raters, with agreement highest for high- and low-performing students. It finds that fine-grained, checklist-based rubrics improve consistency over holistic scoring, and that rubric design is the primary factor influencing reliability, while prompting format is secondary and temperature has limited impact. The study is based on 20 authentic responses scored by four instructors and the AI under varied conditions.

Significance. This research is significant for advancing reliable use of LLMs in educational assessment of constructed responses in STEM fields. It offers transferable recommendations for rubric design and LLM configuration. The empirical comparison of human and AI scoring provides a foundation for future work, though the small sample size limits the generalizability of the factor importance claims.

major comments (2)
  1. [Abstract] The central claim that rubric design exerts primary influence on reliability (with prompting secondary and temperature limited) lacks supporting quantitative evidence such as effect size comparisons or statistical tests distinguishing the factors. The abstract provides only directional findings without details on agreement metrics or error analysis.
  2. [Methods] The selection of only twenty responses raises concerns about statistical power and representativeness for establishing the primacy of rubrics over other factors, as differences may not generalize beyond the specific exams chosen or the four instructors' scoring patterns.
minor comments (2)
  1. [Abstract] Specify the exact metrics used for 'human-AI agreement' and 'human inter-rater reliability' (e.g., correlation, kappa, or percentage agreement); the sketch after these comments illustrates two of the candidates.
  2. The abstract mentions criterion-level analyses but does not detail how 'clearly defined conceptual skills' were distinguished from 'extended procedural judgments'.
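
For reference on minor comment 1, a minimal sketch of two of the candidate metrics, exact percentage agreement and Cohen's kappa, computed on placeholder score vectors (not data from the paper; the paper's metric choice is not stated on this page):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Placeholder total scores for five responses; purely illustrative.
human_scores = np.array([5, 3, 0, 4, 2])
ai_scores = np.array([5, 2, 0, 4, 2])

pct_agreement = float(np.mean(human_scores == ai_scores))  # exact-match rate
kappa = cohen_kappa_score(human_scores, ai_scores)         # chance-corrected agreement
print(f"percent agreement = {pct_agreement:.2f}, kappa = {kappa:.2f}")
```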

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript examining LLM-assisted scoring of physics constructed responses. We address the major comments point by point below and have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central claim that rubric design exerts primary influence on reliability (with prompting secondary and temperature limited) lacks supporting quantitative evidence such as effect size comparisons or statistical tests distinguishing the factors. The abstract provides only directional findings without details on agreement metrics or error analysis.

    Authors: The abstract serves as a concise summary and space limits preclude full statistical detail. The manuscript reports quantitative agreement metrics (percentage agreement and Cohen's kappa) comparing rubric granularity, prompting formats, and temperature settings, with rubric design showing the largest observed differences in consistency. We have revised the abstract to include key agreement statistics and a brief reference to the comparative reliability improvements supporting the relative influence of factors. revision: yes

  2. Referee: [Methods] The selection of only twenty responses raises concerns about statistical power and representativeness for establishing the primacy of rubrics over other factors, as differences may not generalize beyond the specific exams chosen or the four instructors' scoring patterns.

    Authors: We agree that a sample of 20 responses constrains statistical power and limits strong claims of generalizability for factor primacy. This size enabled detailed, resource-intensive scoring of authentic handwritten responses across multiple conditions. We have expanded the Methods and Limitations sections to explicitly acknowledge this constraint, frame conclusions as initial evidence, and recommend larger-scale validation studies. revision: partial
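
A test of the kind major comment 1 requests could take the form sketched below: a bootstrap confidence interval on the difference in quadratic-weighted Cohen's kappa between the checklist and holistic rubric conditions. The weighting choice and all inputs are assumptions, not details from the paper.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)


def kappa_gap_ci(human, ai_checklist, ai_holistic, n_boot=10_000):
    """95% bootstrap CI for kappa(checklist) - kappa(holistic); positive favors the checklist."""
    human, ai_checklist, ai_holistic = map(np.asarray, (human, ai_checklist, ai_holistic))
    n = len(human)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample responses with replacement
        gaps[b] = (cohen_kappa_score(human[idx], ai_checklist[idx], weights="quadratic")
                   - cohen_kappa_score(human[idx], ai_holistic[idx], weights="quadratic"))
    # With only 20 responses, resamples can collapse to few distinct scores;
    # a real analysis should guard against such degenerate draws.
    return np.percentile(gaps, [2.5, 97.5])
```

An interval excluding zero would turn the rebuttal's "largest observed differences" into a quantified factor comparison.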

Circularity Check

0 steps flagged

No circularity: purely empirical comparison study

full rationale

This paper reports an empirical human-AI scoring comparison on 20 physics exam responses scored by four instructors and GPT-4o under varied rubric granularities, prompting formats, and temperatures. No derivations, equations, first-principles predictions, or fitted parameters appear; all claims rest on direct agreement metrics (human-AI vs. human inter-rater) computed from the same dataset. The central finding—that rubric structure exerts the largest effect—is an observed ordering of reliability differences, not a reduction to self-definition, self-citation, or renamed inputs. The study is self-contained against its own human baseline and does not invoke load-bearing prior results from the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical study with no mathematical model. Relies on the domain assumption that the chosen rubrics validly measure student reasoning skills and that the small set of 20 responses captures relevant variation.

axioms (1)
  • domain assumption: Skill-based rubrics with differing analytic granularity accurately reflect student conceptual and procedural understanding.
    The study treats rubric design as the primary variable and assumes the rubrics themselves are valid instruments.

pith-pipeline@v0.9.0 · 5560 in / 1320 out tokens · 27671 ms · 2026-05-10T15:41:02.333360+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

    cs.CL 2026-05 unverdicted novelty 7.0

    OralMLLM-Bench is a new benchmark with 27 tasks in four cognitive categories that evaluates six MLLMs on dental radiographs and shows clear performance gaps versus clinicians.

  2. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

    cs.CL 2026-05 unverdicted novelty 7.0

    OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.

Reference graph

Works this paper leans on

2 extracted references · 1 canonical work page · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Such tasks provide rich evidence of student thinking, yet are time-consuming to score and susceptible to rater variability, particularly in high-enrollment gateway courses

    Introduction: Constructed-response tasks play a central role in STEM education by eliciting students’ conceptual reasoning and problem-solving processes (Neumann et al., 2013; Pellegrino, 2012). Such tasks provide rich evidence of student thinking, yet are time-consuming to score and susceptible to rater variability, particularly in high-enrollment gateway...

  2. [2]

    GPT-4 Technical Report

    Discussion: This study examined the reliability of generative AI–assisted scoring for physics constructed responses under different rubric structures, prompting formats, and temperature conditions. Across two rounds of scoring, results show that GPT-4o produces scores that align closely with human ratings when student responses are unambiguous (either clea...