LLM-as-a-judge validity in physics assessment depends more on the task than the model

Elise Agra; Paul Mackay; Tom Hardy; Will Yeadon

arxiv: 2603.14732 · v2 · pith:TGXXZZE4new · submitted 2026-03-16 · ⚛️ physics.ed-ph · cs.CL




keywords human, agreement, marking, rank-order, across, assessment, models, task


abstract

As large language models (LLMs) are increasingly considered for automated assessment and feedback, understanding when LLM marking is valid is essential. We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers under blind, solution-provided, false-solution, and anchored conditions. We distinguish absolute accuracy from rank-order agreement, since a marking system can match the distribution of human marks while failing to order responses by quality. Across task types, performance is sharply task-dependent. For blind university exam questions ($n=771$) and secondary and university structured questions ($n=1151$), models show robust rank-order agreement with human markers (Spearman $\rho > 0.6$), with official solutions reducing error and strengthening agreement. False solutions degrade absolute accuracy, showing that models defer to provided references, but leave rank-ordering intact. Essay marking behaves fundamentally differently. Across $n=55$ scripts ($n=275$ essays), blind AI marking is harsher and more variable than human marking and adding a mark scheme does not improve rank-order agreement. Anchored exemplars shift the AI mean close to the human mean and compress variance below the human standard deviation, but rank-order agreement remains near-zero. For code-based plot elements ($n=1400$), models achieve high rank-order agreement ($\rho > 0.84$) with near-linear calibration. Across all task types, validity tracks the structure of the assessment task - the extent to which marks can be mapped to explicit, observable grading features - and the reliability of the human benchmark, rather than raw model capability.
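The abstract's distinction between absolute accuracy and rank-order agreement can be illustrated with a small sketch. This is not the authors' analysis code: the marks below are invented, the helper names (`average_ranks`, `pearson`, `spearman_rho`, `mean_absolute_error`) are hypothetical, and mean absolute error is assumed here as a stand-in for whatever accuracy measure the paper uses. Spearman's $\rho$ is computed as the Pearson correlation of average ranks.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_absolute_error(x, y):
    """Mean absolute difference between paired marks."""
    return mean(abs(a - b) for a, b in zip(x, y))


# Invented marks out of 10 for five responses (illustration only).
human = [2, 4, 5, 6, 8]
judge_shifted = [4, 6, 7, 8, 10]   # consistently 2 marks off, same ordering
judge_reversed = [8, 6, 5, 4, 2]   # same set of marks as human, ordering reversed

for name, judge in [("shifted", judge_shifted), ("reversed", judge_reversed)]:
    print(name, mean_absolute_error(human, judge), spearman_rho(human, judge))
```

In this toy case the shifted judge has a nonzero error but orders the responses exactly as the human does, whereas the reversed judge reproduces the human mark distribution exactly while ranking the responses in the opposite order.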



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Safeguarding LLM Agents from Misalignment through Provenance Analysis
cs.CL 2026-05 unverdicted novelty 6.0

ProvenanceGuard applies a provenance-based framework to detect three types of misalignment in LLM agent tool calls, cutting error rates on misaligned traces from 42.9% to 1.8% on one benchmark while lowering unnecessa...