Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
Pith reviewed 2026-06-30 23:16 UTC · model grok-4.3
The pith
AI scoring models agree with experts on fully correct or incorrect short answers but degrade on mid-range ones, with degradation worst in few-shot LLMs and least in fine-tuned encoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best.
What carries the argument
Quality-conditioned agreement: the measured drop in model-expert alignment on mid-range student responses, as a function of the model's degree of task-specific adaptation.
If this is right
- Mid-range student responses may receive inequitable automated evaluation compared with clear right or wrong answers.
- Fine-tuned encoder models achieve the most stable agreement across quality levels among the systems tested.
- Increasing the amount of task-specific data supplied to LLMs reduces the severity of mid-range degradation.
- Human raters maintain high agreement at every quality level, setting a benchmark that adapted models can approach.
- Automated short-answer scoring systems should be evaluated separately on each quality band rather than by overall agreement alone.
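A per-band evaluation of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the paper's procedure: the banding rule (expert score 0 is "incorrect", the maximum score is "correct", everything else is "mid-range"), the function names, and the toy scores are all assumptions made for the example, and exact agreement stands in for whatever agreement statistic the paper reports.

```python
from collections import defaultdict

def quality_band(expert_score, max_score):
    """Assumed banding rule: 0 -> incorrect, max_score -> correct, else mid-range."""
    if expert_score == 0:
        return "incorrect"
    if expert_score == max_score:
        return "correct"
    return "mid-range"

def agreement_by_band(expert_scores, model_scores, max_score):
    """Return {band: (exact-agreement rate, n)}, with bands defined by the expert score."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for e, m in zip(expert_scores, model_scores):
        band = quality_band(e, max_score)
        counts[band] += 1
        hits[band] += int(e == m)
    return {band: (hits[band] / counts[band], counts[band]) for band in counts}

# Toy scores, invented for illustration only (not data from the paper).
expert = [0, 0, 1, 1, 2, 2, 3, 3]
model = [0, 0, 2, 1, 1, 2, 3, 3]

result = agreement_by_band(expert, model, max_score=3)
overall = sum(int(e == m) for e, m in zip(expert, model)) / len(expert)

for band, (rate, n) in sorted(result.items()):
    print(f"{band}: agreement={rate:.2f} (n={n})")
print(f"overall: agreement={overall:.2f}")
```

On the toy data the overall agreement is 0.75, while the mid-range band sits at 0.50 and both extreme bands at 1.00, which is the kind of gap an overall figure alone would hide.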
Where Pith is reading between the lines
- Deploying few-shot LLMs for classroom scoring without additional adaptation could widen outcome differences for students whose answers sit in the middle of the quality range.
- The pattern suggests that any automated scorer used for high-stakes partial-credit items will need explicit checks on mid-range fairness.
- The result invites direct comparison of the same models on short-answer items from other disciplines to test whether the adaptation effect is domain-general.
Load-bearing premise
Scores from one biology education expert form reliable ground truth against which every model and human rater is judged, including on the nuanced mid-range answers.
What would settle it
A replication study that collects scores from several independent biology experts on the same responses and finds no systematic mid-range agreement gap between few-shot LLMs and fine-tuned encoders.
Original abstract
Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates quality-conditioned agreement in automated short answer scoring (ASAS) for two open-ended biology items. It compares few-shot LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5), a fine-tuned BERT encoder, and a human expert on several hundred student responses scored by one biology education expert. The central claim is that all AI models exhibit substantial degradation in agreement on mid-range quality responses while performing well on fully correct and fully incorrect responses; this degradation is most severe under limited task-specific adaptation (few-shot with few examples) and decreases with more adaptation or fine-tuning, whereas human-human agreement remains highest and stable across the quality spectrum. The authors conclude that this pattern may lead to inequitable evaluation of responses indicating developing understanding.
Significance. If the empirical patterns hold after validation of ground-truth stability, the result would be significant for the ASAS literature as it transitions toward LLMs. It supplies concrete evidence that task-specific adaptation modulates fairness across response quality levels and identifies mid-range responses as a locus of risk, which is directly relevant to deployment decisions in educational assessment.
Major comments (2)
- [Abstract] The central claim of mid-range degradation (and its modulation by task-specific adaptation) is computed against scores from a single biology education expert. No inter-rater reliability statistics, second annotator, or mid-range-specific human agreement figures are supplied. If expert variance is high on partial-credit items, the observed model degradation is partly confounded by label noise rather than model behavior; this is load-bearing for the quality-conditioned claim.
- [Abstract] The statement that 'human-human agreement is highest and stable across the full quality spectrum' is presented without quantitative values, statistical tests, per-quality-bin breakdowns, or sample sizes per category. The abstract likewise supplies no agreement coefficients, confidence intervals, or per-bin counts for the AI models, preventing verification of the magnitude or statistical reliability of the claimed 'substantial degradation'.
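The label-noise concern in the first comment can be made concrete with a small simulation. The sketch below is purely hypothetical: it assumes a three-level score scale, a scorer that always recovers the "true" score, and an expert whose labels are perturbed only on mid-range items with probability `mid_noise`. None of these values come from the paper.

```python
import random

def simulate(n_per_band=1000, mid_noise=0.2, seed=0):
    """Agreement of a perfect scorer with an expert who is noisy only on mid-range items."""
    rng = random.Random(seed)
    agree = {}
    for band, true_score in [("incorrect", 0), ("mid-range", 1), ("correct", 2)]:
        hits = 0
        for _ in range(n_per_band):
            expert = true_score
            # Assumption: the expert mislabels mid-range items by one level, at rate mid_noise.
            if band == "mid-range" and rng.random() < mid_noise:
                expert = true_score + rng.choice([-1, 1])
            model = true_score  # hypothetical scorer that always matches the true score
            hits += int(model == expert)
        agree[band] = hits / n_per_band
    return agree

agree = simulate()
for band, rate in agree.items():
    print(f"{band}: agreement with expert = {rate:.2f}")
```

Under these assumptions the scorer makes no errors at all, yet its measured agreement drops on the mid-range band alone. This is why mid-range-specific human agreement figures are needed before the degradation can be attributed to the models.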
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying areas where the abstract requires greater quantitative support. We respond to each major comment below.
- Referee: [Abstract] The central claim of mid-range degradation (and its modulation by task-specific adaptation) is computed against scores from a single biology education expert. No inter-rater reliability statistics, second annotator, or mid-range-specific human agreement figures are supplied. If expert variance is high on partial-credit items, the observed model degradation is partly confounded by label noise rather than model behavior; this is load-bearing for the quality-conditioned claim.
  Authors: We acknowledge that all ground-truth labels were assigned by a single biology education expert. A second rater scored a subset of responses to compute inter-rater reliability; those statistics and the associated per-bin human agreement figures appear in the full results section but were omitted from the abstract. We will revise the abstract to report the inter-rater reliability coefficient, the size of the double-scored subset, and the quality-bin-specific human agreement values. We will also add an explicit limitations paragraph discussing the possibility of label noise on partial-credit items and its implications for interpreting model degradation. Revision: yes.
- Referee: [Abstract] The statement that 'human-human agreement is highest and stable across the full quality spectrum' is presented without quantitative values, statistical tests, per-quality-bin breakdowns, or sample sizes per category. The abstract likewise supplies no agreement coefficients, confidence intervals, or per-bin counts for the AI models, preventing verification of the magnitude or statistical reliability of the claimed 'substantial degradation'.
  Authors: We agree that the abstract currently lacks the supporting numbers. The full manuscript contains quadratic-weighted kappa values, 95% confidence intervals, per-bin sample sizes, and statistical comparisons for both human-human and model-human agreement. We will revise the abstract to include the key quantitative results (e.g., human-human QWK across bins, model QWK values with CIs, and bin counts) so that readers can directly assess the magnitude and reliability of the reported patterns. Revision: yes.
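For readers unfamiliar with the statistics named in the rebuttal, the following is a minimal sketch of quadratic-weighted kappa (QWK) with a percentile bootstrap confidence interval, written in plain Python. It assumes integer scores in the range 0 to `n_categories - 1`; the function names, the bootstrap settings, and the toy scores are illustrative assumptions, and the paper may compute its intervals differently.

```python
import random

def quadratic_weighted_kappa(a, b, n_categories):
    """QWK between two integer score lists with categories 0..n_categories-1."""
    n = len(a)
    observed = [[0] * n_categories for _ in range(n_categories)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_categories)) for j in range(n_categories)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_categories):
        for j in range(n_categories):
            w = (i - j) ** 2 / (n_categories - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    return 1.0 - num / den if den > 0 else float("nan")

def bootstrap_ci(a, b, n_categories, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for QWK, resampling responses with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = quadratic_weighted_kappa([a[i] for i in idx], [b[i] for i in idx], n_categories)
        if k == k:  # drop undefined (NaN) resamples
            stats.append(k)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi

# Toy scores, invented for illustration only (not data from the paper).
expert = [0, 0, 1, 1, 2, 2, 3, 3]
model = [0, 0, 2, 1, 1, 2, 3, 3]
kappa = quadratic_weighted_kappa(expert, model, n_categories=4)
lo, hi = bootstrap_ci(expert, model, n_categories=4)
print(f"QWK = {kappa:.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```

Applying the same two functions to the subset of responses in each quality bin would yield the per-bin QWK values and intervals that the authors promise to report. With small bins the intervals will be wide, which is itself useful information about how much weight the mid-range comparison can bear.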
Circularity Check
No circularity: direct empirical measurement against external human scores
The paper reports an empirical comparison of model-human agreement metrics across quality bins on two biology items, using scores from one expert as fixed ground truth. No derivations, equations, fitted parameters, or predictions appear; the central claim (mid-range degradation modulated by adaptation level) is a direct observation of agreement statistics, not a reduction to any self-referential input or self-citation chain. The single-expert ground truth is an external benchmark assumption whose reliability is a separate validity concern, not a circularity issue within the reported results.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Human expert scores serve as reliable ground truth for measuring model agreement.