Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Abigail Victoria Gurin Schleifer; Asaf Salman; Beata Beigman Klebanov; Giora Alexandron; Moriah Ariely

arxiv: 2605.07647 · v2 · pith:FMNA6MFMnew · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-30 23:16 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords automated short answer scoring · quality-conditioned agreement · mid-range degradation · task-specific adaptation · large language models · fine-tuned encoders · biology education assessment

The pith

AI scoring models agree with experts on fully correct or incorrect short answers but degrade on mid-range ones, with degradation worst in few-shot LLMs and least in fine-tuned encoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares agreement between expert scores and outputs from several AI systems on open-ended biology short answers. All models match experts closely at the high and low ends of the quality spectrum yet show clear drops in agreement for responses that are only partially correct. The size of this drop depends on how much task-specific adaptation the model has received: few-shot prompting of large language models produces the largest gaps, while fine-tuning an encoder model produces the smallest. Human experts maintain steady agreement across the entire spectrum. The finding matters because mid-range answers often reflect developing student understanding, so uneven model performance could produce systematically different treatment for those responses.

Core claim

All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best.

What carries the argument

Quality-conditioned agreement: model-expert agreement measured separately at each level of response quality. The key quantity is the drop in agreement on mid-range student responses, viewed as a function of the model's degree of task-specific adaptation.

If this is right

Mid-range student responses may receive inequitable automated evaluation compared with clear right or wrong answers.
Fine-tuned encoder models achieve the most stable agreement across quality levels among the systems tested.
Increasing the amount of task-specific data supplied to LLMs reduces the severity of mid-range degradation.
Human raters maintain high agreement at every quality level, setting a benchmark that adapted models can approach.
Automated short-answer scoring systems should be evaluated separately on each quality band rather than by overall agreement alone.
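
The last point can be illustrated with a minimal sketch. It assumes integer scores, a hypothetical three-way banding by expert score (0 is "low", the maximum score is "high", anything between is "mid"), and exact agreement as the metric. None of these choices is taken from the paper, and the toy scores are invented for illustration.

```python
from collections import defaultdict


def band_of(score, max_score):
    """Map an expert score to a coarse quality band (hypothetical banding)."""
    if score == 0:
        return "low"
    if score == max_score:
        return "high"
    return "mid"


def agreement_by_band(expert, model, max_score):
    """Return {band: (n_responses, exact_agreement_rate)}, conditioned on the expert score."""
    if len(expert) != len(model):
        raise ValueError("expert and model score lists must have the same length")
    hits = defaultdict(int)
    counts = defaultdict(int)
    for e, m in zip(expert, model):
        band = band_of(e, max_score)
        counts[band] += 1
        hits[band] += int(e == m)
    return {band: (counts[band], hits[band] / counts[band]) for band in counts}


# Toy data, invented for illustration only (not from the paper).
expert = [0, 0, 0, 1, 1, 2, 2, 3, 3, 3]
model = [0, 0, 0, 2, 1, 1, 2, 3, 3, 3]

overall = sum(e == m for e, m in zip(expert, model)) / len(expert)
per_band = agreement_by_band(expert, model, max_score=3)

print("overall:", overall)
for band in ("low", "mid", "high"):
    print(band, per_band[band])
```

On this toy data the overall rate looks acceptable while every disagreement sits in the mid band, which is the kind of pattern an aggregate figure would hide.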

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Deploying few-shot LLMs for classroom scoring without additional adaptation could widen outcome differences for students whose answers sit in the middle of the quality range.
The pattern suggests that any automated scorer used for high-stakes partial-credit items will need explicit checks on mid-range fairness.
The result invites direct comparison of the same models on short-answer items from other disciplines to test whether the adaptation effect is domain-general.

Load-bearing premise

Scores from one biology education expert form reliable ground truth against which every model and human rater is judged, including on the nuanced mid-range answers.

What would settle it

A replication study that collects scores from several independent biology experts on the same responses and finds no systematic mid-range agreement gap between few-shot LLMs and fine-tuned encoders.

Figures

Figures reproduced from arXiv: 2605.07647 by Abigail Victoria Gurin Schleifer, Asaf Salman, Beata Beigman Klebanov, Giora Alexandron, Moriah Ariely.

Original abstract

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows mid-range scoring degradation in ASAS that improves with more task adaptation, but the single-expert ground truth leaves the size of the effect hard to trust.

The core observation is that few-shot LLMs lose more agreement on mid-quality student answers than on clearly correct or incorrect ones, and that this gap shrinks as models get more task-specific data, with fine-tuned encoders doing best. Human agreement stays stable across quality levels. The authors test this on two biology items with several hundred responses.

This is a straightforward empirical extension of agreement analysis to quality bins and adaptation level. It flags a practical fairness issue for partial-credit responses that matter in real classrooms.

The main limitation is the ground truth. All comparisons use scores from one biology education expert, with no inter-rater reliability numbers or second annotator, especially on the mid-range items where nuance is highest. If that expert's judgments vary on partial answers, the reported model degradation mixes label noise with model behavior. The abstract also gives no actual agreement values, bin sizes, or statistical tests, so the magnitude stays unclear.
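
The label-noise concern can be made concrete with a small simulation. This is a sketch under stated assumptions, not a model of the paper's data: scores run from 0 to 3, a hypothetical scorer reproduces a latent "true" score exactly, and the simulated expert swaps 1 and 2 with probability `flip_prob` while labelling 0 and 3 without error. All names and parameter values are illustrative.

```python
import random


def simulate(n_per_score=500, flip_prob=0.3, seed=0):
    """Measure per-band agreement of a perfect scorer against noisy expert labels."""
    rng = random.Random(seed)
    true_scores = [s for s in (0, 1, 2, 3) for _ in range(n_per_score)]
    # Hypothetical scorer that matches the latent score on every response.
    model_scores = list(true_scores)

    expert_scores = []
    for s in true_scores:
        if s in (1, 2) and rng.random() < flip_prob:
            expert_scores.append(3 - s)  # noisy mid-range label: swap 1 <-> 2
        else:
            expert_scores.append(s)

    def rate(band_scores):
        # Exact agreement among responses whose expert label falls in the band.
        pairs = [(e, m) for e, m in zip(expert_scores, model_scores) if e in band_scores]
        return sum(e == m for e, m in pairs) / len(pairs)

    return {"low": rate({0}), "mid": rate({1, 2}), "high": rate({3})}


noisy = simulate()
clean = simulate(flip_prob=0.0)
print("noisy expert:", noisy)
print("clean expert:", clean)
```

In this setup the scorer never errs, yet its measured mid-range agreement falls below its low and high agreement whenever the expert labels are noisy. A per-band inter-rater figure is what separates this effect from genuine model degradation.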

The work is aimed at people building or auditing automated scoring systems in education. It is worth sending to referees who can examine the full data tables and methods for reproducibility, even though the current evidence is preliminary.

Referee Report

2 major / 0 minor

Summary. The manuscript investigates quality-conditioned agreement in automated short answer scoring (ASAS) for two open-ended biology items. It compares few-shot LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5), a fine-tuned BERT encoder, and human experts on several hundred student responses scored by one biology education expert. The central claim is that all AI models exhibit substantial degradation in agreement on mid-range quality responses while performing well on fully correct and fully incorrect responses; this degradation is most severe under limited task-specific adaptation (few-shot with few examples) and decreases with more adaptation or fine-tuning, whereas human-human agreement remains highest and stable across the quality spectrum. The authors conclude that this pattern may lead to inequitable evaluation of responses indicating developing understanding.

Significance. If the empirical patterns hold after validation of ground-truth stability, the result would be significant for the ASAS literature as it transitions toward LLMs. It supplies concrete evidence that task-specific adaptation modulates fairness across response quality levels and identifies mid-range responses as a locus of risk, which is directly relevant to deployment decisions in educational assessment.

major comments (2)

[Abstract] The central claim of mid-range degradation (and its modulation by task-specific adaptation) is computed against scores from a single biology education expert. No inter-rater reliability statistics, second annotator, or mid-range-specific human agreement figures are supplied. If expert variance is high on partial-credit items, the observed model degradation is partly confounded by label noise rather than model behavior; this is load-bearing for the quality-conditioned claim.
[Abstract] The statement that 'human-human agreement is highest and stable across the full quality spectrum' is presented without quantitative values, statistical tests, per-quality-bin breakdowns, or sample sizes per category. The abstract likewise supplies no agreement coefficients, confidence intervals, or per-bin counts for the AI models, preventing verification of the magnitude or statistical reliability of the claimed 'substantial degradation'.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying areas where the abstract requires greater quantitative support. We respond to each major comment below.

Referee: [Abstract] The central claim of mid-range degradation (and its modulation by task-specific adaptation) is computed against scores from a single biology education expert. No inter-rater reliability statistics, second annotator, or mid-range-specific human agreement figures are supplied. If expert variance is high on partial-credit items, the observed model degradation is partly confounded by label noise rather than model behavior; this is load-bearing for the quality-conditioned claim.

Authors: We acknowledge that all ground-truth labels were assigned by a single biology education expert. A second rater scored a subset of responses to compute inter-rater reliability; those statistics and the associated per-bin human agreement figures appear in the full results section but were omitted from the abstract. We will revise the abstract to report the inter-rater reliability coefficient, the size of the double-scored subset, and the quality-bin-specific human agreement values. We will also add an explicit limitations paragraph discussing the possibility of label noise on partial-credit items and its implications for interpreting model degradation.
Revision: yes

Referee: [Abstract] The statement that 'human-human agreement is highest and stable across the full quality spectrum' is presented without quantitative values, statistical tests, per-quality-bin breakdowns, or sample sizes per category. The abstract likewise supplies no agreement coefficients, confidence intervals, or per-bin counts for the AI models, preventing verification of the magnitude or statistical reliability of the claimed 'substantial degradation'.

Authors: We agree that the abstract currently lacks the supporting numbers. The full manuscript contains quadratic-weighted kappa values, 95% confidence intervals, per-bin sample sizes, and statistical comparisons for both human-human and model-human agreement. We will revise the abstract to include the key quantitative results (e.g., human-human QWK across bins, model QWK values with CIs, and bin counts) so that readers can directly assess the magnitude and reliability of the reported patterns.
Revision: yes
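
For readers unfamiliar with the statistics named in this response, the following is a minimal sketch of quadratic-weighted kappa (QWK) with a percentile-bootstrap confidence interval. It uses the textbook definition of QWK; the bootstrap settings, function names, and toy scores are assumptions made for illustration, and nothing here is confirmed to match the authors' procedure.

```python
import random


def quadratic_weighted_kappa(a, b, num_levels):
    """Cohen's kappa with quadratic weights for integer scores in 0..num_levels-1."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("score lists must be non-empty and equally long")
    if num_levels < 2:
        raise ValueError("need at least two score levels")
    observed = [[0] * num_levels for _ in range(num_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]
    num = den = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            w = (i - j) ** 2 / (num_levels - 1) ** 2
            num += w * observed[i][j]          # weighted observed disagreement
            den += w * row[i] * col[j] / n     # weighted disagreement expected by chance
    # den is zero when both raters use a single identical score: kappa is undefined.
    return float("nan") if den == 0 else 1.0 - num / den


def bootstrap_ci(a, b, num_levels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for QWK, resampling responses with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = quadratic_weighted_kappa([a[i] for i in idx], [b[i] for i in idx], num_levels)
        if k == k:  # drop NaN values from degenerate resamples
            stats.append(k)
    if not stats:
        raise ValueError("every bootstrap resample was degenerate")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Toy data, invented for illustration only (not from the paper).
expert = [0, 0, 1, 1, 2, 2, 3, 3, 3, 0]
model = [0, 0, 1, 2, 2, 1, 3, 3, 3, 0]
print("QWK:", quadratic_weighted_kappa(expert, model, num_levels=4))
print("95% CI:", bootstrap_ci(expert, model, num_levels=4))
```

One design caveat: if a quality bin is defined by a single expert score, one rater's scores are constant within that bin and chance-corrected coefficients such as QWK become uninformative there. Per-bin reporting may therefore need exact or adjacent agreement, or bins that span several score levels.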

Circularity Check

0 steps flagged

No circularity: direct empirical measurement against external human scores

The paper reports an empirical comparison of model-human agreement metrics across quality bins on two biology items, using scores from one expert as fixed ground truth. No derivations, equations, fitted parameters, or predictions appear; the central claim (mid-range degradation modulated by adaptation level) is a direct observation of agreement statistics, not a reduction to any self-referential input or self-citation chain. The single-expert ground truth is an external benchmark assumption whose reliability is a separate validity concern, not a circularity issue within the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a single expert's scores are a stable ground truth and that the two biology items are representative of short-answer scoring tasks.

axioms (1)

Domain assumption: Human expert scores serve as reliable ground truth for measuring model agreement.
Invoked when the abstract states that ground truth scores were provided by a biology education expert and used to compute all agreement figures.
