Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
Pith reviewed 2026-05-11 01:49 UTC · model grok-4.3
The pith
AI models for scoring short answers agree well with experts on fully correct and incorrect responses but show major degradation on mid-range ones, with less degradation after more task-specific adaptation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best.
What carries the argument
Quality-conditioned scoring agreement, which tracks model-expert alignment separately for low-quality, mid-quality, and high-quality student responses to expose adaptation effects in the mid-range.
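As a rough illustration of what tracking agreement per quality band could look like, the sketch below groups responses by the expert's score and computes exact-match agreement within each group. The banding rule (score 0 is low, the maximum score is high, anything in between is mid), the function names, and the toy scores are all assumptions made for this example; the paper's actual banding and metric are not given on this page.

```python
from collections import defaultdict


def band_of(score, max_score):
    """Map a reference score to a quality band (assumed banding rule)."""
    if score == 0:
        return "low"
    if score == max_score:
        return "high"
    return "mid"


def agreement_by_band(expert_scores, model_scores, max_score):
    """Exact-match agreement rate per band, with bands set by the expert score."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for expert, model in zip(expert_scores, model_scores):
        band = band_of(expert, max_score)
        totals[band] += 1
        hits[band] += int(expert == model)
    return {band: hits[band] / totals[band] for band in totals}


# Toy scores on a 0-2 scale, invented for illustration (not data from the paper).
expert = [0, 0, 1, 1, 1, 2, 2, 2]
model = [0, 0, 2, 1, 0, 2, 2, 2]
rates = agreement_by_band(expert, model, max_score=2)
print(rates)
```

In this toy case the model matches the expert on every low and high response but on only one of the three mid responses, which is the shape of result the core claim describes.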
Where Pith is reading between the lines
- Deploying few-shot LLMs in educational scoring without checks on mid-range responses risks inequitable outcomes for students with developing understanding, whose partially correct answers fall in the mid-range.
- Hybrid approaches that route ambiguous mid-range answers to humans could improve overall reliability.
- The degradation pattern may appear in other domains involving partial credit or subjective judgment, such as essay assessment.
- Increasing the number of human raters used for ground truth could test whether the observed drops reflect model shortcomings or reference variability.
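The hybrid-routing idea in the second point can be sketched as a simple rule. This is a hypothetical design, not something the paper describes: the function name `route`, the use of several model runs per response, and the rule itself are assumptions. A response goes to a human when the model runs disagree or when they agree on partial credit.

```python
def route(votes, max_score):
    """Decide who scores a response, given several model scores for it.

    votes: non-empty list of integer scores from repeated model runs
           (an assumed setup for this sketch).
    """
    unanimous = len(set(votes)) == 1
    if not unanimous:
        return "human"  # model runs disagree with each other
    if 0 < votes[0] < max_score:
        return "human"  # runs agree, but on partial credit
    return "auto"  # runs agree on fully correct or fully incorrect


# Hypothetical votes for four responses on a 0-2 scale.
decisions = [
    route(v, max_score=2)
    for v in ([0, 0, 0], [1, 1, 1], [2, 1, 2], [2, 2, 2])
]
print(decisions)
```

One limitation of this rule is that it relies on the model's own scores: a truly mid-range response that the model confidently scores as fully correct or fully incorrect would still be handled automatically.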
Load-bearing premise
The ground-truth scores assigned by a single biology education expert accurately capture the nuanced interpretation required for mid-range responses and serve as a stable reference for measuring model agreement.
What would settle it
Re-scoring the same student responses with a second independent biology expert and checking whether the AI models still exhibit the same degree of mid-range degradation relative to the new reference scores.
Original abstract
Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an empirical study on automated short answer scoring (ASAS) comparing few-shot large language models (GPT-5.2, GPT-4o, Claude Opus 4.5), a fine-tuned BERT-based encoder, and human experts on two biology open-ended items. Using several hundred student responses with ground-truth scores from a biology education expert, it claims that AI models achieve high agreement on fully correct and fully incorrect responses but show substantial degradation in agreement on mid-range responses. This degradation is most severe for few-shot LLMs with limited examples and is reduced with greater task-specific adaptation, with fine-tuned models performing best. Human-human agreement is reported as highest and stable across the quality spectrum. The authors suggest this may lead to inequitable evaluation of student responses indicating developing understanding.
Significance. If the central findings hold, this work is significant for highlighting quality-conditioned fairness issues in ASAS, particularly the vulnerability of mid-range scoring which is crucial for assessing partial student understanding. The systematic comparison across different levels of task-specific adaptation (few-shot vs. fine-tuned) provides actionable insights into how to improve model alignment on complex scoring tasks. The use of real student responses from biology items adds ecological validity to the results.
major comments (2)
- Methods section on data annotation: The ground-truth scores are assigned by a single biology education expert without reported inter-rater reliability (IRR) metrics or validation procedures, especially for the mid-range responses that require nuanced interpretation. Since the degradation claim is defined as reduced agreement with these scores, and the abstract notes that mid-range responses need nuanced interpretation, it is unclear whether the observed drop reflects model limitations or divergence from one rater's specific judgments. Clarifying whether human-human agreement uses independent raters or repeated scoring by the same expert on mid-range items is necessary to support the claim.
- Results section: The results from several hundred responses across two items are presented without statistical tests, confidence intervals, or exact agreement values per quality band (e.g., Cohen's kappa or percentage agreement for low/mid/high). This absence makes it difficult to evaluate the robustness and magnitude of the reported mid-range degradation and its conditioning on task-specific adaptation.
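For readers unfamiliar with the statistic named in the second comment, here is a minimal sketch of unweighted Cohen's kappa in plain Python. The function name and the toy scores are illustrative assumptions, and nothing here reflects how the authors computed agreement.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two non-empty lists of equal length")
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_chance == 1.0:
        return float("nan")  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Toy scores on a 0-2 scale, invented for illustration (not data from the paper).
expert = [0, 0, 1, 1, 2, 2]
model = [0, 0, 1, 2, 2, 2]
kappa = cohens_kappa(expert, model)
print(kappa)
```

A caveat for per-band reporting: if a band is defined by a single expert score (for example, only responses the expert scored 0), the expert's labels are constant within that band, chance agreement equals observed agreement, and kappa comes out as 0 or undefined regardless of how well the model does. Percentage agreement, or kappa over bands that span several score values, avoids that degeneracy.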
minor comments (2)
- Abstract: The abstract mentions 'several hundred student responses' but could specify the exact number and the two items for better context.
- Methods: Exact prompt templates used for the few-shot LLMs are not detailed, which would aid reproducibility given the known sensitivity of LLM outputs to prompting.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important aspects of our methods and results presentation that we will address through revisions to improve clarity and rigor. We respond to each major comment below.
- Referee: Methods section on data annotation: The ground-truth scores are assigned by a single biology education expert without reported inter-rater reliability (IRR) metrics or validation procedures, especially for the mid-range responses that require nuanced interpretation. Since the degradation claim is defined as reduced agreement with these scores, and the abstract notes that mid-range responses need nuanced interpretation, it is unclear whether the observed drop reflects model limitations or divergence from one rater's specific judgments. Clarifying whether human-human agreement uses independent raters or repeated scoring by the same expert on mid-range items is necessary to support the claim.
Authors: We thank the referee for raising this critical point on annotation reliability. The ground-truth scores were assigned by one biology education expert using a rubric developed through iterative consultation with domain specialists. Human-human agreement was calculated between this expert and a second independent rater on a stratified subset of 100 responses (including mid-range items) to establish a baseline; the second rater received the same rubric and training materials. We will revise the Methods section to explicitly detail this two-rater procedure, the subset selection, and the resulting agreement values. This clarification will distinguish inter-rater stability from potential single-rater idiosyncrasies and better contextualize the AI degradation findings. Full IRR across the entire dataset was not collected due to practical constraints, but the reported human agreement supports the claim of stability across quality levels.
revision: yes
- Referee: Results section: The results from several hundred responses across two items are presented without statistical tests, confidence intervals, or exact agreement values per quality band (e.g., Cohen's kappa or percentage agreement for low/mid/high). This absence makes it difficult to evaluate the robustness and magnitude of the reported mid-range degradation and its conditioning on task-specific adaptation.
Authors: We agree that additional statistical detail is needed to strengthen the results. In the revised manuscript, we will report exact percentage agreement and Cohen's kappa for each quality band (low, mid, high) per model and item. We will add 95% confidence intervals computed via bootstrap resampling and include statistical tests (chi-square tests for differences in agreement proportions across bands, plus regression models testing the interaction between quality band and adaptation level). These will appear in updated tables and the Results text, allowing readers to assess the magnitude and significance of mid-range degradation and its reduction with greater task-specific adaptation.
revision: yes
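The bootstrap confidence intervals promised in the second response could be computed along the following lines. This is a sketch under stated assumptions: it uses the percentile method on a per-band agreement rate, and the function name, resample count, seed, and toy data are choices made for this example rather than the authors' procedure.

```python
import random


def bootstrap_ci(matches, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an agreement rate.

    matches: non-empty list of 0/1 flags, one per response in a band,
             where 1 means the model score equals the reference score.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(matches)
    rates = []
    for _ in range(n_boot):
        # Resample responses with replacement and recompute the rate.
        resample = [matches[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(resample) / n)
    rates.sort()
    lower = rates[int((alpha / 2) * n_boot)]
    upper = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical mid-band outcome: 30 matches out of 50 responses.
matches = [1] * 30 + [0] * 20
point = sum(matches) / len(matches)
lower, upper = bootstrap_ci(matches)
print(point, lower, upper)
```

Resampling whole responses, rather than individual scores, keeps each model score paired with its reference score. With only a few dozen mid-range responses per item the interval will be wide, which is part of why the referee asks for it.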
Circularity Check
No significant circularity: empirical comparison to external labels
The paper is a direct empirical evaluation of model agreement with ground-truth scores supplied by an external biology education expert on student responses. No equations, derivations, fitted parameters, or predictions are present that could reduce to the inputs by construction. Claims about mid-range degradation are measured against these independent human labels rather than derived from model internals or self-referential assumptions. Human-human agreement is reported as a separate benchmark but does not serve as a load-bearing premise for the AI results. The study therefore contains no self-definitional, fitted-input, or self-citation circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Scores assigned by a single biology education expert constitute accurate ground truth for measuring model agreement.