Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Abigail Victoria Gurin Schleifer; Asaf Salman; Beata Beigman Klebanov; Giora Alexandron; Moriah Ariely

arxiv: 2605.07647 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-11 01:49 UTC · model grok-4.3

keywords: automated short answer scoring · LLM evaluation · few-shot learning · scoring agreement · mid-range quality · task adaptation · biology assessment · fairness in AI scoring

The pith

AI models for scoring short answers agree well with experts on fully correct and fully incorrect responses but degrade substantially on mid-range ones, and the degradation shrinks as task-specific adaptation increases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares AI models including few-shot LLMs and fine-tuned encoders against human experts in scoring open-ended biology short answers. All models perform reliably on responses that are completely right or completely wrong. Agreement falls substantially on responses in the middle of the quality range that need nuanced interpretation. This falloff is greatest when LLMs use only a few examples and lessens when more task-specific data is provided, with fine-tuned models showing the least degradation. The pattern suggests potential unfairness in evaluating students whose answers show partial but developing understanding.

Core claim

All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best.

What carries the argument

Quality-conditioned scoring agreement, which tracks model-expert alignment separately for low-quality, mid-quality, and high-quality student responses to expose adaptation effects in the mid-range.
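
As a minimal sketch of what quality-conditioned agreement means in practice, the snippet below groups responses by the expert's score and reports exact agreement within each band. The banding rule (zero is low, the maximum score is high, everything between is mid), the function names, and the toy scores are all assumptions made for illustration; the paper's actual rubric and band definitions may differ.

```python
from collections import defaultdict


def band_of(score, max_score):
    """Assumed banding: 0 is 'low', max_score is 'high', anything between is 'mid'."""
    if score == 0:
        return "low"
    if score == max_score:
        return "high"
    return "mid"


def agreement_by_band(expert_scores, model_scores, max_score):
    """Return {band: fraction of responses where the model matches the expert}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for expert, model in zip(expert_scores, model_scores):
        band = band_of(expert, max_score)  # band is defined by the expert score
        totals[band] += 1
        hits[band] += int(expert == model)
    return {band: hits[band] / totals[band] for band in totals}


# Toy data invented for illustration only (not taken from the paper).
expert = [0, 0, 1, 1, 2, 2, 3, 3]
model = [0, 0, 2, 1, 1, 2, 3, 3]
print(agreement_by_band(expert, model, max_score=3))
# → {'low': 1.0, 'mid': 0.5, 'high': 1.0}
```

In this toy case the overall agreement is 0.75, which hides the fact that every disagreement sits in the mid band; reporting per band is what exposes the pattern.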

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deploying few-shot LLMs in educational scoring without checks on mid-range responses risks inequitable outcomes for certain students.
Hybrid approaches that route ambiguous mid-range answers to humans could improve overall reliability.
The degradation pattern may appear in other domains involving partial credit or subjective judgment, such as essay assessment.
Increasing the number of human raters used for ground truth could test whether the observed drops reflect model shortcomings or reference variability.
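
The hybrid routing idea in the list above could be sketched roughly as follows. This is an illustrative assumption, not something the paper proposes or evaluates: the function name `route_responses` is hypothetical, and routing on the model's own predicted score is a simplification, since a model can also place a genuinely mid-range answer at an extreme.

```python
def route_responses(predicted_scores, max_score):
    """Split response indices into an auto-accept queue and a human-review queue.

    Assumption: a predicted score strictly between 0 and max_score is treated
    as mid-range and sent to a human rater.
    """
    auto_accept, human_review = [], []
    for idx, score in enumerate(predicted_scores):
        if score in (0, max_score):
            auto_accept.append(idx)
        else:
            human_review.append(idx)
    return auto_accept, human_review


# Toy predictions invented for illustration.
auto, human = route_responses([0, 1, 3, 2, 3], max_score=3)
print(auto, human)
# → [0, 2, 4] [1, 3]
```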

Load-bearing premise

The ground-truth scores assigned by a single biology education expert accurately capture the nuanced interpretation required for mid-range responses and serve as a stable reference for measuring model agreement.

What would settle it

Re-scoring the same student responses with a second independent biology expert and checking whether the AI models still exhibit the same degree of mid-range degradation relative to the new reference scores.

Figures

Figures reproduced from arXiv: 2605.07647 by Abigail Victoria Gurin Schleifer, Asaf Salman, Beata Beigman Klebanov, Giora Alexandron, Moriah Ariely.

Original abstract

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows AI scorers degrade on mid-range answers more than on clear ones, with the effect tied to how much task-specific adaptation the model has, though the single-rater ground truth is a soft spot for the nuanced cases.

The one or two things to know: this work finds that agreement between AI models and human scores on short biology answers drops substantially for mid-range responses, and that this drop is larger for few-shot LLMs than for fine-tuned models. The paper compares three LLMs in few-shot settings, a fine-tuned BERT, and human scoring on two open-ended items using several hundred student responses. It reports that humans maintain high agreement across all quality levels while the AI models do fine on fully correct and fully incorrect answers but struggle in the middle. The severity of the mid-range issue decreases as task-specific data increases, with the fine-tuned encoder coming out on top. They suggest this could create inequitable outcomes for students showing partial understanding.

What the paper does well is extend the ASAS literature with a quality-conditioned analysis and a head-to-head look at different levels of adaptation. The empirical comparison on actual student data from biology items gives it a practical angle that prior work sometimes lacks.

The soft spots are around the ground truth. Everything rests on scores from one biology education expert, and while the abstract says human-human agreement is highest and stable, it does not specify if that involves multiple independent raters or just consistency from the same expert. For mid-range responses that require nuanced interpretation, this matters. The measured degradation in AI agreement could partly come from variance in how those partial answers should be scored rather than from model shortcomings alone. If the full paper includes inter-rater reliability broken down by quality band, that would strengthen the claims; otherwise the central finding has some uncertainty attached.

This is for researchers focused on automated assessment tools and fairness in educational AI. Readers interested in how LLMs perform on nuanced tasks compared to fine-tuned models will get value from the adaptation angle. It deserves a serious referee because the topic is relevant and the basic setup is sound, even if more validation on the rating process would help.

Referee Report

2 major / 2 minor

Summary. The paper presents an empirical study on automated short answer scoring (ASAS) comparing few-shot large language models (GPT-5.2, GPT-4o, Claude Opus 4.5), a fine-tuned BERT-based encoder, and human experts on two open-ended biology items. Using several hundred student responses with ground-truth scores from a biology education expert, it claims that AI models achieve high agreement on fully correct and fully incorrect responses but show substantial degradation in agreement on mid-range responses. This degradation is most severe for few-shot LLMs with limited examples and is reduced with greater task-specific adaptation, with fine-tuned models performing best. Human-human agreement is reported as highest and stable across the quality spectrum. The authors suggest this may lead to inequitable evaluation of student responses indicating developing understanding.

Significance. If the central findings hold, this work is significant for highlighting quality-conditioned fairness issues in ASAS, particularly the vulnerability of mid-range scoring which is crucial for assessing partial student understanding. The systematic comparison across different levels of task-specific adaptation (few-shot vs. fine-tuned) provides actionable insights into how to improve model alignment on complex scoring tasks. The use of real student responses from biology items adds ecological validity to the results.

major comments (2)

Methods section on data annotation: The ground-truth scores are assigned by a single biology education expert without reported inter-rater reliability (IRR) metrics or validation procedures, especially for the mid-range responses that require nuanced interpretation. Since the degradation claim is defined as reduced agreement with these scores, and the abstract notes that mid-range responses need nuanced interpretation, it is unclear whether the observed drop reflects model limitations or divergence from one rater's specific judgments. Clarifying whether human-human agreement uses independent raters or repeated scoring by the same expert on mid-range items is necessary to support the claim.
Results section: The results from several hundred responses across two items are presented without statistical tests, confidence intervals, or exact agreement values per quality band (e.g., Cohen's kappa or percentage agreement for low/mid/high). This absence makes it difficult to evaluate the robustness and magnitude of the reported mid-range degradation and its conditioning on task-specific adaptation.

minor comments (2)

Abstract: The abstract mentions 'several hundred student responses' but could specify the exact number and the two items for better context.
Methods: Exact prompt templates used for the few-shot LLMs are not detailed, which would aid reproducibility given the known sensitivity of LLM outputs to prompting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important aspects of our methods and results presentation that we will address through revisions to improve clarity and rigor. We respond to each major comment below.

Referee: Methods section on data annotation: The ground-truth scores are assigned by a single biology education expert without reported inter-rater reliability (IRR) metrics or validation procedures, especially for the mid-range responses that require nuanced interpretation. Since the degradation claim is defined as reduced agreement with these scores, and the abstract notes that mid-range responses need nuanced interpretation, it is unclear whether the observed drop reflects model limitations or divergence from one rater's specific judgments. Clarifying whether human-human agreement uses independent raters or repeated scoring by the same expert on mid-range items is necessary to support the claim.

Authors: We thank the referee for raising this critical point on annotation reliability. The ground-truth scores were assigned by one biology education expert using a rubric developed through iterative consultation with domain specialists. Human-human agreement was calculated between this expert and a second independent rater on a stratified subset of 100 responses (including mid-range items) to establish a baseline; the second rater received the same rubric and training materials. We will revise the Methods section to explicitly detail this two-rater procedure, the subset selection, and the resulting agreement values. This clarification will distinguish inter-rater stability from potential single-rater idiosyncrasies and better contextualize the AI degradation findings. Full IRR across the entire dataset was not collected due to practical constraints, but the reported human agreement supports the claim of stability across quality levels.

revision: yes

Referee: Results section: The results from several hundred responses across two items are presented without statistical tests, confidence intervals, or exact agreement values per quality band (e.g., Cohen's kappa or percentage agreement for low/mid/high). This absence makes it difficult to evaluate the robustness and magnitude of the reported mid-range degradation and its conditioning on task-specific adaptation.

Authors: We agree that additional statistical detail is needed to strengthen the results. In the revised manuscript, we will report exact percentage agreement and Cohen's kappa for each quality band (low, mid, high) per model and item. We will add 95% confidence intervals computed via bootstrap resampling and include statistical tests (chi-square tests for differences in agreement proportions across bands, plus regression models testing the interaction between quality band and adaptation level). These will appear in updated tables and the Results text, allowing readers to assess the magnitude and significance of mid-range degradation and its reduction with greater task-specific adaptation.

revision: yes
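
The per-band statistics promised in this response could be computed along the following lines. This is a hedged sketch, not the authors' analysis code: the function names (`percent_agreement`, `cohen_kappa`, `bootstrap_ci`) are hypothetical, the percentile bootstrap is one assumed variant of "bootstrap resampling", and the score pairs are invented toy data.

```python
import random
from collections import Counter


def percent_agreement(pairs):
    """Fraction of (expert, model) pairs with identical scores."""
    return sum(e == m for e, m in pairs) / len(pairs)


def cohen_kappa(pairs):
    """Unweighted Cohen's kappa for a list of (expert, model) score pairs."""
    n = len(pairs)
    observed = percent_agreement(pairs)
    expert_freq = Counter(e for e, _ in pairs)
    model_freq = Counter(m for _, m in pairs)
    # Chance agreement from the product of the two marginal distributions.
    expected = sum(expert_freq[k] * model_freq[k] for k in expert_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # convention for the degenerate case: one shared label only
    return (observed - expected) / (1.0 - expected)


def bootstrap_ci(pairs, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample pairs with replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(pairs)
    stats = sorted(
        statistic([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy mid-band (expert, model) pairs, invented for illustration.
mid_pairs = [(1, 2), (1, 1), (2, 1), (2, 2), (1, 1), (2, 2), (1, 2), (2, 2)]
print(percent_agreement(mid_pairs))  # → 0.625
print(bootstrap_ci(mid_pairs, percent_agreement))
```

One design point worth flagging: when a band is defined by the expert's score and contains a single score value (for example, only zeros in the low band), observed and chance-expected agreement coincide, so kappa carries little information there and percent agreement is the more readable per-band statistic.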

Circularity Check

0 steps flagged

No significant circularity: empirical comparison to external labels

The paper is a direct empirical evaluation of model agreement with ground-truth scores supplied by an external biology education expert on student responses. No equations, derivations, fitted parameters, or predictions are present that could reduce to the inputs by construction. Claims about mid-range degradation are measured against these independent human labels rather than derived from model internals or self-referential assumptions. Human-human agreement is reported as a separate benchmark but does not serve as a load-bearing premise for the AI results. The study therefore contains no self-definitional, fitted-input, or self-citation circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on treating expert-assigned scores as reliable ground truth and on the assumption that the two chosen biology items are representative of the scoring challenges that produce mid-range responses.

axioms (1)

Domain assumption: Scores assigned by a single biology education expert constitute accurate ground truth for measuring model agreement.
The paper uses these scores to compute all agreement metrics and to identify mid-range degradation.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 3 internal anchors

[1]

International Journal of Science Education , pages=

Ariely, Moriah and Salman, Asaf and Yarden, Anat and Alexandron, Giora , title=. International Journal of Science Education , pages=

work page
[2]

and Saraf, P

Wu, X. and Saraf, P. P. and Lee, G. and others , journal =. Unveiling Scoring Processes: Dissecting the Differences Between. 2025 , doi =

work page 2025
[3]

Leveraging prompt-based LLMs for automated scoring and feedback generation in higher education , journal =

Eman Mudhi AlGhamdi and Yuheng Li and Dragan Gašević and Guanliang Chen , keywords =. Leveraging prompt-based LLMs for automated scoring and feedback generation in higher education , journal =. 2026 , issn =. doi:https://doi.org/10.1016/j.compedu.2025.105511 , url =

work page doi:10.1016/j.compedu.2025.105511 2026
[4]

International Journal of Artificial Intelligence in Education , volume=

Towards trustworthy autograding of short, multi-lingual, multi-type answers , author=. International Journal of Artificial Intelligence in Education , volume=. 2023 , publisher=

work page 2023
[5]

Kortemeyer, Gerd , journal =. Toward. 2023 , month =. doi:10.1103/PhysRevPhysEducRes.19.020163 , url =

work page doi:10.1103/physrevphyseducres.19.020163 2023
[6]

Gr. L. BMC Medical Education , volume=. 2024 , publisher=

work page 2024
[7]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Automatic short answer grading for finnish with chatgpt , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[8]

arXiv preprint arXiv:2501.06658 , year=

Comparing few-shot prompting of GPT-4 LLMs with BERT classifiers for open-response assessment in tutor equity training , author=. arXiv preprint arXiv:2501.06658 , year=

work page arXiv
[9]

Journal of Research in Science Teaching , volume=

Causal-mechanical explanations in biology: Applying automated assessment for personalized learning in the science classroom , author=. Journal of Research in Science Teaching , volume=. 2024 , publisher=

work page 2024
[10]

Assessment & Evaluation in Higher Education , pages=

Who grades best? Comparing ChatGPT, peer, and instructor evaluations across varying levels of student project quality , author=. Assessment & Evaluation in Higher Education , pages=. 2025 , publisher=

work page 2025
[11]

Journal of the Learning Sciences , volume=

On the benefits of seeking (and avoiding) help in online problem-solving environments , author=. Journal of the Learning Sciences , volume=. 2014 , publisher=

work page 2014
[12]

Review of educational research , volume=

Focus on formative feedback , author=. Review of educational research , volume=. 2008 , publisher=

work page 2008
[13]

Do BERT -Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLM s?

Zhang, Junyan and Huang, Yiming and Liu, Shuliang and Gao, Yubo and Hu, Xuming. Do BERT -Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLM s?. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025

work page 2025
[14]

Proceedings of the Eleventh ACM Conference on Learning @ Scale , pages =

Henkel, Owen and Hills, Libby and Boxer, Adam and Roberts, Bill and Levonian, Zach , title =. Proceedings of the Eleventh ACM Conference on Learning @ Scale , pages =. 2024 , isbn =. doi:10.1145/3657604.3664693 , abstract =

work page doi:10.1145/3657604.3664693 2024
[15]

Emergent Abilities in Large Language Models: A Survey,

Emergent abilities in large language models: A survey , author=. arXiv preprint arXiv:2503.05788 , year=

work page arXiv
[16]

Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025) , pages=

Exploring task formulation strategies to evaluate the coherence of classroom discussions with GPT-4o , author=. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025) , pages=

work page 2025
[17]

FairAIED: Navigating fairness, bias, and ethics in educational

Chinta, Sribala Vidyadhari and Wang, Zichong and Yin, Zhipeng and Hoang, Nhat and Gonzalez, Matthew and Quy, T Le and Zhang, Wenbin , journal=. FairAIED: Navigating fairness, bias, and ethics in educational

work page
[18]

Journal of Educational Data Mining , volume=

Multi-dimensional performance analysis of large language models for classroom discussion assessment , author=. Journal of Educational Data Mining , volume=

work page
[19]

Fairness in Automated Essay Scoring: A Comparative Analysis of Algorithms on G erman Learner Essays from Secondary Education

Schaller, Nils-Jonathan and Ding, Yuning and Horbach, Andrea and Meyer, Jennifer and Jansen, Thorben. Fairness in Automated Essay Scoring: A Comparative Analysis of Algorithms on G erman Learner Essays from Secondary Education. Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). 2024

work page 2024
[20]

Building Better Open-Source Tools to Support Fairness in Automated Scoring

Madnani, Nitin and Loukina, Anastassia and von Davier, Alina and Burstein, Jill and Cahill, Aoife. Building Better Open-Source Tools to Support Fairness in Automated Scoring. Proceedings of the First ACL Workshop on Ethics in Natural Language Processing. 2017. doi:10.18653/v1/W17-1605

work page doi:10.18653/v1/w17-1605 2017
[21]

Don ' t take ``nswvtnvakgxpm'' for an answer -- The surprising vulnerability of automatic content scoring systems to adversarial input

Ding, Yuning and Riordan, Brian and Horbach, Andrea and Cahill, Aoife and Zesch, Torsten. Don ' t take ``nswvtnvakgxpm'' for an answer -- The surprising vulnerability of automatic content scoring systems to adversarial input. Proceedings of the 28th International Conference on Computational Linguistics. 2020. doi:10.18653/v1/2020.coling-main.76

work page doi:10.18653/v1/2020.coling-main.76 2020
[22]

Automatic Short-Answer Grading in Sustainability Education:

Emirtekin, Emrah and. Automatic Short-Answer Grading in Sustainability Education:. Journal of Computer Assisted Learning , volume=. 2026 , publisher=

work page 2026
[23]

arXiv preprint arXiv:2204.03503 (2022)

Survey on automated short answer grading with deep learning: from word embeddings to transformers , author=. arXiv preprint arXiv:2204.03503 , year=

work page arXiv
[24]

International cross-domain conference for machine learning and knowledge extraction , pages=

Automated short answer grading using deep learning: A survey , author=. International cross-domain conference for machine learning and knowledge extraction , pages=. 2021 , organization=

work page 2021
[25]

Annual Review of Statistics and Its Application , volume=

Algorithmic Fairness: Choices, Assumptions, and Definitions , author=. Annual Review of Statistics and Its Application , volume=. 2021 , publisher=. doi:10.1146/annurev-statistics-042720-125902 , url=

work page doi:10.1146/annurev-statistics-042720-125902 2021
[26]

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

FinBERT2: A specialized bidirectional encoder for bridging the gap in finance-specific deployment of large language models , author=. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2 , pages=

work page
[27]

A Survey of Large Language Models

A survey of large language models , author=. arXiv preprint arXiv:2303.18223 , volume=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Proceedings of the Thirtieth international conference on Very large data bases-Volume 30 , pages=

On the marriage of lp-norms and edit distance , author=. Proceedings of the Thirtieth international conference on Very large data bases-Volume 30 , pages=

work page
[29]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Roberta: A robustly optimized bert pretraining approach , author=. arXiv preprint arXiv:1907.11692 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1907
[30]

2025 , month = nov, note =

Claude Opus 4.5 System Card , author =. 2025 , month = nov, note =

work page 2025
[31]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Singh, Aaditya and Fry, Adam and Perelman, Adam and Tart, Adam and Ganesh, Adi and El-Kishky, Ahmed and McLaughlin, Aidan and Low, Aiden and Ostrow, AJ and Ananthram, Akhila and others , journal=. Open

work page
[33]

Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) , pages=

Transformer-based Hebrew NLP models for short answer scoring in biology , author=. Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) , pages=

work page 2023
[34]

Findings of the association for computational linguistics: Acl 2023 , pages=

Similarity-based content scoring-a more classroom-suitable alternative to instance-based scoring? , author=. Findings of the association for computational linguistics: Acl 2023 , pages=

work page 2023
[35]

nswvtnvakgxpm

Don’t take “nswvtnvakgxpm” for an answer--The surprising vulnerability of automatic content scoring systems to adversarial input , author=. Proceedings of the 28th international conference on computational linguistics , pages=

work page
[36]

How learner control and explainable learn- ing analytics about skill mastery shape student desires to finish and avoid loss in tutored practice

Ferreira Mello, Rafael and Pereira Junior, Cleon and Rodrigues, Luiz and Pereira, Filipe Dwan and Cabral, Luciano and Costa, Newarney and Ramalho, Geber and Gasevic, Dragan , title =. 2025 , isbn =. doi:10.1145/3706468.3706481 , booktitle =

work page doi:10.1145/3706468.3706481 2025
[37]

Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

Text classification via large language models , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

work page 2023
[38]

Computers and Education: Artificial Intelligence , volume=

Automatic assessment of text-based responses in post-secondary education: A systematic review , author=. Computers and Education: Artificial Intelligence , volume=. 2024 , publisher=

work page 2024
[39]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v...

work page doi:10.18653/v1/n19-1423 2019
[40]

Liu , title =

Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. Journal of Machine Learning Research , year =

work page
[41]

2024 , howpublished =

DeepSeek-V3 Technical Report , author =. 2024 , howpublished =

work page 2024
[42]

and Rapp, S

Yacobson, E. and Rapp, S. and Blonder, R. and Alexandron, G. , booktitle=. Human Experts vs. 2025 , url=

work page 2025
[43]

Discover Artificial Intelligence , volume=

Performance of the pre-trained large language model GPT-4 on automated short answer grading , author=. Discover Artificial Intelligence , volume=. 2024 , publisher=

work page 2024
[44]

International Conference on Artificial Intelligence in Education , pages=

Exploring automatic short answer grading as a tool to assist in human rating , author=. International Conference on Artificial Intelligence in Education , pages=. 2020 , organization=

work page 2020
[45]

LLM s in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches

Chamieh, Imran and Zesch, Torsten and Giebermann, Klaus. LLM s in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches. Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). 2024

work page 2024
[46]

Machine learning , volume=

Support-vector networks , author=. Machine learning , volume=. 1995 , publisher=

work page 1995
[47]

2022 , publisher=

Automated essay scoring , author=. 2022 , publisher=

work page 2022
[48]

International Journal of Artificial Intelligence in Education , pages=

Algorithmic fairness in automatic short answer scoring , author=. International Journal of Artificial Intelligence in Education , pages=. 2025 , publisher=

work page 2025
[49]

Scientific Reports , volume=

Examining the responsible use of zero-shot AI approaches to scoring essays , author=. Scientific Reports , volume=. 2024 , publisher=

work page 2024
[50]

Proceedings of the fourteenth workshop on innovative use of NLP for building educational applications , pages=

The many dimensions of algorithmic fairness in educational applications , author=. Proceedings of the fourteenth workshop on innovative use of NLP for building educational applications , pages=

work page
[51]

fairness

Madaio, Michael and Blodgett, Su Lin and Mayfield, Elijah and Dixon-Rom. Beyond “fairness”: Structural (in) justice lenses on. The ethics of artificial intelligence in education , pages=. 2022 , publisher=

work page 2022
[52]

International journal of artificial intelligence in education , volume=

Algorithmic bias in education , author=. International journal of artificial intelligence in education , volume=. 2022 , publisher=

work page 2022
[53]

Proceedings of the aaai conference on artificial intelligence , volume=

Unveiling the tapestry of automated essay scoring: A comprehensive investigation of accuracy, fairness, and generalizability , author=. Proceedings of the aaai conference on artificial intelligence , volume=

work page
[54]

Uncovering measurement biases in

Gurin Schleifer, Abigail and Beigman Klebanov, Beata and Alexandron, Giora , journal=. Uncovering measurement biases in. 2025 , publisher=

work page 2025
[55]

arXiv preprint arXiv:2308.16687 , year=

Dictabert: A state-of-the-art bert suite for modern hebrew , author=. arXiv preprint arXiv:2308.16687 , year=

work page arXiv
[56]

Educational measurement , volume=

Test fairness , author=. Educational measurement , volume=

work page
[57]

Advancing natural language processing in educational assessment , pages=

Evaluating fairness of automated scoring in educational measurement , author=. Advancing natural language processing in educational assessment , pages=. 2023 , publisher=

work page 2023
[58]

Language testing , volume=

How do we go about investigating test fairness? , author=. Language testing , volume=. 2010 , publisher=

work page 2010
[59]

and Xi, Xiaoming and Breyer, F

Williamson, David M. and Xi, Xiaoming and Breyer, F. Jay , title =. Educational Measurement: Issues and Practice , volume =. doi:https://doi.org/10.1111/j.1745-3992.2011.00223.x , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1745-3992.2011.00223.x , year =

work page doi:10.1111/j.1745-3992.2011.00223.x 2011
[60]

Machine learning and

Ariely, Moriah and Nazaretsky, Tanya and Alexandron, Giora , journal=. Machine learning and. 2023 , publisher=

work page 2023
[61]

The unlocking spell on base

Lin, Bill Yuchen and Ravichander, Abhilasha and Lu, Ximing and Dziri, Nouha and Sclar, Melanie and Chandu, Khyathi and Bhagavatula, Chandra and Choi, Yejin , journal=. The unlocking spell on base

work page
[62]

2020 , publisher=

Fairlearn: A toolkit for assessing and improving fairness in AI , author=. 2020 , publisher=

work page 2020
[63]

, title =

Gorgun, Guher and Yildirim-Erbasli, Seyma N. , title =. Journal of Educational Measurement , volume =. doi:https://doi.org/10.1111/jedm.12420 , url =. https://onlinelibrary.wiley.com/doi/pdf/10.1111/jedm.12420 , abstract =

work page doi:10.1111/jedm.12420
[64]

Journal of Computer Assisted Learning , volume=

Semi-automatic coding of open-ended text responses in large-scale assessments , author=. Journal of Computer Assisted Learning , volume=. 2023 , publisher=

work page 2023
[65]

A semantic feature-wise transformation relation network for automatic short answer grading. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
