arxiv: 2605.07647 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Pith reviewed 2026-05-11 01:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords automated short answer scoring · LLM evaluation · few-shot learning · scoring agreement · mid-range quality · task adaptation · biology assessment · fairness in AI scoring

The pith

AI models for scoring short answers agree well with experts on fully correct and incorrect responses but show major degradation on mid-range ones, with less degradation after more task-specific adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares AI models including few-shot LLMs and fine-tuned encoders against human experts in scoring open-ended biology short answers. All models perform reliably on responses that are completely right or completely wrong. Agreement falls substantially on responses in the middle of the quality range that need nuanced interpretation. This falloff is greatest when LLMs use only a few examples and lessens when more task-specific data is provided, with fine-tuned models showing the least degradation. The pattern suggests potential unfairness in evaluating students whose answers show partial but developing understanding.

Core claim

All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best.

What carries the argument

Quality-conditioned scoring agreement, which tracks model-expert alignment separately for low-quality, mid-quality, and high-quality student responses to expose adaptation effects in the mid-range.
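As a rough illustration of the idea, the sketch below computes exact-match agreement separately per quality band. The banding rule (a score of 0 is "low", the maximum score is "high", anything in between is "mid"), the function names, and the toy scores are all assumptions made for this example; the page does not state how the paper defines its bands or which agreement statistic it uses.

```python
from collections import defaultdict


def quality_band(expert_score, max_score):
    """Map an expert score to a band (assumed rule, not taken from the paper)."""
    if expert_score == 0:
        return "low"
    if expert_score == max_score:
        return "high"
    return "mid"


def agreement_by_band(expert_scores, model_scores, max_score):
    """Exact-match agreement rate between model and expert, per quality band."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for expert, model in zip(expert_scores, model_scores):
        band = quality_band(expert, max_score)
        totals[band] += 1
        hits[band] += int(expert == model)
    return {band: hits[band] / totals[band] for band in totals}


# Toy, made-up scores on a hypothetical 0-3 rubric (not data from the paper).
expert = [0, 0, 1, 2, 2, 1, 3, 3]
model = [0, 0, 2, 2, 1, 1, 3, 3]
result = agreement_by_band(expert, model, max_score=3)
print(result)
```

In this toy data the model matches the expert on every fully incorrect and fully correct response but on only half of the mid-range ones, which is the shape of the pattern the paper reports.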

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deploying few-shot LLMs in educational scoring without checks on mid-range responses risks inequitable outcomes for certain students.
  • Hybrid approaches that route ambiguous mid-range answers to humans could improve overall reliability.
  • The degradation pattern may appear in other domains involving partial credit or subjective judgment, such as essay assessment.
  • Increasing the number of human raters used for ground truth could test whether the observed drops reflect model shortcomings or reference variability.
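The hybrid-routing idea in the second bullet can be sketched in a few lines. This is a hypothetical design, not something the paper proposes or evaluates: it assumes the model's own predicted score is a usable proxy for "mid-range", which would itself need to be validated. The function name and the 0-to-max rubric are illustrative.

```python
def route_responses(predicted_scores, max_score):
    """Split response indices into auto-scored and human-review queues.

    Assumption: a predicted score strictly between 0 and max_score marks
    a response as mid-range and therefore worth a human look.
    """
    auto_queue, human_queue = [], []
    for idx, predicted in enumerate(predicted_scores):
        if 0 < predicted < max_score:
            human_queue.append(idx)
        else:
            auto_queue.append(idx)
    return auto_queue, human_queue


# Toy predicted scores on a hypothetical 0-3 rubric.
auto_queue, human_queue = route_responses([0, 1, 3, 2, 3], max_score=3)
print(auto_queue, human_queue)
```

A rule like this misses responses the model wrongly pushes to an extreme, so it would only reduce, not remove, the mid-range risk.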

Load-bearing premise

The ground-truth scores assigned by a single biology education expert accurately capture the nuanced interpretation required for mid-range responses and serve as a stable reference for measuring model agreement.

What would settle it

Re-scoring the same student responses with a second independent biology expert and checking whether the AI models still exhibit the same degree of mid-range degradation relative to the new reference scores.
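One way to frame that check, sketched under assumptions: measure the gap between agreement on extreme responses and agreement on mid-range responses, once against each expert's scores. The gap statistic, the banding by reference score, and the toy data are illustrative choices, not the paper's procedure.

```python
def mid_range_gap(reference, model, max_score):
    """Agreement on extreme responses minus agreement on mid-range ones.

    Bands are defined by the reference scores (assumed rule: 0 and
    max_score are extremes, everything in between is mid-range).
    """
    ext_hits = ext_n = mid_hits = mid_n = 0
    for ref, pred in zip(reference, model):
        if 0 < ref < max_score:
            mid_n += 1
            mid_hits += int(ref == pred)
        else:
            ext_n += 1
            ext_hits += int(ref == pred)
    if ext_n == 0 or mid_n == 0:
        raise ValueError("need at least one extreme and one mid-range response")
    return ext_hits / ext_n - mid_hits / mid_n


# Toy, made-up scores on a hypothetical 0-3 rubric.
expert_a = [0, 0, 1, 2, 2, 1, 3, 3]
expert_b = [0, 0, 2, 2, 1, 1, 3, 3]
model = [0, 0, 2, 2, 1, 1, 3, 3]

gap_a = mid_range_gap(expert_a, model, max_score=3)
gap_b = mid_range_gap(expert_b, model, max_score=3)
print(gap_a, gap_b)
```

If the gap stays large under the second reference, the degradation looks like a model property; if it shrinks sharply, as in this contrived example, it points toward disagreement between raters.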

Figures

Figures reproduced from arXiv: 2605.07647 by Abigail Victoria Gurin Schleifer, Asaf Salman, Beata Beigman Klebanov, Giora Alexandron, Moriah Ariely.

Figure 1. Human-Model Agreement With Respect To Student’s Response Quality For All The Models

Abstract

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents an empirical study on automated short answer scoring (ASAS) comparing few-shot large language models (GPT-5.2, GPT-4o, Claude Opus 4.5), a fine-tuned BERT-based encoder, and human experts on two biology open-ended items. Using several hundred student responses with ground-truth scores from a biology education expert, it claims that AI models achieve high agreement on fully correct and fully incorrect responses but show substantial degradation in agreement on mid-range responses. This degradation is most severe for few-shot LLMs with limited examples and is reduced with greater task-specific adaptation, with fine-tuned models performing best. Human-human agreement is reported as highest and stable across the quality spectrum. The authors suggest this may lead to inequitable evaluation of student responses indicating developing understanding.

Significance. If the central findings hold, this work is significant for highlighting quality-conditioned fairness issues in ASAS, particularly the vulnerability of mid-range scoring which is crucial for assessing partial student understanding. The systematic comparison across different levels of task-specific adaptation (few-shot vs. fine-tuned) provides actionable insights into how to improve model alignment on complex scoring tasks. The use of real student responses from biology items adds ecological validity to the results.

major comments (2)
  1. Methods section on data annotation: The ground-truth scores are assigned by a single biology education expert without reported inter-rater reliability (IRR) metrics or validation procedures, especially for the mid-range responses that require nuanced interpretation. Since the degradation claim is defined as reduced agreement with these scores, and the abstract notes that mid-range responses need nuanced interpretation, it is unclear whether the observed drop reflects model limitations or divergence from one rater's specific judgments. Clarifying whether human-human agreement uses independent raters or repeated scoring by the same expert on mid-range items is necessary to support the claim.
  2. Results section: The results from several hundred responses across two items are presented without statistical tests, confidence intervals, or exact agreement values per quality band (e.g., Cohen's kappa or percentage agreement for low/mid/high). This absence makes it difficult to evaluate the robustness and magnitude of the reported mid-range degradation and its conditioning on task-specific adaptation.
minor comments (2)
  1. Abstract: The abstract mentions 'several hundred student responses' but could specify the exact number and the two items for better context.
  2. Methods: Exact prompt templates used for the few-shot LLMs are not detailed, which would aid reproducibility given the known sensitivity of LLM outputs to prompting.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important aspects of our methods and results presentation that we will address through revisions to improve clarity and rigor. We respond to each major comment below.

Point-by-point responses
  1. Referee: Methods section on data annotation: The ground-truth scores are assigned by a single biology education expert without reported inter-rater reliability (IRR) metrics or validation procedures, especially for the mid-range responses that require nuanced interpretation. Since the degradation claim is defined as reduced agreement with these scores, and the abstract notes that mid-range responses need nuanced interpretation, it is unclear whether the observed drop reflects model limitations or divergence from one rater's specific judgments. Clarifying whether human-human agreement uses independent raters or repeated scoring by the same expert on mid-range items is necessary to support the claim.

    Authors: We thank the referee for raising this critical point on annotation reliability. The ground-truth scores were assigned by one biology education expert using a rubric developed through iterative consultation with domain specialists. Human-human agreement was calculated between this expert and a second independent rater on a stratified subset of 100 responses (including mid-range items) to establish a baseline; the second rater received the same rubric and training materials. We will revise the Methods section to explicitly detail this two-rater procedure, the subset selection, and the resulting agreement values. This clarification will distinguish inter-rater stability from potential single-rater idiosyncrasies and better contextualize the AI degradation findings. Full IRR across the entire dataset was not collected due to practical constraints, but the reported human agreement supports the claim of stability across quality levels. revision: yes

  2. Referee: Results section: The results from several hundred responses across two items are presented without statistical tests, confidence intervals, or exact agreement values per quality band (e.g., Cohen's kappa or percentage agreement for low/mid/high). This absence makes it difficult to evaluate the robustness and magnitude of the reported mid-range degradation and its conditioning on task-specific adaptation.

    Authors: We agree that additional statistical detail is needed to strengthen the results. In the revised manuscript, we will report exact percentage agreement and Cohen's kappa for each quality band (low, mid, high) per model and item. We will add 95% confidence intervals computed via bootstrap resampling and include statistical tests (chi-square tests for differences in agreement proportions across bands, plus regression models testing the interaction between quality band and adaptation level). These will appear in updated tables and the Results text, allowing readers to assess the magnitude and significance of mid-range degradation and its reduction with greater task-specific adaptation. revision: yes
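The statistics named in the second response can be sketched with the standard library alone. The code below implements unweighted Cohen's kappa and a percentile bootstrap confidence interval; the function names, the resampling settings, and the toy scores are assumptions for illustration, and the rebuttal does not say which kappa variant or bootstrap scheme the authors would use.

```python
import random
from collections import Counter


def exact_agreement(a, b):
    """Fraction of responses on which the two raters give the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    n = len(a)
    observed = exact_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def bootstrap_ci(a, b, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(a, b), resampling responses."""
    rng = random.Random(seed)
    n = len(a)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(stat([a[i] for i in idx], [b[i] for i in idx]))
    values.sort()
    lower = values[int(alpha / 2 * n_boot)]
    upper = values[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy, made-up scores (not data from the paper).
expert = [0, 0, 1, 1]
model = [0, 0, 1, 0]
kappa = cohen_kappa(expert, model)
low, high = bootstrap_ci(expert, model, exact_agreement, n_boot=200)
print(kappa, low, high)
```

One caveat worth keeping in mind: if a quality band is defined by a single expert score (for example, all responses the expert scored 0), the expert's labels are constant within that band, so observed agreement equals chance agreement and kappa collapses to 0 regardless of model quality. Percentage agreement may be the more informative per-band statistic in that case, with kappa reserved for bands that span several score levels.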

Circularity Check

0 steps flagged

No significant circularity: empirical comparison to external labels

full rationale

The paper is a direct empirical evaluation of model agreement with ground-truth scores supplied by an external biology education expert on student responses. No equations, derivations, fitted parameters, or predictions are present that could reduce to the inputs by construction. Claims about mid-range degradation are measured against these independent human labels rather than derived from model internals or self-referential assumptions. Human-human agreement is reported as a separate benchmark but does not serve as a load-bearing premise for the AI results. The study therefore contains no self-definitional, fitted-input, or self-citation circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on treating expert-assigned scores as reliable ground truth and on the assumption that the two chosen biology items are representative of the scoring challenges that produce mid-range responses.

axioms (1)
  • domain assumption: Scores assigned by a single biology education expert constitute accurate ground truth for measuring model agreement.
    The paper uses these scores to compute all agreement metrics and to identify mid-range degradation.
