pith. machine review for the scientific record.

arxiv: 2605.00238 · v2 · submitted 2026-04-30 · 💻 cs.CL

Recognition: unknown

Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords: automatic short answer grading · item response theory · LLM evaluation · grading difficulty · response-level analysis · semantic correlates · error patterns

The pith

Item response theory models LLM grading as a function of ability and response difficulty, revealing performance variations that aggregate metrics miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that common aggregate scores like macro-F1 fail to show how LLM graders handle student answers of different difficulty levels. It applies item response theory to treat grading correctness as depending on each model's latent ability and each response's latent difficulty, which produces response-level diagnostics of where graders succeed or fail. This matters because it exposes robustness differences across models that look equivalent on overall accuracy. The analysis of seventeen LLMs on two benchmarks finds that accuracy declines at different rates with rising difficulty and that errors on hard responses cluster on the partially correct label. Difficult responses also show weaker semantic match to references, stronger contradictions, and greater isolation in embedding space.

Core claim

The central claim is that modeling LLM-based automatic short answer grading with item response theory, where the probability of correct grading follows a logistic function of grader ability minus response difficulty, yields finer-grained evaluation than aggregate metrics. On the SciEntsBank and Beetle datasets this reveals that models with comparable overall performance differ markedly in how sharply accuracy drops as difficulty increases, that mistakes concentrate on the partially_correct_incomplete category for hard items, and that estimated difficulty correlates with weaker semantic alignment, stronger contradiction signals, and greater embedding isolation.

What carries the argument

The item response theory logistic model that expresses grading correctness as a function of latent grader ability minus latent response difficulty.
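
In symbols (our notation, not necessarily the paper's), that logistic link can be written as follows; grader ability and response difficulty sit on a shared latent scale, and only their difference matters.

```latex
% 1PL (Rasch-style) link between latent traits and a correct grade.
% \theta_j : latent ability of grader j;  b_i : latent difficulty of response i.
\[
  P(y_{ij} = 1 \mid \theta_j, b_i)
    = \sigma(\theta_j - b_i)
    = \frac{1}{1 + e^{-(\theta_j - b_i)}}
\]
```

Comparable aggregate accuracy then corresponds to similar average success probabilities, while robustness differences show up in how fast this probability decays as difficulty rises relative to ability.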

If this is right

  • Models with matching aggregate F1 scores still differ substantially in how quickly grading accuracy falls as response difficulty rises (see the sketch after this list).
  • Errors on difficult responses concentrate on the partially_correct_incomplete label rather than spreading evenly across categories.
  • Higher estimated difficulty tracks weaker semantic alignment to the reference answer, stronger contradiction signals, and greater isolation in embedding space.
  • The framework supplies response-level diagnostics that aggregate metrics alone cannot provide.
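
A minimal sketch, with hypothetical array names (difficulty, correct) rather than the paper's variables, of the bin-wise accuracy slope m reported in Figure 1: responses are ordered by estimated IRT difficulty, split into equal-size bins, and per-bin accuracy is regressed on the bin index.

```python
# Sketch of the accuracy-vs-difficulty slope m, assuming one IRT difficulty
# estimate and one 0/1 grading outcome per response for a single grader.
import numpy as np

def accuracy_slope(difficulty, correct, n_bins=5):
    difficulty = np.asarray(difficulty, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(difficulty)            # order responses by difficulty
    bins = np.array_split(order, n_bins)      # equal-size difficulty bins
    acc = np.array([correct[idx].mean() for idx in bins])
    m = np.polyfit(np.arange(n_bins), acc, deg=1)[0]  # least-squares slope
    return acc, m

# Synthetic example: a grader of ability 0.5 under the 1PL model, so per-bin
# accuracy should fall (negative slope) as difficulty rises.
rng = np.random.default_rng(0)
diff = rng.normal(size=400)
correct = (rng.random(400) < 1.0 / (1.0 + np.exp(-(0.5 - diff)))).astype(int)
print(accuracy_slope(diff, correct))
```

A steeper negative slope for one model than another, at matching overall accuracy, is exactly the robustness difference the paper argues aggregate metrics hide.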

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Grader selection for real deployments could incorporate expected response difficulty rather than overall accuracy alone.
  • Training or fine-tuning loops could prioritize examples whose difficulty parameters are high to improve robustness on ambiguous answers.
  • The same modeling approach may apply to other subjective NLP tasks such as summarization evaluation or open-ended question scoring where difficulty is not uniform.

Load-bearing premise

The standard logistic item response model fits the LLM grading data without substantial misspecification and the estimated difficulty values reflect genuine response ambiguity.

What would settle it

Finding no statistically significant difference in the rate of accuracy decline across models when difficulty parameters are estimated from the same data, or observing zero correlation between those difficulty estimates and independent measures of semantic alignment or contradiction.
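
The correlation half of this test could be run as a simple rank-correlation check. A hedged sketch with hypothetical variable names; the paper's exact alignment measure (reference similarity, contradiction score, embedding isolation) may differ.

```python
# Rank correlation between IRT difficulty estimates and an independent
# semantic-alignment score (hypothetical synthetic data for illustration).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
difficulty = rng.normal(size=300)                       # per-response IRT difficulty
alignment = -0.4 * difficulty + rng.normal(size=300)    # similarity to reference answer

rho, p_value = spearmanr(difficulty, alignment)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
# The paper's claim predicts rho < 0 (harder responses align less with the
# reference); rho indistinguishable from zero would count against it.
```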

Figures

Figures reproduced from arXiv: 2605.00238 by Hendrik Drachsler, Leon Camus, Longwei Cong, Sebastian Gombert, Sonja Hahn, Ulf Kroehne.

Figure 1
Figure 1. Model accuracy across ordered response-difficulty bins on (a) SciEntsBank and (b) Beetle; m denotes the slope obtained by linearly regressing model accuracy on the order of the IRT-based difficulty bins. view at source ↗
Figure 2
Figure 2. Confusion matrices on SciEntsBank across five bins of IRT-derived response difficulty. view at source ↗
Figure 3
Figure 3. Model accuracy across ordered IRT-based difficulty bins for all evaluated models on (a) SciEntsBank and (b) Beetle. view at source ↗
Figure 4
Figure 4. Confusion matrices on Beetle across five bins of IRT-derived response difficulty. view at source ↗
read the original abstract

Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader ability and response grading difficulty. This formulation enables response-level analysis of where LLM graders succeed or fail and reveals robustness differences that are not visible from aggregate scores alone. We apply the framework to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks. The results show that even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response difficulty increases. In addition, confusion patterns show that errors on difficult responses concentrate disproportionately on the \texttt{partially\_correct\_incomplete} label, indicating a tendency toward intermediate-label collapse under ambiguity. To characterize difficult responses, we further analyze semantic and linguistic correlates of estimated difficulty. Across both datasets, higher difficulty is associated with weaker semantic alignment to the reference answer, stronger contradiction signals, and greater semantic isolation in embedding space. Overall, these results show that item response theory offers a useful framework for evaluating LLM-based ASAG beyond aggregate performance measures.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces an Item Response Theory (IRT) framework for evaluating LLM-based automatic short answer grading (ASAG) by modeling grading correctness as a function of latent grader ability and response difficulty. Applied to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks, it reports robustness differences across models not visible in aggregate metrics such as macro-F1 and Cohen's kappa, shows that errors on difficult responses concentrate on the partially_correct_incomplete label, and identifies semantic correlates of estimated difficulty including weaker reference alignment and greater embedding isolation.

Significance. If the IRT logistic model provides an adequate description of the observed grading outcomes, the framework supplies a response-level lens for ASAG evaluation that can guide model selection and highlight failure modes under ambiguity. The reported differences in accuracy decline with difficulty and the semantic correlates would constitute a concrete advance over aggregate-only reporting.

major comments (2)
  1. [§4] §4 (Experimental results): No item-fit statistics, residual plots, likelihood-ratio tests against a saturated or null model, or parameter-recovery simulations are reported for the fitted IRT logistic model. Because correctness is binarized from a three-class grading scheme and errors concentrate on the partially_correct_incomplete class, the logistic link may be misspecified; without these diagnostics the downstream claims about robustness differences and semantic correlates rest on unvalidated parameters.
  2. [§3.2] §3.2 (IRT formulation): The paper assumes the standard 1PL logistic form P(correct | ability, difficulty) = 1 / (1 + exp(-(ability - difficulty))) adequately captures LLM grading behavior. No comparison to a model that retains the three-class structure (e.g., graded response model) or to a null model that ignores difficulty is provided, leaving open whether the estimated difficulty parameters reflect response properties or modeling artifacts.
minor comments (2)
  1. [Table 1] Table 1 or equivalent: list the 17 LLMs explicitly with their parameter counts and base models so readers can assess coverage of the open-weight space.
  2. [§5] §5 (Semantic correlates): clarify the exact embedding model and distance metric used to compute semantic isolation so the correlation analysis is reproducible.
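
The item-fit statistics requested in major comment 1 are conventionally reported as Rasch infit/outfit mean squares. The sketch below shows that textbook computation under the fitted 1PL probabilities; it is our illustration, not the authors' pipeline, and X, theta, b are hypothetical names.

```python
# Rasch-style item-fit statistics: X[j, i] is the 0/1 correctness of grader j
# on response i; theta and b are the fitted ability and difficulty parameters.
import numpy as np

def item_fit(theta, b, X):
    theta, b = np.asarray(theta, float), np.asarray(b, float)
    X = np.asarray(X, float)
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # expected correctness
    W = P * (1.0 - P)                                         # Bernoulli variance
    Z2 = (X - P) ** 2 / W                                     # squared std. residuals
    outfit = Z2.mean(axis=0)                                  # unweighted mean square
    infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)        # variance-weighted
    return infit, outfit

# Mean squares far from 1 (commonly outside roughly 0.7-1.3) flag responses
# that the logistic model does not describe well.
```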

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of model validation for the IRT framework. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: §4 (Experimental results): No item-fit statistics, residual plots, likelihood-ratio tests against a saturated or null model, or parameter-recovery simulations are reported for the fitted IRT logistic model. Because correctness is binarized from a three-class grading scheme and errors concentrate on the partially_correct_incomplete class, the logistic link may be misspecified; without these diagnostics the downstream claims about robustness differences and semantic correlates rest on unvalidated parameters.

    Authors: We agree that the manuscript would be strengthened by additional model diagnostics. In the revised version we will add item-fit statistics (infit and outfit mean-square values), residual plots, and parameter-recovery simulations on synthetic data generated from the fitted parameters. We will also report a likelihood-ratio test against a null model with constant success probability. Regarding potential misspecification from binarization, we will expand the discussion in §4 to note the concentration of errors on the partial-credit label and include a sensitivity check that treats the three classes separately where feasible. revision: yes

  2. Referee: §3.2 (IRT formulation): The paper assumes the standard 1PL logistic form P(correct | ability, difficulty) = 1 / (1 + exp(-(ability - difficulty))) adequately captures LLM grading behavior. No comparison to a model that retains the three-class structure (e.g., graded response model) or to a null model that ignores difficulty is provided, leaving open whether the estimated difficulty parameters reflect response properties or modeling artifacts.

    Authors: The 1PL formulation was chosen for its direct interpretability of response difficulty on a shared scale with grader ability. We will revise §3.2 to include an explicit likelihood-ratio comparison against a null model that ignores difficulty (constant probability per grader). We will also add a brief discussion of why a graded-response model was not adopted: it would require treating the three labels as ordered and estimating multiple thresholds per response, which complicates the primary goal of obtaining a single difficulty parameter per response. We view the binarized approach as a reasonable first-order approximation for overall grading correctness and will note this modeling choice as a limitation. revision: partial
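
The likelihood-ratio test promised in both responses admits a compact form: twice the log-likelihood gap between the fitted 1PL model and a null model in which each grader succeeds with a constant probability on every response. The sketch below is our construction with hypothetical array names, not the authors' code.

```python
# Likelihood-ratio test: 1PL (ability - difficulty) vs. per-grader constant
# success probability. X[j, i] is 0/1 correctness; theta, b are fitted 1PL params.
import numpy as np
from scipy.stats import chi2

def log_lik_1pl(theta, b, X):
    P = 1.0 / (1.0 + np.exp(-(np.asarray(theta)[:, None] - np.asarray(b)[None, :])))
    P = np.clip(P, 1e-12, 1 - 1e-12)
    return np.sum(X * np.log(P) + (1 - X) * np.log(1 - P))

def log_lik_null(X):
    p = np.clip(X.mean(axis=1, keepdims=True), 1e-12, 1 - 1e-12)  # one rate per grader
    return np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))

def lr_test(theta, b, X):
    X = np.asarray(X, dtype=float)
    stat = 2.0 * (log_lik_1pl(theta, b, X) - log_lik_null(X))
    df = len(b) - 1  # extra difficulty parameters, up to identification constraints
    return stat, chi2.sf(stat, df)
```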

Circularity Check

0 steps flagged

No circularity: standard IRT applied to observed LLM grading outcomes without self-referential definitions or fitted predictions

full rationale

The paper applies the established item response theory logistic model to binary correctness outcomes from 17 LLMs grading responses in the SciEntsBank and Beetle datasets. Grading correctness is modeled as a function of latent grader ability and response difficulty using standard IRT estimation on the observed data; no parameters are defined in terms of the downstream semantic correlates or robustness differences being analyzed. There are no load-bearing self-citations propping up the central framework, no fitted inputs renamed as predictions, and no ansatz or uniqueness claims that reduce the derivation to its own inputs. The post-estimation analyses of confusion patterns and linguistic correlates are independent of the IRT fitting step itself, leaving the framework grounded in external benchmarks rather than in its own outputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of the standard IRT model to LLM grading outcomes; one latent parameter is estimated per grader (ability) and one per response (difficulty). A fitting sketch follows the ledger.

free parameters (2)
  • grader ability
    Latent continuous parameter estimated from observed grading correctness on the benchmark items.
  • response difficulty
    Latent continuous parameter estimated from observed grading correctness on the benchmark items.
axioms (1)
  • domain assumption Probability of correct grading follows a logistic function of (ability minus difficulty)
    Core modeling assumption of the 1PL (Rasch-style) IRT model invoked to link latent traits to observed binary outcomes.
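
A minimal sketch, under the ledger's assumptions, of how the two latent parameter families could be estimated jointly by maximizing the Bernoulli likelihood of the observed 0/1 outcomes. This is an illustrative re-implementation of a standard 1PL fit with hypothetical inputs, not the authors' estimation code.

```python
# Joint maximum-likelihood fit of grader ability and response difficulty under
# the 1PL assumption. X[j, i] = 1 if grader j graded response i correctly.
import numpy as np
from scipy.optimize import minimize

def fit_1pl(X):
    X = np.asarray(X, dtype=float)
    n_graders, n_items = X.shape

    def neg_log_lik(params):
        theta = params[:n_graders]             # grader ability
        b = params[n_graders:]                 # response difficulty
        logits = theta[:, None] - b[None, :]
        # -log P(correct) = softplus(-logit); -log P(incorrect) = softplus(logit)
        return np.sum(X * np.logaddexp(0.0, -logits) +
                      (1 - X) * np.logaddexp(0.0, logits))

    res = minimize(neg_log_lik, np.zeros(n_graders + n_items), method="L-BFGS-B")
    theta, b = res.x[:n_graders], res.x[n_graders:]
    shift = b.mean()                           # fix the location indeterminacy
    return theta - shift, b - shift
```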

pith-pipeline@v0.9.0 · 5552 in / 1229 out tokens · 40457 ms · 2026-05-14T20:51:07.973637+00:00 · methodology

discussion (0)

