Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels.
ACM Computing Surveys , volume=
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2representative citing papers
citing papers explorer
-
Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels.
- ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents