Comparing test sets with item response theory
2 Pith papers cite this work.
fields: cs.AI (2)
representative citing papers
citing papers explorer
- GIM: Evaluating models via tasks that integrate multiple cognitive domains
  GIM is a new benchmark of 820 problems testing integrated cognitive skills on accessible knowledge, with IRT-calibrated evaluations across 28 models showing test-time compute choices matter as much as model selection.
- Position: AI Evaluations Should be Grounded on a Theory of Capability
  AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.