GIM is a new benchmark of 820 problems testing integrated cognitive skills on accessible knowledge, with IRT-calibrated evaluations across 28 models showing test-time compute choices matter as much as model selection.
1 Pith paper cites this work. Polarity classification is still being indexed.
fields: cs.AI (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers:
GIM: Evaluating models via tasks that integrate multiple cognitive domains