Comparing test sets with item response theory

Clara Vania, Phu Mon Htut, William Huang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, Samuel R Bowman · 2021 · arXiv 2106.00840

2 Pith papers cite this work.

representative citing papers

GIM: Evaluating models via tasks that integrate multiple cognitive domains

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

GIM is a new benchmark of 820 problems testing integrated cognitive skills on accessible knowledge, with IRT-calibrated evaluations across 28 models showing test-time compute choices matter as much as model selection.

Position: AI Evaluations Should be Grounded on a Theory of Capability

cs.AI · 2025-09-23 · conditional · novelty 5.0

AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

citing papers explorer

Showing 2 of 2 citing papers.

GIM: Evaluating models via tasks that integrate multiple cognitive domains
cs.AI · 2026-05-18 · unverdicted · none · ref 1
Position: AI Evaluations Should be Grounded on a Theory of Capability
cs.AI · 2025-09-23 · conditional · none · ref 50
