GIM: Evaluating models via tasks that integrate multiple cognitive domains

Alexandre Rezende; Rohit Patel; Steven McClain

arxiv: 2605.18663 · v1 · pith:Z7ZAMPUDnew · submitted 2026-05-18 · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-20 10:21 UTC · model grok-4.3

keywords: LLM benchmark, cognitive integration, item response theory, test-time compute, model evaluation, reasoning assessment, contamination detection

The pith

A new benchmark shows that choices like thinking budget and quantization affect LLM performance as much as selecting a different model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Grounded Integration Measure as a benchmark of 820 original problems whose difficulty comes from requiring models to combine multiple cognitive operations such as constraint satisfaction, state tracking, epistemic vigilance, and audience calibration over common knowledge. This design keeps reasoning grounded in realistic tasks without demanding rare expertise or reducing problems to pure abstraction. Calibration of a 2-parameter logistic IRT model on more than 200,000 prompt-response pairs across 28 models produces ability estimates that correctly rank test configurations even when raw scores are noisy or incomplete. Analysis of 22 models and 47 configurations reveals that within-family adjustments in thinking budget and quantization shift results comparably to differences between model families. The public-private split supplies a direct check for training-data contamination.

Core claim

The Grounded Integration Measure establishes a benchmark where each of the 820 problems requires models to coordinate several cognitive operations, including constraint satisfaction, state tracking, epistemic vigilance, and audience calibration, over broadly accessible knowledge. This integration creates difficulty without needing specialized facts or abstract isolation. A 2-parameter logistic IRT model fitted to over 200,000 prompt-response pairs yields ability estimates that order models and configurations robustly. The resulting leaderboard and analysis indicate that within-family variations in thinking budget and quantization are comparable in effect to differences between model families.

What carries the argument

The Grounded Integration Measure (GIM) benchmark, a collection of 820 expert-authored problems that each demand coordination of multiple cognitive operations over everyday knowledge to assess integrated reasoning.

Load-bearing premise

The problems derive their difficulty primarily from the requirement to integrate several cognitive operations rather than from hidden knowledge demands or construction artifacts, and the public-private split reliably detects contamination.

What would settle it

If performance gaps between models disappear when the integration requirements are removed from the problems while leaving the knowledge elements intact, or if the private-set scores diverge sharply from public-set predictions without other explanation.

Original abstract

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, the majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public-private split provides a built-in contamination diagnostic. We calibrate a continuous-response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GIM gives a practical integration benchmark with a large IRT calibration and extensive test-time compute data, but the claim that integration is the main difficulty driver lacks direct validation.

The main thing to know is that this paper introduces GIM, a set of 820 original problems whose difficulty is supposed to come from coordinating multiple cognitive operations over common knowledge, backed by over 200k prompt-response pairs and a 2PL IRT model intended to produce usable ability estimates even from messy data. They also ran what they describe as, to their knowledge, the most extensive published test-time compute sweep (11 models across 35 test configurations), and they release the evaluation framework, the public problems, and the fitted parameters.

Referee Report

3 major / 3 minor

Summary. The paper introduces the Grounded Integration Measure (GIM), a benchmark of 820 original expert-authored problems (615 public, 205 private) whose difficulty is intended to stem from the integration of multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, and audience calibration) over broadly accessible knowledge. The authors collect >200k prompt-response pairs across 28 models, score the majority of problems against decomposed rubrics (median 6 independently judged criteria), fit a continuous-response 2PL IRT model to produce ability estimates, and present a leaderboard for 22 models and 47 test configurations. They report that within-family choices such as thinking budget and quantization affect performance comparably to model selection, and release the public problems, calibrated IRT parameters, and evaluation framework.

Significance. If the key assumptions hold, the work provides a useful addition to LLM evaluation by focusing on grounded cognitive integration rather than knowledge escalation or purely abstract tasks. The scale of the test-time compute study (11 models across 35 configurations) and the release of the full framework plus IRT parameters are clear strengths that support reproducibility and future work. The public-private split offers a built-in contamination check.

major comments (3)

[Methods] Methods / problem construction section: the manuscript provides no pilot solvability tests, knowledge audits, or correlations between rubric criteria and IRT difficulty (b) parameters to confirm that variance is driven primarily by cognitive integration demands rather than hidden knowledge or construction artifacts. This validation is load-bearing for interpreting the leaderboard and the within-family configuration results.
[Results] IRT modeling and results sections: no model-fit diagnostics (e.g., item characteristic curves, residual analysis, or goodness-of-fit statistics) or inter-rater reliability coefficients for the rubric scoring are reported. These are necessary to support the claim that the 2PL ability estimates are robust and correctly order configurations even with missing data or errors.
[Results] Leaderboard and configuration analysis (likely §5 or equivalent): the observation that thinking budget and quantization matter as much as model selection rests on the assumption that IRT parameters isolate integration capability; without the missing validations above, model-specific knowledge or prompt sensitivity remain plausible confounds.

minor comments (3)

[Methods] Clarify the exact parameterization of the continuous-response 2PL IRT model and how rubric scores are mapped to the response variable.
[Methods] Add explicit reporting of the distribution of rubric criteria counts across the 820 problems and any inter-rater agreement statistics even if preliminary.
[Results] Ensure all leaderboard figures include error bars or uncertainty estimates derived from the IRT model.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the validation of problem construction and IRT modeling, which we have addressed through targeted revisions. We respond to each major comment below.

Referee: [Methods] Methods / problem construction section: the manuscript provides no pilot solvability tests, knowledge audits, or correlations between rubric criteria and IRT difficulty (b) parameters to confirm that variance is driven primarily by cognitive integration demands rather than hidden knowledge or construction artifacts. This validation is load-bearing for interpreting the leaderboard and the within-family configuration results.

Authors: We agree that these elements strengthen the interpretation that difficulty arises from cognitive integration. In the revised manuscript we have added a dedicated subsection in Methods describing the multi-expert pilot review process used to verify solvability and accessibility of knowledge, along with the knowledge audit procedure that cross-checked content against standard reference sources. We have also computed and reported Pearson correlations between per-criterion rubric scores and the fitted IRT b parameters; these appear in a new Appendix and show that criteria requiring multiple operations (e.g., joint constraint satisfaction and audience calibration) are the strongest predictors of item difficulty. revision: yes
Referee: [Results] IRT modeling and results sections: no model-fit diagnostics (e.g., item characteristic curves, residual analysis, or goodness-of-fit statistics) or inter-rater reliability coefficients for the rubric scoring are reported. These are necessary to support the claim that the 2PL ability estimates are robust and correctly order configurations even with missing data or errors.

Authors: We accept that explicit fit diagnostics and reliability metrics were omitted from the original submission. The revised version includes a new appendix with item characteristic curves, standardized residual plots, and chi-square goodness-of-fit statistics for the continuous-response 2PL model. For the rubric scoring we now report both pairwise Cohen’s kappa and average intraclass correlation coefficients across the rubric criteria (median of six per problem); these values support the stability of the rubric-derived scores used to generate the response data for IRT calibration. revision: yes
Referee: [Results] Leaderboard and configuration analysis (likely §5 or equivalent): the observation that thinking budget and quantization matter as much as model selection rests on the assumption that IRT parameters isolate integration capability; without the missing validations above, model-specific knowledge or prompt sensitivity remain plausible confounds.

Authors: We acknowledge that the comparative claim is more persuasive once the validations are in place. With the additions described in the responses to the first two comments, the revised discussion section now explicitly ties the configuration results to the new evidence that IRT difficulty aligns with integration demands and that the model fits the data adequately. This reduces the likelihood that the observed effects are driven primarily by model-specific knowledge or prompt artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper constructs an empirical benchmark of 820 original expert-authored problems and applies a standard 2PL IRT model to >200k collected prompt-response pairs to produce ability estimates and a leaderboard. The central claims (within-family configs mattering comparably to model choice) are direct comparisons of these fitted estimates across 22 models and 47 configurations; they do not reduce by construction to prior inputs, self-definitions, or self-citations. No equations or steps rename fitted parameters as independent predictions, import uniqueness theorems from the authors' prior work, or smuggle ansatzes via citation. The public-private split and rubric scoring supply external grounding rather than circular reinforcement. This is a self-contained empirical study whose derivation chain remains independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on the assumption that the constructed problems validly isolate cognitive integration and that the 2PL IRT model produces stable ability estimates across the observed response patterns; no new physical or mathematical entities are postulated.

free parameters (1)

2PL IRT item parameters
Discrimination and difficulty parameters for each of the 820 items are estimated from the >200k response data.
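
For orientation, the textbook 2PL item response function is written out below, with θ_i the ability of test configuration i, a_j the discrimination of item j, and b_j its difficulty. Reading the curve as an expected score in [0, 1] is an assumption made here for illustration; how the paper's continuous-response variant maps rubric scores onto the response variable is not stated in the material summarized on this page.

```latex
\mathbb{E}\left[\, y_{ij} \mid \theta_i \,\right]
  = \sigma\bigl(a_j(\theta_i - b_j)\bigr)
  = \frac{1}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)},
  \qquad y_{ij} \in [0, 1]
```

With two parameters per item, the 820-item bank carries 2 × 820 = 1,640 item parameters in addition to one ability value per test configuration.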

axioms (2)

domain assumption: The 2PL logistic model adequately describes the relationship between latent ability and observed responses on these tasks.
Invoked when fitting the IRT model to produce ability estimates that 'correctly order test-configurations even when raw accuracy is distorted'.
domain assumption: The majority of problems have rubric-decomposed scoring with independently judged criteria.
Stated as the scoring method for the benchmark problems.

pith-pipeline@v0.9.0 · 5843 in / 1567 out tokens · 33498 ms · 2026-05-20T10:21:51.714116+00:00 · methodology

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages

[1]

Comparing test sets with item response theory

arXiv:2106.00840. Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models.arXiv preprint arXiv:2404.18796, 2024. Lev S. Vygotsky.Mind in Society: The Development of Higher Psycho...

work page arXiv 2024
[2]

by excluding prior knowledge. The recently released ARC-AGI-3 (ARC Prize Foundation, 2026) extends this lineage to interactive, agentic, language-free environments, reporting humans at 100% and frontier AI below 1% as of March 2026 and demonstrating that the abstraction axis is far from saturated. BIG-Bench (Srivastava et al., 2023) and BIG-Bench Hard (Su...

work page 2026
[3]

Difficulty through cognitive demand.GIM problems require solvers to coordinate multiple cognitive operations—parsing ambiguous specifications, satisfying interacting constraints, tracking state across sequential steps, evaluating the reliability of presented information, and calibrating responses to context. Some problems concentrate difficulty in the num...

work page
[4]

Some problems test practical interaction directly: calibrating a technical explanation for an expert audience, or evaluating the internal consistency of a professional document

Practical grounding.Problems are drawn from realistic scenarios—analysis, decision-making, synthesis, planning, professional communication—that reflect the tasks users actually apply LLMs to. Some problems test practical interaction directly: calibrating a technical explanation for an expert audience, or evaluating the internal consistency of a profession...

work page
[5]

Quality through originality and multi-round review.Every one of the 820 problems is an original composition authored specifically for GIM by a domain expert, with no prompts copied from prior deliverables, competitions, or external corpora. An additional 100 of the 820 problems were contributed by the CRAG-MM (Wang et al., 2025) and WearVQA (Chang et al.,...

work page 2025
[6]

the current president

Rubric-first, partial-credit grading.The majority of GIM problems (528 of 820, 64%) are graded by decomposing the ideal response into independently judged criteria (median 6, mean 7, range 2–80). Each criterion is scored separately by an LLM judge under confidence-weighted aggregation (Wei et al., 2024; Patel, 2025), extending the rubric-decomposed gradin...

work page 2024
[7]

Rules out crossing the river in a single turn with the wolf, goat, cabbage, and dove due to weight restrictions

work page
[8]

Rules out crossing the river in two turns, since you would end up on the wrong side of the river

work page
[9]

Demonstrates crossing the river in three turns by transporting either the goat alone or the wolf, cabbage, and dove together on the first turn, returning alone on the second turn, and taking the remaining group across on the third turn

work page
[10]

Commentary:The classic wolf-goat-cabbage puzzle is among the most heavily represented reasoning problems in training data

Concludes that the minimum number of river crossings required to get everyone across is three. Commentary:The classic wolf-goat-cabbage puzzle is among the most heavily represented reasoning problems in training data. This variant adds weight constraints and a dove that provides lift, fundamentally changing the solution structure: the traditional incompat...

work page 1955
[11]

Notes that ZIP codes did not exist in 1955

work page 1955
[12]

Notes that the letter is likely a forgery or otherwise not authentic

work page
[13]

Commentary: The letterhead includes a five-digit ZIP code (10271), but the U.S

Notes that a historian cannot rely on this meeting having taken place. Commentary: The letterhead includes a five-digit ZIP code (10271), but the U.S. ZIP code system was not introduced until 1963. A letter dated 1955 containing a ZIP code is anachronistic, indicating the document is not genuine. The correct response is to flag the inconsistency and dec...

work page 1963
[14]

Identifies the first decision as TURN RIGHT

work page
[15]

Identifies the second decision as TURN LEFT

work page
[16]

Identifies the third decision as TURN RIGHT

work page
[17]

Identifies the fourth decision as TURN RIGHT

work page
[18]

Identifies the fifth decision as TURN RIGHT

work page
[19]

Formats the decisions in a numbered list. Commentary:The solver must parse the maze image, distinguish T-junctions (where decisions occur) from corners (where the path simply bends), mentally simulate the character’s orientation as it changes with each turn, and maintain a consistent left/right frame of reference throughout. Partial credit from the rubric...

work page
[20]

Recognizes that crossing the International Date Line eastward gains a calendar day

work page
[21]

Recognizes that even on the shortest possible crossing (e.g., Apia, Samoa to Pago Pago, American Samoa), the flight takes about 30 minutes and you gain 24 hours on the calendar

work page
[22]

Recognizes that the actual clock time would be approximately 12:07pm on Dec 31, not 11:37am—you cannot arrive at an earlier clock time than you departed on a 30-minute flight

work page
[23]

Commentary:The problem is designed to look like a lateral-thinking puzzle with a clever geographic answer

Concludes that the scenario is therefore impossible. Commentary:The problem is designed to look like a lateral-thinking puzzle with a clever geographic answer. Models readily identify the International Date Line mechanism and propose specific routes, but fail to notice that the clock times are inconsistent: a 30-minute flight must advance local time by ap...

work page
[24]

Avoids focusing on scientific basics or unnecessary detail

work page
[25]

block-copolymer,

Avoids unnecessary references comparing chemistry issues to the friend’s brief coal-mining job. Commentary:Both individuals were offered PhDs in Polymer Chemistry—the friend is a peer who chose a different career, not a layperson. Models consistently patronize: they explain what ATRP is, define “block-copolymer,” and draw strained analogies to mining. The...

work page
[26]

Recognizes that the coin toss in a cricket game usually confers a starting advantage to one of the two teams

work page
[27]

Determines that all that is necessary to confer a starting advantage randomly is that neither captain knows which of heads or tails is more likely prior to the toss

work page
[28]

Infers that the biased die simulating a biased coin confers advantage randomly to one of Bill and Will’s teams

work page
[29]

Concludes that the easiest way to ensure a fair simulated coin toss is to proceed as normal. Commentary:Models overwhelmingly propose elaborate fairness procedures (von Neumann extraction, multiple rolls, etc.) rather than recognizing that since neither captain knows the bias, the existing procedure is already fair. C.7 Epistemic Vigilance Category:World ...

work page
[30]

Notes that Tiger’s Nest is approximately 3,000 ft, not 5,000 ft, above the Paro Valley

work page
[31]

Notes that Tiger’s Nest is northeast of the midpoint cafeteria, or otherwise argues for why the sunset over the monastery could not be viewed from the cafeteria given their relative locations

work page
[32]

Commentary:The passage is fluent and evocative—the kind of polished prose that discourages close scrutiny

Notes that the sun setting directly behind the monastery is impossible from the described vantage point. Commentary:The passage is fluent and evocative—the kind of polished prose that discourages close scrutiny. Models tend to accept the prose at face value or flag only superficial style issues, failing to cross-reference the spatial claims against real-w...

work page
[33]

Recognizes that on hallways/corridors Hecarim closes the distance at 1 m/s (3 − 2 = 1), while on staircases (same direction) the time gap does not change; therefore Ashe’s goal is to maximize time on staircases

work page
[34]

Deduces that Hecarim’s optimal first move is to descend from floor 9 to floor 8

work page
[35]

Computes that if Ashe runs toward an end staircase of the middle school building, she is caught on floor 8 at t = 55 s

work page
[36]

Computes that Ashe can reach the far elementary staircase on floor 8 before Hecarim does.

work page
[37]

Infers that after Ashe enters that staircase, her only relevant continuations are going all the way down to floor 1 or all the way up to floor 10

work page
[38]

Computes that going down yields capture at t = 70 s

work page
[39]

Computes that going up yields capture at t = 75 s

work page
[40]

Concludes that optimal play yields a capture time of 75 seconds. Commentary:The problem requires building a spatial model of the environment, reasoning about relative velocities across different movement modes, and then solving a minimax pursuit problem with discrete staircase constraints. Each step individually is tractable; the integration is what defea...

work page
[41]

Explains the theorem to a child in exactly 100 words

work page
[42]

Explains the theorem to a teenager in exactly 100 words

work page
[43]

Explains the theorem to a college student in exactly 100 words

work page
[44]

Explains the theorem to a graduate student in exactly 100 words

work page
[45]

I can’t believe they didn’t confirm this sooner! It’s going to take at least 30 minutes to get those earrings from the storage unit and be back here

Explains the theorem to an expert in exactly 100 words. Commentary: The problem requires simultaneously satisfying two orthogonal demands: adapting content sophistication across five distinct audience levels and hitting an exact word count for each. Models typically satisfy one constraint at the expense of the other, producing well-calibrated explanations tha...

work page
[46]

Recognizes that Daphne did not fly with John, Katherine, Toby, and Bay to Puerto Rico

work page
[47]

Acknowledges that Daphne is not an eligible sibling in the contest to hug Angelo first

work page
[48]

States that Toby won the sibling contest

work page
[49]

States that Toby presents an item to Abby

work page
[50]

States that Abby receives grandma’s bracelet

work page
[51]

States that grandma’s pearl earrings are in the purple bag

work page
[52]

States that grandma’s bracelet is in the blue bag

work page
[53]

States that Bay grabs the blue bag from the storage unit

work page
[54]

Commentary:Information about which bag to retrieve is corrupted as it passes through a telephone-game chain

Avoids stating that Bay saw Katherine’s text reminding her the earrings are in the purple bag. Commentary:Information about which bag to retrieve is corrupted as it passes through a telephone-game chain. The reader must recognize that Daphne is on a semester abroad (implied, never stated outright) and therefore not on the flight, making her ineligible des...

work page
[55]

dawn” starts with a consonant, therefore makes no change to the 11th word “his

Identifies that the 10th word “dawn” starts with a consonant, therefore makes no change to the 11th word “his”

work page
[56]

had” starts with a consonant, therefore makes no change to the 21st word “reached

Identifies that the 20th word “had” starts with a consonant, therefore makes no change to the 21st word “reached”

work page
[57]

in” starts with a vowel, therefore replaces the 31st word “peril

Identifies that the 30th word “in” starts with a vowel, therefore replaces the 31st word “peril” with an Old English equivalent such as “fær”

work page
[58]

the” starts with a consonant, therefore makes no change to the 41st word “fields

Identifies that the 40th word “the” starts with a consonant, therefore makes no change to the 41st word “fields”

work page
[59]

compassionate

Identifies that the 50th word “compassionate” starts with a consonant, therefore makes no change to the 51st word “insisted”

work page
[60]

people’s

Identifies that the 60th word “people’s” starts with a consonant, therefore makes no change to the 61st word “welfare”

work page
[61]

Preserves all other text unchanged from the original passage. Commentary:The solver must chain three operations—accurate word counting, vowel/consonant classification of the preceding word, and Old English translation—where an error in any step cascades. The rubric samples specific positions across the passage (words 10–11, 20–21, 30–31, etc.) to detect s...

work page 2024
[62]

Each rubric item is scored by the LLM judge as a $(s_i, c_i)$ pair, where $s_i \in [0, 1]$ is the per-criterion score and $c_i \in [0, 1]$ is the judge’s self-reported confidence

Rubric-graded scoring. Problems with structured rubrics are decomposed into independently assessable criteria (Min et al., 2023). Each rubric item is scored by the LLM judge as a $(s_i, c_i)$ pair, where $s_i \in [0, 1]$ is the per-criterion score and $c_i \in [0, 1]$ is the judge’s self-reported confidence. The per-sample score is the confidence-weighted mean $\frac{1}{n}\sum_{i=1}^{n} s_i$...

work page 2023
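
The aggregation step described in the excerpt above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the excerpt's formula is truncated, so the normalization by total confidence, the fallback when all confidences are zero, and the function name `confidence_weighted_score` are assumptions.

```python
from typing import Sequence, Tuple


def confidence_weighted_score(pairs: Sequence[Tuple[float, float]]) -> float:
    """Aggregate (score, confidence) pairs into one per-sample score.

    Assumption: weights are normalized by total confidence. The exact
    normalization used in the paper is not visible in the truncated excerpt.
    """
    if not pairs:
        raise ValueError("at least one rubric criterion is required")
    for s, c in pairs:
        if not (0.0 <= s <= 1.0 and 0.0 <= c <= 1.0):
            raise ValueError("scores and confidences must lie in [0, 1]")
    total_conf = sum(c for _, c in pairs)
    if total_conf == 0.0:
        # Assumed fallback: no confident judgement at all, use the plain mean.
        return sum(s for s, _ in pairs) / len(pairs)
    return sum(s * c for s, c in pairs) / total_conf
```

Under this sketch, a criterion the judge is unsure about contributes proportionally less to the sample score, which is the stated purpose of recording confidence alongside each score.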
[63]

D.9 LLM-as-judge implementation Both scoring strategies use an LLM judge operating under structured output constraints (Zheng et al., 2023)

Exact-answer scoring. Problems with a definitive golden answer are scored by comparing the model’s output against the target, accounting for representational equivalence (e.g., 0.5 = 1/2), format variations, and null equivalences (e.g., N/A ≡ empty ≡ none). D.9 LLM-as-judge implementation Both scoring strategies use an LLM judge operating under structured output...

work page 2023
[64]

Question: The first input is the question

work page
[65]

Gold target: The second input is the expected answer

work page
[66]

Final answer: the final answer is

Predicted answer: The third input is the answer to verify, which may contain reasoning and should end with a specific format (e.g., "Final answer: the final answer is", "Answer:", or "Answer is"). Task Requirements Your task is to:

work page
[67]

Check if a real answer is generated

work page
[68]

If it does, extract the final answer and compare it with the golden target

work page
[69]

Consider answers as the same if they are represented in different formats (e.g., 0.5 and 1/2) or have an absolute difference of less than 0.01 (e.g., sqrt(2) and 1.41)

work page
[70]

Handle non-numeric answers, such as booleans or lists of strings, where order matters

work page
[71]

Account for choice questions where the expected answer is a letter (A, B, C, D) and the real answer is a string mentioned in the question

work page
[72]

Treat N/A as equivalent to null, none, or empty in program analysis outputs

work page
[73]

the answer is something shown above

Identify answers that involve context, such as "the answer is something shown above". Here is a new example. Grade the predicted answer as one of: CORRECT, INCORRECT. ‘‘‘ Question: {question} Gold target: {answer} Predicted answer: {predicted_answer} ‘‘‘ The gold target is the ground truth -- do not question its correctness. Even if the predicted answer a...

work page 2020
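
Several of the equivalence rules listed in the judge prompt above (format-insensitive numbers, an absolute tolerance of 0.01, and null equivalences) are mechanical enough to sketch deterministically. The paper applies these rules through an LLM judge; the code below is only an illustration of the rules themselves, and the names `NULL_TOKENS`, `parse_number`, and `answers_match`, as well as the case-insensitive string fallback, are assumptions.

```python
from fractions import Fraction
from typing import Optional

# Tokens treated as "no answer", following the N/A = null = none = empty rule.
NULL_TOKENS = {"", "n/a", "null", "none", "empty"}


def parse_number(text: str) -> Optional[float]:
    """Parse strings such as '0.5' or '1/2'; return None if not numeric."""
    try:
        return float(Fraction(text.strip()))
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(predicted: str, gold: str, tol: float = 0.01) -> bool:
    """Return True when two answers are equivalent under the listed rules."""
    p, g = predicted.strip().lower(), gold.strip().lower()
    if p in NULL_TOKENS and g in NULL_TOKENS:
        return True
    p_num, g_num = parse_number(p), parse_number(g)
    if p_num is not None and g_num is not None:
        # Numeric answers match when their absolute difference is below tol.
        return abs(p_num - g_num) < tol
    # Assumed fallback: case-insensitive exact string comparison.
    return p == g
```

Rules that need reading comprehension, such as mapping a multiple-choice letter to the option text or resolving "the answer is something shown above", are deliberately left out; those are the cases where a judge model is doing real work.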
[74]

Cross-slice comparable scores with information-weighted standard errors. A raw mean is also closed-form, but means on different subsets of the bank are not comparable to each other (a 0.7 on an easy slice is not the same ability as a 0.7 on a hard slice), and its s/√n SE treats every item as equally informative. The IRT scorer puts every slice on the same θ scal...

work page
[75]

image-disabled models skip image prompts; partially-run configurations skip late prompts)

Missing-data robustness.Different (model, thinking-level) configurations cover slightly different subsets of the bank (e.g. image-disabled models skip image prompts; partially-run configurations skip late prompts). The IRT scorer down-weights or omits unobserved cells coherently, whereas a raw mean would silently change its support

work page
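
The idea that unobserved cells are omitted rather than scored as zero can be sketched as an ability estimate computed over observed items only. This is an illustration under stated assumptions, not the paper's scorer: it treats a continuous score in [0, 1] with a Bernoulli-style quasi-likelihood, holds the item parameters fixed, and uses hypothetical names (`estimate_theta`, `scores`, `a`, `b`) and an arbitrary clamp of θ to [-6, 6].

```python
import math
from typing import Optional, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def estimate_theta(
    scores: Sequence[Optional[float]],
    a: Sequence[float],
    b: Sequence[float],
    iters: int = 50,
) -> Tuple[float, float]:
    """Newton-Raphson ability estimate that skips unobserved items.

    scores[j] is a score in [0, 1], or None when item j was not run.
    Returns (theta, standard_error), where the standard error is the
    inverse square root of the accumulated 2PL Fisher information.
    """
    observed = [(y, aj, bj) for y, aj, bj in zip(scores, a, b) if y is not None]
    if not observed:
        raise ValueError("no observed items")
    theta, info = 0.0, 0.0
    for _ in range(iters):
        grad, info = 0.0, 0.0
        for y, aj, bj in observed:
            p = sigmoid(aj * (theta - bj))
            grad += aj * (y - p)             # gradient of the quasi-log-likelihood
            info += aj * aj * p * (1.0 - p)  # Fisher information of one 2PL item
        step = max(-1.0, min(1.0, grad / info))       # damped Newton step
        theta = max(-6.0, min(6.0, theta + step))     # assumed guard against divergence
        if abs(step) < 1e-9:
            break
    return theta, 1.0 / math.sqrt(info)
```

Dropping an item from `scores` leaves the estimate on the same θ scale and widens the standard error, whereas a raw mean over the remaining items would quietly change what it averages over.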
[76]

Item-level diagnostics. The fitted item difficulty $b_j$ and discrimination $a_j$ become first-class objects, used in Section 6 to characterize the bank, in Section 5.3 to compare cognitive dimensions, and as an ongoing tool for retiring saturated items

work page
[77]

A calibrated logit scale. The transform stretches differences at the frontier (where high-$a_j$ items concentrate) and compresses differences in the middle, which is precisely the property we want from an ability scale on a benchmark designed to remain unsaturated

work page
[78]

quantitative reasoning

Robustness to inference-failure noise. Higher-end models at higher thinking budgets fail more often outright (longer chains of thought → more timeouts and truncated responses), and these zero/missing cells can drag a naive raw mean below a less-thinking sibling whose surviving answers are uniformly better. For example, GPT 5.4 fails on 2.3% of X-High attempts...

work page 2006

[1] [1]

Comparing test sets with item response theory

arXiv:2106.00840. Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models.arXiv preprint arXiv:2404.18796, 2024. Lev S. Vygotsky.Mind in Society: The Development of Higher Psycho...

[2]

by excluding prior knowledge. The recently released ARC-AGI-3 (ARC Prize Foundation, 2026) extends this lineage to interactive, agentic, language-free environments, reporting humans at 100% and frontier AI below 1% as of March 2026 and demonstrating that the abstraction axis is far from saturated. BIG-Bench (Srivastava et al., 2023) and BIG-Bench Hard (Su...

[3]

Difficulty through cognitive demand.GIM problems require solvers to coordinate multiple cognitive operations—parsing ambiguous specifications, satisfying interacting constraints, tracking state across sequential steps, evaluating the reliability of presented information, and calibrating responses to context. Some problems concentrate difficulty in the num...

[4]

Practical grounding.Problems are drawn from realistic scenarios—analysis, decision-making, synthesis, planning, professional communication—that reflect the tasks users actually apply LLMs to. Some problems test practical interaction directly: calibrating a technical explanation for an expert audience, or evaluating the internal consistency of a profession...

[5]

Quality through originality and multi-round review.Every one of the 820 problems is an original composition authored specifically for GIM by a domain expert, with no prompts copied from prior deliverables, competitions, or external corpora. An additional 100 of the 820 problems were contributed by the CRAG-MM (Wang et al., 2025) and WearVQA (Chang et al.,...

[6]

Rubric-first, partial-credit grading.The majority of GIM problems (528 of 820, 64%) are graded by decomposing the ideal response into independently judged criteria (median 6, mean 7, range 2–80). Each criterion is scored separately by an LLM judge under confidence-weighted aggregation (Wei et al., 2024; Patel, 2025), extending the rubric-decomposed gradin...

[7]

Rules out crossing the river in a single turn with the wolf, goat, cabbage, and dove due to weight restrictions

[8]

Rules out crossing the river in two turns, since you would end up on the wrong side of the river

[9]

Demonstrates crossing the river in three turns by transporting either the goat alone or the wolf, cabbage, and dove together on the first turn, returning alone on the second turn, and taking the remaining group across on the third turn

[10]

Concludes that the minimum number of river crossings required to get everyone across is three. Commentary:The classic wolf-goat-cabbage puzzle is among the most heavily represented reasoning problems in training data. This variant adds weight constraints and a dove that provides lift, fundamentally changing the solution structure: the traditional incompat...

[11]

Notes that ZIP codes did not exist in 1955

[12]

Notes that the letter is likely a forgery or otherwise not authentic

[13]

Notes that a historian cannot rely on this meeting having taken place. 24 Commentary:The letterhead includes a five-digit ZIP code (10271), but the U.S. ZIP code system was not introduced until 1963. A letter dated 1955 containing a ZIP code is anachronistic, indicating the document is not genuine. The correct response is to flag the inconsistency and dec...

[14]

Identifies the first decision as TURN RIGHT

[15]

Identifies the second decision as TURN LEFT

[16]

Identifies the third decision as TURN RIGHT

[17]

Identifies the fourth decision as TURN RIGHT

[18]

Identifies the fifth decision as TURN RIGHT

[19]

Formats the decisions in a numbered list. Commentary:The solver must parse the maze image, distinguish T-junctions (where decisions occur) from corners (where the path simply bends), mentally simulate the character’s orientation as it changes with each turn, and maintain a consistent left/right frame of reference throughout. Partial credit from the rubric...

[20]

Recognizes that crossing the International Date Line eastward gains a calendar day

[21]

Recognizes that even on the shortest possible crossing (e.g., Apia, Samoa to Pago Pago, American Samoa), the flight takes about 30 minutes and you gain 24 hours on the calendar

[22]

Recognizes that the actual clock time would be approximately 12:07pm on Dec 31, not 11:37am—you cannot arrive at an earlier clock time than you departed on a 30-minute flight

[23]

Concludes that the scenario is therefore impossible. Commentary:The problem is designed to look like a lateral-thinking puzzle with a clever geographic answer. Models readily identify the International Date Line mechanism and propose specific routes, but fail to notice that the clock times are inconsistent: a 30-minute flight must advance local time by ap...

[24]

Avoids focusing on scientific basics or unnecessary detail

[25]

Avoids unnecessary references comparing chemistry issues to the friend’s brief coal-mining job. Commentary:Both individuals were offered PhDs in Polymer Chemistry—the friend is a peer who chose a different career, not a layperson. Models consistently patronize: they explain what ATRP is, define “block-copolymer,” and draw strained analogies to mining. The...

[26]

Recognizes that the coin toss in a cricket game usually confers a starting advantage to one of the two teams

[27]

Determines that all that is necessary to confer a starting advantage randomly is that neither captain knows which of heads or tails is more likely prior to the toss

[28]

Infers that the biased die simulating a biased coin confers advantage randomly to one of Bill and Will’s teams

[29]

Concludes that the easiest way to ensure a fair simulated coin toss is to proceed as normal. Commentary:Models overwhelmingly propose elaborate fairness procedures (von Neumann extraction, multiple rolls, etc.) rather than recognizing that since neither captain knows the bias, the existing procedure is already fair. C.7 Epistemic Vigilance Category:World ...

[30]

Notes that Tiger’s Nest is approximately 3,000 ft, not 5,000 ft, above the Paro Valley

[31]

Notes that Tiger’s Nest is northeast of the midpoint cafeteria, or otherwise argues for why the sunset over the monastery could not be viewed from the cafeteria given their relative locations

[32]

Notes that the sun setting directly behind the monastery is impossible from the described vantage point. Commentary:The passage is fluent and evocative—the kind of polished prose that discourages close scrutiny. Models tend to accept the prose at face value or flag only superficial style issues, failing to cross-reference the spatial claims against real-w...

[33]

Recognizes that on hallways/corridors Hecarim closes the distance at 1 m/s (3 − 2 = 1), while on staircases (same direction) the time gap does not change; therefore Ashe’s goal is to maximize time on staircases

[34]

Deduces that Hecarim’s optimal first move is to descend from floor 9 to floor 8

[35]

Computes that if Ashe runs toward an end staircase of the middle school building, she is caught on floor 8 at t = 55 s

[36]

Computes that Ashe can reach the far elementary staircase on floor 8 before Hecarim does.

[37]

Infers that after Ashe enters that staircase, her only relevant continuations are going all the way down to floor 1 or all the way up to floor 10

[38]

Computes that going down yields capture at t = 70 s

[39]

Computes that going up yields capture at t = 75 s

[40]

Concludes that optimal play yields a capture time of 75 seconds. Commentary:The problem requires building a spatial model of the environment, reasoning about relative velocities across different movement modes, and then solving a minimax pursuit problem with discrete staircase constraints. Each step individually is tractable; the integration is what defea...

[41]

Explains the theorem to a child in exactly 100 words

[42]

Explains the theorem to a teenager in exactly 100 words

[43]

Explains the theorem to a college student in exactly 100 words

[44]

Explains the theorem to a graduate student in exactly 100 words

[45]

Explains the theorem to an expert in exactly 100 words. Commentary:The problem requires simultaneously satisfying two orthogonal demands: adapting content sophistication across five distinct audience levelsandhitting an exact word count for each. Models typically satisfy one constraint at the expense of the other—producing well-calibrated explanations tha...

[46]

Recognizes that Daphne did not fly with John, Katherine, Toby, and Bay to Puerto Rico

[47]

Acknowledges that Daphne is not an eligible sibling in the contest to hug Angelo first

[48]

States that Toby won the sibling contest

[49]

States that Toby presents an item to Abby

[50]

States that Abby receives grandma’s bracelet

[51]

States that grandma’s pearl earrings are in the purple bag

[52]

States that grandma’s bracelet is in the blue bag

[53]

States that Bay grabs the blue bag from the storage unit

[54]

Avoids stating that Bay saw Katherine’s text reminding her the earrings are in the purple bag. Commentary:Information about which bag to retrieve is corrupted as it passes through a telephone-game chain. The reader must recognize that Daphne is on a semester abroad (implied, never stated outright) and therefore not on the flight, making her ineligible des...

[55]

Identifies that the 10th word “dawn” starts with a consonant, therefore makes no change to the 11th word “his”

[56]

Identifies that the 20th word “had” starts with a consonant, therefore makes no change to the 21st word “reached”

[57]

Identifies that the 30th word “in” starts with a vowel, therefore replaces the 31st word “peril” with an Old English equivalent such as “fær”

[58]

Identifies that the 40th word “the” starts with a consonant, therefore makes no change to the 41st word “fields”

[59]

Identifies that the 50th word “compassionate” starts with a consonant, therefore makes no change to the 51st word “insisted”

[60]

Identifies that the 60th word “people’s” starts with a consonant, therefore makes no change to the 61st word “welfare”

[61]

Preserves all other text unchanged from the original passage. Commentary:The solver must chain three operations—accurate word counting, vowel/consonant classification of the preceding word, and Old English translation—where an error in any step cascades. The rubric samples specific positions across the passage (words 10–11, 20–21, 30–31, etc.) to detect s...
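The position-dependent rule checked by the rubric items above can be sketched as a small helper. It assumes the trigger positions are every tenth word (10, 20, 30, ...), which matches the positions the rubric samples but may not be the full rule of the original problem; the function name and the `step` parameter are hypothetical. The Old English translation step is left out, since it cannot be done mechanically.

```python
VOWELS = set("aeiou")

def positions_to_replace(passage, step=10):
    """Return 1-based positions of words that must be replaced.

    For every trigger position k = step, 2*step, ... the word at k is
    inspected; if it starts with a vowel, the word at k + 1 is marked.
    """
    words = passage.split()
    hits = []
    # k < len(words) guarantees that a word at position k + 1 exists.
    for k in range(step, len(words), step):
        trigger = words[k - 1].lstrip("\"'“‘(").lower()
        if trigger and trigger[0] in VOWELS:
            hits.append(k + 1)
    return hits
```

A counting error of a single word shifts every later trigger position, which is why a mistake early in the passage cascades through the rest of the answer.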

[62]

Rubric-graded scoring.Problems with structured rubrics are decomposed into independently assessable criteria (Min et al., 2023). Each rubric item is scored by the LLM judge as a(si, ci)pair, where si ∈ [0, 1] is the per-criterion score andci ∈ [0, 1]is the judge’s self-reported confidence. The per-sample score is the confidence-weighted mean 1 n nX i=1 si...
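A hedged sketch of confidence-weighted aggregation over (s_i, c_i) pairs follows. The exact normalization is an assumption: by default the function averages the products s_i · c_i over the n criteria, and an optional flag divides by the total confidence instead. Neither variant is claimed to be the paper's implementation, and the function name is hypothetical.

```python
def confidence_weighted_score(criteria, normalize_by_confidence=False):
    """Aggregate per-criterion judge outputs into one sample score.

    criteria: list of (s, c) pairs, each value in [0, 1], where s is the
              per-criterion score and c the judge's self-reported confidence.
    """
    if not criteria:
        raise ValueError("rubric has no criteria")
    weighted = sum(s * c for s, c in criteria)
    if normalize_by_confidence:
        total_confidence = sum(c for _, c in criteria)
        if total_confidence == 0:
            return 0.0
        return weighted / total_confidence
    return weighted / len(criteria)
```

With the default variant, a low-confidence judgment pulls the score toward zero; with the normalized variant, it simply counts for less.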

[63]

Exact-answer scoring.Problems with a definitive golden answer are scored by comparing the model’s output against the target, accounting for representational equivalence (e.g.,0.5 = 1 2), format variations, and null equivalences (e.g.,N/A≡empty≡none). D.9 LLM-as-judge implementation Both scoring strategies use an LLM judge operating under structured output...

[64]

Question: The first input is the question

[65]

Gold target: The second input is the expected answer

[66]

Predicted answer: The third input is the answer to verify, which may contain reasoning and should end with a specific format (e.g., "Final answer: the final answer is", "Answer:", or "Answer is"). Task Requirements Your task is to:

[67]

Check if a real answer is generated

[68]

If it does, extract the final answer and compare it with the golden target

[69]

Consider answers as the same if they are represented in different formats (e.g., 0.5 and 1/2) or have an absolute difference of less than 0.01 (e.g., sqrt(2) and 1.41)

[70]

Handle non-numeric answers, such as booleans or lists of strings, where order matters

[71]

Account for choice questions where the expected answer is a letter (A, B, C, D) and the real answer is a string mentioned in the question

[72]

Treat N/A as equivalent to null, none, or empty in program analysis outputs

[73]

Identify answers that involve context, such as "the answer is something shown above". Here is a new example. Grade the predicted answer as one of: CORRECT, INCORRECT. ‘‘‘ Question: {question} Gold target: {answer} Predicted answer: {predicted_answer} ‘‘‘ The gold target is the ground truth -- do not question its correctness. Even if the predicted answer a...
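Some of the equivalence rules in the judge prompt can be approximated deterministically, which is useful as a sanity check alongside the LLM judge. The sketch below covers only three of them: format-equivalent numbers (such as 0.5 and 1/2), an absolute tolerance of 0.01, and null equivalences. The helper names and the set of null tokens are assumptions; lists, multiple-choice letters, and context-dependent answers are not handled.

```python
from fractions import Fraction

# Tokens treated as "no answer"; the exact set is an assumption.
NULL_TOKENS = {"", "n/a", "null", "none", "empty"}

def _as_number(text):
    """Parse decimals and simple fractions; return None if not numeric."""
    try:
        return float(Fraction(text.strip()))
    except (ValueError, ZeroDivisionError):
        return None

def answers_match(gold, predicted, tol=0.01):
    """Return True if the predicted answer is equivalent to the gold target."""
    g, p = gold.strip().lower(), predicted.strip().lower()
    if g in NULL_TOKENS and p in NULL_TOKENS:
        return True
    g_num, p_num = _as_number(g), _as_number(p)
    if g_num is not None and p_num is not None:
        return abs(g_num - p_num) < tol
    return g == p  # fall back to case-insensitive string equality
```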
