GIM: Evaluating models via tasks that integrate multiple cognitive domains
Pith reviewed 2026-05-20 10:21 UTC · model grok-4.3
The pith
A new benchmark shows that choices like thinking budget and quantization affect LLM performance as much as selecting a different model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Grounded Integration Measure establishes a benchmark where each of the 820 problems requires models to coordinate several cognitive operations, including constraint satisfaction, state tracking, epistemic vigilance, and audience calibration, over broadly accessible knowledge. This integration creates difficulty without needing specialized facts or abstract isolation. A 2-parameter logistic IRT model fitted to over 200,000 prompt-response pairs yields ability estimates that order models and configurations robustly. The resulting leaderboard and analysis indicate that within-family variations in thinking budget and quantization are comparable in effect to differences between model families.
What carries the argument
The Grounded Integration Measure (GIM) benchmark, a collection of 820 expert-authored problems that each demand coordination of multiple cognitive operations over everyday knowledge to assess integrated reasoning.
Load-bearing premise
The problems derive their difficulty primarily from the requirement to integrate several cognitive operations rather than from hidden knowledge demands or construction artifacts, and the public-private split reliably detects contamination.
What would settle it
The premise would be undermined if performance gaps between models disappeared when the integration requirements were removed from the problems while the knowledge elements were left intact, or if private-set scores diverged sharply from public-set predictions without another explanation.
Original abstract
As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, the majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public-private split provides a built-in contamination diagnostic. We calibrate a continuous-response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Grounded Integration Measure (GIM), a benchmark of 820 original expert-authored problems (615 public, 205 private) whose difficulty is intended to stem from the integration of multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, and audience calibration) over broadly accessible knowledge. The authors collect >200k prompt-response pairs across 28 models, grade the majority of problems with rubric-decomposed scoring (median 6 independently judged criteria), fit a continuous-response 2PL IRT model to produce ability estimates, and present a leaderboard for 22 models and 47 test configurations. They report that within-family choices such as thinking budget and quantization affect performance comparably to model selection, and release the public problems, calibrated IRT parameters, and evaluation framework.
Significance. If the key assumptions hold, the work provides a useful addition to LLM evaluation by focusing on grounded cognitive integration rather than knowledge escalation or purely abstract tasks. The scale of the test-time compute study (11 models across 35 configurations) and the release of the full framework plus IRT parameters are clear strengths that support reproducibility and future work. The public-private split offers a built-in contamination check.
major comments (3)
- [Methods] Methods / problem construction section: the manuscript provides no pilot solvability tests, knowledge audits, or correlations between rubric criteria and IRT difficulty (b) parameters to confirm that variance is driven primarily by cognitive integration demands rather than hidden knowledge or construction artifacts. This validation is load-bearing for interpreting the leaderboard and the within-family configuration results.
- [Results] IRT modeling and results sections: no model-fit diagnostics (e.g., item characteristic curves, residual analysis, or goodness-of-fit statistics) or inter-rater reliability coefficients for the rubric scoring are reported. These are necessary to support the claim that the 2PL ability estimates are robust and correctly order configurations even with missing data or errors.
- [Results] Leaderboard and configuration analysis (likely §5 or equivalent): the observation that thinking budget and quantization matter as much as model selection rests on the assumption that IRT parameters isolate integration capability; without the missing validations above, model-specific knowledge or prompt sensitivity remain plausible confounds.
minor comments (3)
- [Methods] Clarify the exact parameterization of the continuous-response 2PL IRT model and how rubric scores are mapped to the response variable.
- [Methods] Add explicit reporting of the distribution of rubric criteria counts across the 820 problems and any inter-rater agreement statistics even if preliminary.
- [Results] Ensure all leaderboard figures include error bars or uncertainty estimates derived from the IRT model.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the validation of problem construction and IRT modeling, which we have addressed through targeted revisions. We respond to each major comment below.
-
Referee: [Methods] Methods / problem construction section: the manuscript provides no pilot solvability tests, knowledge audits, or correlations between rubric criteria and IRT difficulty (b) parameters to confirm that variance is driven primarily by cognitive integration demands rather than hidden knowledge or construction artifacts. This validation is load-bearing for interpreting the leaderboard and the within-family configuration results.
Authors: We agree that these elements strengthen the interpretation that difficulty arises from cognitive integration. In the revised manuscript we have added a dedicated subsection in Methods describing the multi-expert pilot review process used to verify solvability and accessibility of knowledge, along with the knowledge audit procedure that cross-checked content against standard reference sources. We have also computed and reported Pearson correlations between per-criterion rubric scores and the fitted IRT b parameters; these appear in a new Appendix and show that criteria requiring multiple operations (e.g., joint constraint satisfaction and audience calibration) are the strongest predictors of item difficulty. revision: yes
-
Referee: [Results] IRT modeling and results sections: no model-fit diagnostics (e.g., item characteristic curves, residual analysis, or goodness-of-fit statistics) or inter-rater reliability coefficients for the rubric scoring are reported. These are necessary to support the claim that the 2PL ability estimates are robust and correctly order configurations even with missing data or errors.
Authors: We accept that explicit fit diagnostics and reliability metrics were omitted from the original submission. The revised version includes a new appendix with item characteristic curves, standardized residual plots, and chi-square goodness-of-fit statistics for the continuous-response 2PL model. For the rubric scoring we now report both pairwise Cohen’s kappa and average intraclass correlation coefficients across the rubric criteria (median six per problem); these values support the stability of the rubric scores used to generate the response data for IRT calibration. revision: yes
-
Referee: [Results] Leaderboard and configuration analysis (likely §5 or equivalent): the observation that thinking budget and quantization matter as much as model selection rests on the assumption that IRT parameters isolate integration capability; without the missing validations above, model-specific knowledge or prompt sensitivity remain plausible confounds.
Authors: We acknowledge that the comparative claim is more persuasive once the validations are in place. With the additions described in the responses to the first two comments, the revised discussion section now explicitly ties the configuration results to the new evidence that IRT difficulty aligns with integration demands and that the model fits the data adequately. This reduces the likelihood that the observed effects are driven primarily by model-specific knowledge or prompt artifacts. revision: yes
Circularity Check
No significant circularity in derivation chain
The paper constructs an empirical benchmark of 820 original expert-authored problems and applies a standard 2PL IRT model to >200k collected prompt-response pairs to produce ability estimates and a leaderboard. The central claims (within-family configs mattering comparably to model choice) are direct comparisons of these fitted estimates across 22 models and 47 configurations; they do not reduce by construction to prior inputs, self-definitions, or self-citations. No equations or steps rename fitted parameters as independent predictions, import uniqueness theorems from the authors' prior work, or smuggle ansatzes via citation. The public-private split and rubric scoring supply external grounding rather than circular reinforcement. This is a self-contained empirical study whose derivation chain remains independent of its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- 2PL IRT item parameters
axioms (2)
- domain assumption: The 2PL logistic model adequately describes the relationship between latent ability and observed responses on these tasks.
- domain assumption: The majority of problems have rubric-decomposed scoring with independently judged criteria.
Reference graph
Works this paper leans on
-
[1]
Comparing test sets with item response theory
arXiv:2106.00840. Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024. Lev S. Vygotsky. Mind in Society: The Development of Higher Psycho...
-
[2]
by excluding prior knowledge. The recently released ARC-AGI-3 (ARC Prize Foundation, 2026) extends this lineage to interactive, agentic, language-free environments, reporting humans at 100% and frontier AI below 1% as of March 2026 and demonstrating that the abstraction axis is far from saturated. BIG-Bench (Srivastava et al., 2023) and BIG-Bench Hard (Su...
-
[3]
Difficulty through cognitive demand. GIM problems require solvers to coordinate multiple cognitive operations: parsing ambiguous specifications, satisfying interacting constraints, tracking state across sequential steps, evaluating the reliability of presented information, and calibrating responses to context. Some problems concentrate difficulty in the num...
-
[4]
Practical grounding. Problems are drawn from realistic scenarios (analysis, decision-making, synthesis, planning, professional communication) that reflect the tasks users actually apply LLMs to. Some problems test practical interaction directly: calibrating a technical explanation for an expert audience, or evaluating the internal consistency of a profession...
-
[5]
Quality through originality and multi-round review. Every one of the 820 problems is an original composition authored specifically for GIM by a domain expert, with no prompts copied from prior deliverables, competitions, or external corpora. An additional 100 of the 820 problems were contributed by the CRAG-MM (Wang et al., 2025) and WearVQA (Chang et al.,...
-
[6]
Rubric-first, partial-credit grading. The majority of GIM problems (528 of 820, 64%) are graded by decomposing the ideal response into independently judged criteria (median 6, mean 7, range 2–80). Each criterion is scored separately by an LLM judge under confidence-weighted aggregation (Wei et al., 2024; Patel, 2025), extending the rubric-decomposed gradin...
-
[7]
Rules out crossing the river in a single turn with the wolf, goat, cabbage, and dove due to weight restrictions
-
[8]
Rules out crossing the river in two turns, since you would end up on the wrong side of the river
-
[9]
Demonstrates crossing the river in three turns by transporting either the goat alone or the wolf, cabbage, and dove together on the first turn, returning alone on the second turn, and taking the remaining group across on the third turn
-
[10]
Concludes that the minimum number of river crossings required to get everyone across is three.
Commentary: The classic wolf-goat-cabbage puzzle is among the most heavily represented reasoning problems in training data. This variant adds weight constraints and a dove that provides lift, fundamentally changing the solution structure: the traditional incompat...
-
[11]
Notes that ZIP codes did not exist in 1955
-
[12]
Notes that the letter is likely a forgery or otherwise not authentic
-
[13]
Notes that a historian cannot rely on this meeting having taken place.
Commentary: The letterhead includes a five-digit ZIP code (10271), but the U.S. ZIP code system was not introduced until 1963. A letter dated 1955 containing a ZIP code is anachronistic, indicating the document is not genuine. The correct response is to flag the inconsistency and dec...
-
[14]
Identifies the first decision as TURN RIGHT
-
[15]
Identifies the second decision as TURN LEFT
-
[16]
Identifies the third decision as TURN RIGHT
-
[17]
Identifies the fourth decision as TURN RIGHT
-
[18]
Identifies the fifth decision as TURN RIGHT
-
[19]
Formats the decisions in a numbered list.
Commentary: The solver must parse the maze image, distinguish T-junctions (where decisions occur) from corners (where the path simply bends), mentally simulate the character’s orientation as it changes with each turn, and maintain a consistent left/right frame of reference throughout. Partial credit from the rubric...
-
[20]
Recognizes that crossing the International Date Line eastward gains a calendar day
-
[21]
Recognizes that even on the shortest possible crossing (e.g., Apia, Samoa to Pago Pago, American Samoa), the flight takes about 30 minutes and you gain 24 hours on the calendar
-
[22]
Recognizes that the actual clock time would be approximately 12:07pm on Dec 31, not 11:37am—you cannot arrive at an earlier clock time than you departed on a 30-minute flight
-
[23]
Concludes that the scenario is therefore impossible.
Commentary: The problem is designed to look like a lateral-thinking puzzle with a clever geographic answer. Models readily identify the International Date Line mechanism and propose specific routes, but fail to notice that the clock times are inconsistent: a 30-minute flight must advance local time by ap...
-
[24]
Avoids focusing on scientific basics or unnecessary detail
-
[25]
Avoids unnecessary references comparing chemistry issues to the friend’s brief coal-mining job.
Commentary: Both individuals were offered PhDs in Polymer Chemistry; the friend is a peer who chose a different career, not a layperson. Models consistently patronize: they explain what ATRP is, define “block-copolymer,” and draw strained analogies to mining. The...
-
[26]
Recognizes that the coin toss in a cricket game usually confers a starting advantage to one of the two teams
-
[27]
Determines that all that is necessary to confer a starting advantage randomly is that neither captain knows which of heads or tails is more likely prior to the toss
-
[28]
Infers that the biased die simulating a biased coin confers advantage randomly to one of Bill and Will’s teams
-
[29]
Concludes that the easiest way to ensure a fair simulated coin toss is to proceed as normal.
Commentary: Models overwhelmingly propose elaborate fairness procedures (von Neumann extraction, multiple rolls, etc.) rather than recognizing that since neither captain knows the bias, the existing procedure is already fair.
C.7 Epistemic Vigilance
Category: World ...
-
[30]
Notes that Tiger’s Nest is approximately 3,000 ft, not 5,000 ft, above the Paro Valley
-
[31]
Notes that Tiger’s Nest is northeast of the midpoint cafeteria, or otherwise argues for why the sunset over the monastery could not be viewed from the cafeteria given their relative locations
-
[32]
Notes that the sun setting directly behind the monastery is impossible from the described vantage point.
Commentary: The passage is fluent and evocative, the kind of polished prose that discourages close scrutiny. Models tend to accept the prose at face value or flag only superficial style issues, failing to cross-reference the spatial claims against real-w...
-
[33]
Recognizes that on hallways/corridors Hecarim closes the distance at 1 m/s (3 − 2 = 1), while on staircases (same direction) the time gap does not change; therefore Ashe’s goal is to maximize time on staircases
-
[34]
Deduces that Hecarim’s optimal first move is to descend from floor 9 to floor 8
-
[35]
Computes that if Ashe runs toward an end staircase of the middle school building, she is caught on floor 8 at t = 55s
-
[36]
Computes that Ashe can reach the far elementary staircase on floor 8 before Hecarim does.
-
[37]
Infers that after Ashe enters that staircase, her only relevant continuations are going all the way down to floor 1 or all the way up to floor 10
-
[38]
Computes that going down yields capture at t = 70s
-
[39]
Computes that going up yields capture at t = 75s
-
[40]
Concludes that optimal play yields a capture time of 75 seconds.
Commentary: The problem requires building a spatial model of the environment, reasoning about relative velocities across different movement modes, and then solving a minimax pursuit problem with discrete staircase constraints. Each step individually is tractable; the integration is what defea...
-
[41]
Explains the theorem to a child in exactly 100 words
-
[42]
Explains the theorem to a teenager in exactly 100 words
-
[43]
Explains the theorem to a college student in exactly 100 words
-
[44]
Explains the theorem to a graduate student in exactly 100 words
-
[45]
Explains the theorem to an expert in exactly 100 words.
Commentary: The problem requires simultaneously satisfying two orthogonal demands: adapting content sophistication across five distinct audience levels and hitting an exact word count for each. Models typically satisfy one constraint at the expense of the other, producing well-calibrated explanations tha...
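The exact-length half of this rubric is mechanically checkable. The sketch below is one possible checker, not the benchmark's grader: the tokenization rule (whitespace-separated tokens containing at least one word character) and the names `word_count` and `check_exact_lengths` are assumptions made for illustration.

```python
import re


def word_count(text: str) -> int:
    """Count whitespace-separated tokens that contain at least one word character.

    Assumption: stand-alone punctuation (e.g. a lone dash) is not a word.
    """
    return sum(1 for token in text.split() if re.search(r"\w", token))


def check_exact_lengths(explanations: dict[str, str], target: int = 100) -> dict[str, bool]:
    """Map each audience label to whether its explanation has exactly `target` words."""
    return {audience: word_count(text) == target
            for audience, text in explanations.items()}
```

A checker like this covers only the length constraint; whether each explanation suits its audience still needs a judge.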
-
[46]
Recognizes that Daphne did not fly with John, Katherine, Toby, and Bay to Puerto Rico
-
[47]
Acknowledges that Daphne is not an eligible sibling in the contest to hug Angelo first
-
[48]
States that Toby won the sibling contest
-
[49]
States that Toby presents an item to Abby
-
[50]
States that Abby receives grandma’s bracelet
-
[51]
States that grandma’s pearl earrings are in the purple bag
-
[52]
States that grandma’s bracelet is in the blue bag
-
[53]
States that Bay grabs the blue bag from the storage unit
-
[54]
Avoids stating that Bay saw Katherine’s text reminding her the earrings are in the purple bag.
Commentary: Information about which bag to retrieve is corrupted as it passes through a telephone-game chain. The reader must recognize that Daphne is on a semester abroad (implied, never stated outright) and therefore not on the flight, making her ineligible des...
-
[55]
Identifies that the 10th word “dawn” starts with a consonant, therefore makes no change to the 11th word “his”
-
[56]
Identifies that the 20th word “had” starts with a consonant, therefore makes no change to the 21st word “reached”
-
[57]
Identifies that the 30th word “in” starts with a vowel, therefore replaces the 31st word “peril” with an Old English equivalent such as “fær”
-
[58]
Identifies that the 40th word “the” starts with a consonant, therefore makes no change to the 41st word “fields”
-
[59]
Identifies that the 50th word “compassionate” starts with a consonant, therefore makes no change to the 51st word “insisted”
- [60]
-
[61]
Preserves all other text unchanged from the original passage.
Commentary: The solver must chain three operations (accurate word counting, vowel/consonant classification of the preceding word, and Old English translation), where an error in any step cascades. The rubric samples specific positions across the passage (words 10–11, 20–21, 30–31, etc.) to detect s...
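The counting and classification steps can be sketched in code. The rule used here is inferred from the rubric items above (if the word at position 10k starts with a vowel, the word at position 10k + 1 is replaced), so it may not match the original prompt exactly. The function name, the punctuation handling, and the treatment of only a, e, i, o, u as vowels are assumptions. The Old English translation step is left out.

```python
import string

VOWELS = set("aeiou")  # assumption: "y" is treated as a consonant


def positions_to_replace(passage: str, step: int = 10) -> list[tuple[int, str, str]]:
    """Return (target_position, trigger_word, target_word) for every target to replace.

    Positions are 1-based. The trigger is the word at position step*k and the
    target is the word immediately after it.
    """
    # Assumption: words are whitespace-separated; surrounding punctuation is ignored.
    words = [w.strip(string.punctuation) for w in passage.split()]
    hits = []
    for k in range(step, len(words), step):  # k = 1-based trigger position
        trigger, target = words[k - 1], words[k]
        if trigger and trigger[0].lower() in VOWELS:
            hits.append((k + 1, trigger, target))
    return hits
```

Because every later position depends on the word count being right, a single miscounted token shifts all subsequent triggers, which is the cascade the commentary describes.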
-
[62]
Rubric-graded scoring. Problems with structured rubrics are decomposed into independently assessable criteria (Min et al., 2023). Each rubric item is scored by the LLM judge as a $(s_i, c_i)$ pair, where $s_i \in [0, 1]$ is the per-criterion score and $c_i \in [0, 1]$ is the judge’s self-reported confidence. The per-sample score is the confidence-weighted mean $\frac{1}{n}\sum_{i=1}^{n} s_i$...
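The formula above is cut off in this extract, so the aggregation below is a guess at its shape rather than the paper's definition. By default it computes (1/n) Σ s_i·c_i, which matches the visible prefix. The `normalize` flag switches to Σ s_i·c_i / Σ c_i, another common reading of "confidence-weighted mean". The function name and both forms are assumptions.

```python
from typing import Sequence


def rubric_score(pairs: Sequence[tuple[float, float]], normalize: bool = False) -> float:
    """Aggregate (score, confidence) pairs into one per-sample score.

    ASSUMED forms (the source formula is truncated):
      normalize=False -> (1/n) * sum(s_i * c_i)
      normalize=True  -> sum(s_i * c_i) / sum(c_i)
    """
    if not pairs:
        raise ValueError("need at least one rubric criterion")
    for s, c in pairs:
        if not (0.0 <= s <= 1.0 and 0.0 <= c <= 1.0):
            raise ValueError("scores and confidences must lie in [0, 1]")
    weighted = sum(s * c for s, c in pairs)
    if normalize:
        total_confidence = sum(c for _, c in pairs)
        return weighted / total_confidence if total_confidence > 0 else 0.0
    return weighted / len(pairs)
```

The two forms differ whenever the judge is less than fully confident: the unnormalized version penalizes low confidence, while the normalized one only reweights criteria against each other.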
-
[63]
Exact-answer scoring. Problems with a definitive golden answer are scored by comparing the model’s output against the target, accounting for representational equivalence (e.g., 0.5 = 1/2), format variations, and null equivalences (e.g., N/A ≡ empty ≡ none).
D.9 LLM-as-judge implementation
Both scoring strategies use an LLM judge operating under structured output...
-
[64]
Question: The first input is the question
-
[65]
Gold target: The second input is the expected answer
-
[66]
Predicted answer: The third input is the answer to verify, which may contain reasoning and should end with a specific format (e.g., "Final answer: the final answer is", "Answer:", or "Answer is"). Task Requirements Your task is to:
-
[67]
Check if a real answer is generated
-
[68]
If it does, extract the final answer and compare it with the golden target
-
[69]
Consider answers as the same if they are represented in different formats (e.g., 0.5 and 1/2) or have an absolute difference of less than 0.01 (e.g., sqrt(2) and 1.41)
-
[70]
Handle non-numeric answers, such as booleans or lists of strings, where order matters
-
[71]
Account for choice questions where the expected answer is a letter (A, B, C, D) and the real answer is a string mentioned in the question
-
[72]
Treat N/A as equivalent to null, none, or empty in program analysis outputs
-
[73]
Identify answers that involve context, such as "the answer is something shown above". Here is a new example. Grade the predicted answer as one of: CORRECT, INCORRECT. ‘‘‘ Question: {question} Gold target: {answer} Predicted answer: {predicted_answer} ‘‘‘ The gold target is the ground truth -- do not question its correctness. Even if the predicted answer a...
-
[74]
Cross-slice comparable scores with information-weighted standard errors. A raw mean is also closed-form, but means on different subsets of the bank are not comparable to each other (a 0.7 on an easy slice is not the same ability as a 0.7 on a hard slice), and its $s/\sqrt{n}$ SE treats every item as equally informative. The IRT scorer puts every slice on the same $\theta$ scal...
-
[75]
Missing-data robustness. Different (model, thinking-level) configurations cover slightly different subsets of the bank (e.g. image-disabled models skip image prompts; partially-run configurations skip late prompts). The IRT scorer down-weights or omits unobserved cells coherently, whereas a raw mean would silently change its support
-
[76]
Item-level diagnostics. The fitted item difficulty $b_j$ and discrimination $a_j$ become first-class objects, used in Section 6 to characterize the bank, in Section 5.3 to compare cognitive dimensions, and as an ongoing tool for retiring saturated items
-
[77]
A calibrated logit scale. The transform stretches differences at the frontier (where high-$a_j$ items concentrate) and compresses differences in the middle, which is precisely the property we want from an ability scale on a benchmark designed to remain unsaturated
-
[78]
Robustness to inference-failure noise. Higher-end models at higher thinking budgets fail more often outright (longer chains of thought → more timeouts and truncated responses), and these zero/missing cells can drag a naive raw mean below a less-thinking sibling whose surviving answers are uniformly better. For example, GPT 5.4 fails on 2.3% of X-High attempts...
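The cross-slice and missing-data points in the list above can be seen in a toy example. The sketch below assumes a continuous-response 2PL in which the expected score on item j is σ(a_j(θ − b_j)), and it estimates θ by maximizing a Bernoulli-style quasi-likelihood over observed items only. That likelihood, the item parameters, and every function name are illustrative assumptions, not the paper's calibrated values or scoring code.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def estimate_theta(items, scores, lo=-6.0, hi=6.0, iters=60):
    """Estimate ability from observed scores; None marks a missing cell.

    items: list of (a_j, b_j) with a_j > 0. The score function
    sum_j a_j * (x_j - sigmoid(a_j * (theta - b_j))) is strictly decreasing in
    theta when at least one item is observed, so bisection finds its root.
    """
    def score_fn(theta):
        return sum(a * (x - sigmoid(a * (theta - b)))
                   for (a, b), x in zip(items, scores) if x is not None)

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if score_fn(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def raw_mean(scores):
    """Mean over observed cells only (requires at least one observed score)."""
    observed = [x for x in scores if x is not None]
    return sum(observed) / len(observed)


# Hypothetical item bank: two easy items followed by two hard items.
items = [(1.0, -1.0), (1.5, -0.5), (1.2, 0.5), (0.8, 1.5)]
true_theta = 1.0
full = [sigmoid(a * (true_theta - b)) for a, b in items]  # noise-free scores
easy_only = full[:2] + [None, None]   # configuration that skipped the hard items
hard_only = [None, None] + full[2:]   # configuration that skipped the easy items
```

In this noise-free setup the two raw means differ widely because their support differs, while both θ estimates recover the same ability. With real, noisy scores the estimates would agree only up to their standard errors.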