Interactive Benchmarks

Baoqing Yue; Brian Fan; Hufei Yang; Jichen Feng; Mengdi Wang; Qian Sun; Yifan Zhang; Yutong Han; Zihan Zhu

arxiv: 2603.04737 · v4 · pith:WQZ6GTIBnew · submitted 2026-03-05 · 💻 cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-21 12:13 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords interactive benchmarks · AI reasoning evaluation · multi-turn interaction · budgeted interaction · logic proofs · strategic games · model intelligence assessment

The pith

Interactive benchmarks evaluate AI reasoning through budgeted multi-turn interactions with objective feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fixed benchmarks for AI reasoning are becoming saturated and easy to contaminate, while preference-based tests depend on subjective judgments. The paper claims a core part of intelligence is the capacity to decide what information to gather and how to apply it under constraints. It introduces Interactive Benchmarks as a single framework that measures this through multi-turn exchanges where a model interacts with a judge under a fixed budget. Tests cover logic proofs, UI conversion, math problems, and strategic games. The results indicate current models still have large gaps in handling these adaptive, interactive settings.

Core claim

Interactive Benchmarks form a unified evaluation paradigm that measures reasoning ability by having models engage in budgeted multi-turn interactions, either with an objective judge in proof-style tasks or by maximizing long-horizon utility in games, thereby providing a more robust signal than saturated fixed sets or subjective preferences.

What carries the argument

Interactive Benchmarks, a paradigm that runs budgeted multi-turn interactions between the model and a judge or environment to test the ability to acquire and apply information effectively.
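A minimal sketch of what such a budgeted loop could look like, assuming a toy task in which the hidden answer is an integer and the judge answers only yes/no comparison questions. The names `Judge`, `narrow`, `midpoint_policy`, `linear_policy`, and `run_episode` are hypothetical illustrations; the text reviewed here does not specify the paper's actual protocol.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[int, bool]]          # (query, judge answer) pairs
Policy = Callable[[int, int, History], int]


@dataclass
class Judge:
    """Hypothetical objective judge: holds a secret integer."""
    secret: int

    def answer(self, x: int) -> bool:
        # Factual yes/no feedback only: "is the secret <= x?"
        return self.secret <= x

    def verify(self, guess: int) -> bool:
        return guess == self.secret


def narrow(lo: int, hi: int, history: History) -> Tuple[int, int]:
    """Shrink [lo, hi] to the interval consistent with all past answers."""
    for x, yes in history:
        if yes:
            hi = min(hi, x)
        else:
            lo = max(lo, x + 1)
    return lo, hi


def midpoint_policy(lo: int, hi: int, history: History) -> int:
    """Ask about the midpoint of the remaining interval."""
    lo, hi = narrow(lo, hi, history)
    return (lo + hi) // 2


def linear_policy(lo: int, hi: int, history: History) -> int:
    """Ask about the smallest remaining candidate (a wasteful strategy)."""
    lo, hi = narrow(lo, hi, history)
    return lo


def run_episode(judge: Judge, policy: Policy, lo: int, hi: int,
                budget: int) -> Tuple[bool, int]:
    """Run one episode; return (solved, queries_used)."""
    history: History = []
    while len(history) < budget:
        cur_lo, cur_hi = narrow(lo, hi, history)
        if cur_lo == cur_hi:          # answer is pinned down, stop asking
            break
        x = policy(lo, hi, history)
        history.append((x, judge.answer(x)))
    cur_lo, cur_hi = narrow(lo, hi, history)
    solved = cur_lo == cur_hi and judge.verify(cur_lo)
    return solved, len(history)
```

In this toy setting, the same budget separates policies by how well they choose queries: with a secret of 37 in the range 1 to 100 and a budget of 7, the midpoint policy solves the task while the linear policy runs out of queries.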

If this is right

Provides an objective alternative to subjective preference judgments for assessing reasoning.
Highlights substantial remaining room for improvement in models' interactive capabilities.
Applies uniformly to proof-style tasks like logic, UI2Html, and mathematics as well as to strategic games.
Reduces vulnerability to benchmark contamination compared with static fixed sets.
Supports evaluation under explicit resource budgets that mirror practical constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This setup could be extended to train models directly on interactive traces rather than static examples.
It points toward hybrid benchmarks that combine interaction with human-like partial observability in other domains such as robotics or scientific inquiry.
Longer budget horizons might expose planning deficits that short interactions hide.

Load-bearing premise

The ability to decide what information to acquire and how to use it effectively is a core aspect of intelligence that can be validly measured through budgeted multi-turn interaction with objective judge feedback.

What would settle it

If models that score highest on these interactive tasks show no advantage over lower-scoring models when deployed in open-ended real-world settings that require gathering information over multiple steps, such as debugging code with limited tool calls or negotiating with limited queries, the measurement would be undermined.
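One way to make this test concrete is a rank correlation between benchmark scores and a downstream outcome measure: a value near zero would indicate that benchmark ranking carries no advantage in deployment. The sketch below computes Spearman's rank correlation in plain Python. The function names are hypothetical, and the choice of Spearman's coefficient is an assumption made here for illustration, not a procedure described in the paper.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Here `x` would hold per-model benchmark scores and `y` the matching per-model outcomes in the open-ended setting; both inputs are left to the reader because the page reports no such numbers.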

Original abstract

Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper proposes budgeted multi-turn interactions as a way to test information-acquisition reasoning, but the abstract gives too little on protocols to back the robustness claim.

The main thing to know is that this paper tries to fix saturation in fixed benchmarks and subjectivity in preference evals by shifting to budgeted multi-turn interactions. Models have to decide what to ask for or how to act within limits, with objective judge feedback in one track and strategic game play in the other.

The unified framing for Interactive Proofs on logic, UI2Html, and math plus Interactive Games is the clearest new piece here. It directly targets a plausible core skill, gathering and using information effectively, and the abstract is right that current methods miss this slice. That part of the argument lands cleanly and gives the work a coherent direction. The results are described as showing more robust measurement and clear room for model improvement, which fits the motivation.

The soft spot is the missing mechanics. The abstract does not spell out how budgets are set or enforced across models, what counts as termination, or exactly how the judge feedback stays objective without leaking structure. If those rules vary by task or let some models exploit the limits differently, the claimed robustness could be an artifact of the setup rather than a better signal of the targeted ability. The stress-test note on uniform budget enforcement and non-leading feedback still applies based on what is visible.

This is for people building or critiquing reasoning benchmarks, especially those moving toward agentic or interactive systems. A reader who wants concrete alternatives to saturated tests would get value from the idea and the two settings. The paper shows straightforward engagement with the literature problems it names. It deserves peer review so the methods and any quantitative comparisons can be checked in detail.

Referee Report

2 major / 1 minor

Summary. The paper argues that fixed benchmarks are saturated and vulnerable to contamination while preference-based evaluations are subjective. It proposes Interactive Benchmarks as a unified paradigm to assess reasoning via budgeted multi-turn interaction with objective judge feedback, claiming this better measures the core intelligence aspect of deciding what information to acquire and how to use it. Evaluations are described in two settings: Interactive Proofs (Logic, UI2Html, Mathematics tasks) and Interactive Games (strategic long-horizon utility maximization). The authors conclude that the approach yields a more robust assessment and reveals substantial room for improvement in interactive scenarios.

Significance. If the budgeted interaction protocol can be shown to isolate information-acquisition decisions more reliably than static tests, the framework could address key limitations in current reasoning benchmarks and provide a contamination-resistant evaluation method. The proposal is timely given saturation concerns in existing suites, but the manuscript supplies no quantitative results, model details, or protocol specifications, so the practical significance remains prospective rather than demonstrated.

major comments (2)

Abstract: the central claim that 'interactive benchmarks provide a more robust assessment' and 'reveal substantial room for improvement' is unsupported by any methodology details, quantitative findings, error analysis, or comparison data, which is load-bearing for the paper's assertion of superiority over fixed benchmarks.
Interactive Proofs and Interactive Games sections: the manuscript provides no explicit protocol for budget allocation (turns/queries/cost), termination conditions, or generation of strictly non-leading objective feedback, leaving open the possibility that apparent robustness differences arise from interaction rules rather than the targeted intelligence dimension.

minor comments (1)

The abstract would be strengthened by naming the specific models tested and at least one key quantitative result to make the 'results show' statement concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We agree that greater specificity is required to substantiate the central claims and will revise the manuscript to address the points raised.

Referee: Abstract: the central claim that 'interactive benchmarks provide a more robust assessment' and 'reveal substantial room for improvement' is unsupported by any methodology details, quantitative findings, error analysis, or comparison data, which is load-bearing for the paper's assertion of superiority over fixed benchmarks.

Authors: The abstract summarizes results presented in the body of the manuscript, where quantitative evaluations on Interactive Proofs (Logic, UI2Html, Mathematics) and Interactive Games demonstrate clear performance gaps relative to static benchmarks. To make these claims more self-contained, we will revise the abstract to include a concise summary of key quantitative findings, error patterns, and direct comparisons.
revision: yes

Referee: Interactive Proofs and Interactive Games sections: the manuscript provides no explicit protocol for budget allocation (turns/queries/cost), termination conditions, or generation of strictly non-leading objective feedback, leaving open the possibility that apparent robustness differences arise from interaction rules rather than the targeted intelligence dimension.

Authors: We agree that explicit protocol details are necessary for reproducibility and to isolate the targeted reasoning dimension. We will add a dedicated subsection specifying budget allocation (maximum turns, per-query costs), termination conditions (judge-determined completion or budget exhaustion), and the generation of objective, non-leading feedback. This revision will clarify that observed differences arise from models' information-acquisition decisions.
revision: yes
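The protocol elements named in this response (a maximum number of turns, per-query costs, and termination on judge-determined completion or budget exhaustion) could be encoded along the following lines. This is only a sketch under stated assumptions: `Budget`, `Termination`, and `step` are hypothetical names, costs are modeled as integers, and nothing here is taken from the promised subsection, which is not part of the text shown.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Termination(Enum):
    """The two termination conditions named in the rebuttal."""
    JUDGE_COMPLETED = "judge_completed"
    BUDGET_EXHAUSTED = "budget_exhausted"


@dataclass
class Budget:
    """Hypothetical budget: a turn cap plus a total cost cap."""
    max_turns: int
    max_cost: int
    turns_used: int = 0
    cost_used: int = 0

    def can_afford(self, cost: int) -> bool:
        # A query is allowed only if both caps would still be respected.
        return (self.turns_used < self.max_turns
                and self.cost_used + cost <= self.max_cost)

    def charge(self, cost: int) -> None:
        if not self.can_afford(cost):
            raise ValueError("query exceeds remaining budget")
        self.turns_used += 1
        self.cost_used += cost


def step(budget: Budget, query_cost: int,
         judge_says_done: bool) -> Optional[Termination]:
    """Charge one query; return a termination reason, or None to continue."""
    if not budget.can_afford(query_cost):
        return Termination.BUDGET_EXHAUSTED
    budget.charge(query_cost)
    if judge_says_done:
        return Termination.JUDGE_COMPLETED
    return None
```

Applying one `Budget` configuration to every model is one way to address the referee's concern about uniform enforcement: the harness, not the model, decides when an episode ends and records why.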

Circularity Check

0 steps flagged

No circularity: new evaluation paradigm introduced without self-referential reductions

The paper proposes Interactive Benchmarks as a new evaluation paradigm for assessing reasoning via budgeted multi-turn interaction, supported by descriptions of Interactive Proofs and Interactive Games settings. No derivation chain, equations, fitted parameters, or self-citations are present that reduce claims to inputs by construction. The central argument that interactive benchmarks provide a more robust assessment rests on empirical evaluation and the stated motivation about acquiring information, without any load-bearing step that equates outputs to prior author-defined results or renames known patterns via ansatz. The framework is self-contained as an independent proposal for new benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests primarily on the domain assumption that information-acquisition ability is core to intelligence and on the new benchmark framework itself; no free parameters or external entities are introduced.

axioms (1)

domain assumption · A core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively.
Directly stated in the abstract as motivation for the interactive evaluation paradigm.

invented entities (1)

Interactive Benchmarks · no independent evidence
purpose: Unified evaluation paradigm using budgeted multi-turn interaction to assess reasoning.
Newly proposed framework without independent external validation or falsifiable predictions outside the paper.

pith-pipeline@v0.9.0 · 5669 in / 1379 out tokens · 67119 ms · 2026-05-21T12:13:11.888701+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
"We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks... under budget constraints."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
"Interactive Proofs... models interact with a judge... under objective feedback... budget B."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
cs.CL · 2026-05 · unverdicted · novelty 7.0

LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

Interactive Evaluation Requires a Design Science
cs.AI · 2026-05 · unverdicted · novelty 5.0

Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axi...