Interactive Benchmarks

Baoqing Yue , Zihan Zhu , Yutong Han , Qian Sun , Jichen Feng , Hufei Yang , Yifan Zhang , Mengdi Wang

classification 💻 cs.AI cs.CL cs.LG

keywords interactive benchmarks models ability evaluation intelligence model reasoning

Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
cs.CL 2026-05 unverdicted novelty 7.0

LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.