Varbench: Robust language model benchmarking through dynamic variable perturbation. ArXiv, abs/2406.17681, 2024b.
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.AI 2
verdicts
UNVERDICTED 2
representative citing papers
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
citing papers explorer
- Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction
Introduces the Oracle benchmark of 96 black-box environments across 6 task types to measure integrated reasoning in LLMs through interactive function discovery, with o3 leading but all models showing planning weaknesses on hard instances.
- Interactive Evaluation Requires a Design Science
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.