arXiv preprint arXiv:2508.10142

Multi-turn puzzles: Evaluating interactive reasoning and strategic dialogue in LLMs · arXiv 2508.10142

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.

citing papers explorer

Showing 1 of 1 citing paper.

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

cs.AI · 2026-05-26 · unverdicted · none · ref 16
