EnactToM benchmark reveals frontier AI models achieve 0% on functional Theory of Mind task completion in embodied multi-agent settings despite 45% average on literal belief probes.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.AI 3years
2026 3verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
citing papers explorer
-
EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
EnactToM benchmark reveals frontier AI models achieve 0% on functional Theory of Mind task completion in embodied multi-agent settings despite 45% average on literal belief probes.
-
The Generalized Turing Test: A Foundation for Comparing Intelligence
The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.
-
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.