Revisiting the evaluation of theory of mind through question answering
5 Pith papers cite this work.
Citation facets: years 2026 (5); verdicts unverdicted (5); roles background (1); polarity classification still indexing.
Citing papers
- EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
  The EnactToM benchmark reveals that frontier AI models achieve 0% task completion on functional Theory of Mind in embodied multi-agent settings, despite averaging 45% on literal belief probes.
- Cognifold: Always-On Proactive Memory via Cognitive Folding
  Cognifold is a proactive memory architecture that folds event streams into emergent cognitive structures, extending complementary learning systems theory with a prefrontal intent layer and self-organizing graph topology.
- Instructions Shape Production of Language, not Processing
  Instructions trigger a production-centered mechanism in language models: task-specific information is stable in input tokens but varies strongly in output tokens and correlates with behavior.
- DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories
  LLMs identify mental states in dialogues well but, with the exception of Gemini 3 Pro, mostly fail to forecast state-consistent future trajectories, showing only weak overlap with human inferences.
- PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking
  PDDL-Mind improves LLM accuracy on theory-of-mind benchmarks by over 5% by translating stories into verifiable PDDL states, decoupling environment tracking from belief inference.