OmniToM is a new benchmark for Theory of Mind in LLMs that evaluates explicit belief extraction and seven-dimensional labeling from 895 stories, revealing an actor-specific belief-tracking bottleneck.
Revisiting the evaluation of theory of mind through question answering.Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 5872–5877
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7roles
background 2polarities
background 2representative citing papers
PerspectiveGap benchmark shows LLMs achieve only 14.9% average pass rate on multi-agent orchestration prompting tasks, with GPT-5.5 at 62%.
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.
PDDL-Mind improves LLM accuracy on theory-of-mind benchmarks by over 5% by translating stories into verifiable PDDL states that decouple environment tracking from belief inference.
CogniFold extends Complementary Learning Systems theory to three layers with a prefrontal intent layer and uses graph self-organization to build proactive agent memory from continuous event streams.
citing papers explorer
-
OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling
OmniToM is a new benchmark for Theory of Mind in LLMs that evaluates explicit belief extraction and seven-dimensional labeling from 895 stories, revealing an actor-specific belief-tracking bottleneck.
-
PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting
PerspectiveGap benchmark shows LLMs achieve only 14.9% average pass rate on multi-agent orchestration prompting tasks, with GPT-5.5 at 62%.
-
Instructions Shape Production of Language, not Processing
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
-
EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.
-
PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking
PDDL-Mind improves LLM accuracy on theory-of-mind benchmarks by over 5% by translating stories into verifiable PDDL states that decouple environment tracking from belief inference.
-
CogniFold: Always-On Proactive Memory via Cognitive Folding
CogniFold extends Complementary Learning Systems theory to three layers with a prefrontal intent layer and uses graph self-organization to build proactive agent memory from continuous event streams.
- DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories