OmniToM is a new benchmark for Theory of Mind in LLMs that evaluates explicit belief extraction and seven-dimensional labeling from 895 stories, revealing an actor-specific belief-tracking bottleneck.
Revisiting the evaluation of theory of mind through question answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 5872–5877.
8 Pith papers cite this work.
citation summary
years: 2026 (8)
roles: background (2)
polarities: background (2)
representative citing papers
Empirical evaluation on the PLUME benchmark shows steering vectors vary widely in trait expressibility, degrade on task transfer, and lose effectiveness when multiple vectors are composed.
PerspectiveGap benchmark shows LLMs achieve only 14.9% average pass rate on multi-agent orchestration prompting tasks, with GPT-5.5 at 62%.
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.
PDDL-Mind improves LLM accuracy on theory-of-mind benchmarks by over 5% by translating stories into verifiable PDDL states that decouple environment tracking from belief inference.
CogniFold extends Complementary Learning Systems theory to three layers with a prefrontal intent layer and uses graph self-organization to build proactive agent memory from continuous event streams.