Introduces NCP-ExploreToM framework to evaluate LLMs on inducing belief states via planning and action, with GPT-5 succeeding on ~80% of tasks and outperforming humans.
Towards dynamic theory of mind: Evaluating llm adaptation to temporal evolution of human states.arXiv:2505.17663
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Multi-agent LLM systems can be steered via prompt design from mere aggregates to higher-order collectives with identity-linked differentiation and goal-directed complementarity, as measured by partial information decomposition of time-delayed mutual information.
AgencyBench is a new benchmark with 138 tasks in 32 scenarios that measures autonomous agent performance on extended real-world problems using simulated feedback and sandboxed assessment.
citing papers explorer
-
Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action
Introduces NCP-ExploreToM framework to evaluate LLMs on inducing belief states via planning and action, with GPT-5 succeeding on ~80% of tasks and outperforming humans.
-
Emergent Coordination in Multi-Agent Language Models
Multi-agent LLM systems can be steered via prompt design from mere aggregates to higher-order collectives with identity-linked differentiation and goal-directed complementarity, as measured by partial information decomposition of time-delayed mutual information.
-
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
AgencyBench is a new benchmark with 138 tasks in 32 scenarios that measures autonomous agent performance on extended real-world problems using simulated feedback and sandboxed assessment.