Introduces NCP-ExploreToM framework to evaluate LLMs on inducing belief states via planning and action, with GPT-5 succeeding on ~80% of tasks and outperforming humans.
Towards dynamic theory of mind: Evaluating LLM adaptation to temporal evolution of human states. arXiv:2505.17663
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Multi-agent LLM systems can be steered via prompt design from mere aggregates to higher-order collectives with identity-linked differentiation and goal-directed complementarity, as measured by partial information decomposition of time-delayed mutual information.
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
AgencyBench is a new benchmark with 138 tasks in 32 scenarios that measures autonomous agent performance on extended real-world problems using simulated feedback and sandboxed assessment.