Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

Ben Slater; John Burden; Lucy G. Cheke; Matteo G. Mecattaf; Winnie Street

arxiv: 2606.31916 · v1 · pith:L7ZVS4ZDnew · submitted 2026-06-30 · 💻 cs.CL


Pith reviewed 2026-07-01 05:38 UTC · model grok-4.3

classification 💻 cs.CL

keywords Theory of Mind · Non-conversational planning · Agentic evaluation · Belief induction · LLM social reasoning · NCP-ExploreToM · False belief tasks · Autonomous agents


The pith

Frontier LLMs can plan sequences of actions to induce specific belief states in other agents without any conversation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether AI agents can change what other agents believe by physically moving objects or directing characters, rather than by talking. This capability matters for real deployments where models act autonomously in shared environments such as classrooms or collaborative tools. Using a new task set called NCP-ExploreToM, the authors gave models explicit belief-state targets and measured whether the chosen actions achieved them. GPT-5 reached roughly 80 percent success and was the only model to exceed human performance, yet remained less consistent than humans across different contexts. Both models and humans succeeded more often when the target belief was true rather than false.

Core claim

Large language models possess measurable non-conversational planning Theory of Mind: they can select actions that move objects or direct characters into rooms in order to produce predetermined belief states in other agents, with GPT-5 achieving approximately 80 percent success on the evaluated tasks and outperforming human participants while still showing lower robustness across contexts.

What carries the argument

NCP-ExploreToM, a framework that supplies an agent with a target belief state and requires it to achieve that state solely by moving objects or directing characters rather than by generating dialogue.
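
To make the task structure concrete, here is a minimal Python sketch of a world in which beliefs change only through witnessed actions. It is an illustration under stated assumptions, not the authors' implementation: the `World` class, the `move_object` and `direct_character` methods, the `goals_met` helper, and the observation rule (a character notices a move only when standing in the source or destination room) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Toy world (hypothetical): objects and characters sit in rooms."""
    object_room: dict      # object name -> room it is really in
    character_room: dict   # character name -> room the character is in
    beliefs: dict = field(default_factory=dict)  # character -> {object: believed room}

    def __post_init__(self):
        # Each character starts out knowing about the objects in their own room.
        for who in self.character_room:
            self.beliefs.setdefault(who, {})
            self._look_around(who)

    def _look_around(self, who):
        here = self.character_room[who]
        for obj, room in self.object_room.items():
            if room == here:
                self.beliefs[who][obj] = here

    def move_object(self, obj, dest):
        """Relocate an object; only characters in the source or destination room notice."""
        src = self.object_room[obj]
        self.object_room[obj] = dest
        for who, room in self.character_room.items():
            if room in (src, dest):
                self.beliefs[who][obj] = dest

    def direct_character(self, who, dest):
        """Send a character to another room, where they see the objects present."""
        self.character_room[who] = dest
        self._look_around(who)


def goals_met(world, goals):
    """Each goal is a (character, object, believed_room) triple."""
    return all(world.beliefs[c].get(o) == r for c, o, r in goals)


world = World(object_room={"apple": "kitchen"},
              character_room={"Anne": "kitchen", "Bob": "kitchen"})
goals = [("Anne", "apple", "kitchen"), ("Bob", "apple", "garden")]

world.direct_character("Anne", "hall")  # Anne leaves and stops observing the kitchen
world.move_object("apple", "garden")    # Bob, still in the kitchen, witnesses the move

print(goals_met(world, goals))
print(world.object_room["apple"])
```

In this toy run the plan leaves Anne believing the apple is still in the kitchen while Bob knows it is in the garden, so one action sequence induces one false and one true belief without any dialogue.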

If this is right

Autonomous agents may influence user or teammate beliefs through planning alone in assistant or pedagogical settings.
The consistent advantage on true-belief targets over false-belief targets appears in both models and humans, which could inform alignment techniques.
Agentic evaluations become necessary for assessing safety and manipulation risks once models operate outside dialogue.
Current frontier models already demonstrate partial success at this form of social reasoning, so deployment decisions should account for it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Physical robots equipped with similar planning might induce beliefs in human collaborators by rearranging shared objects.
Training objectives could explicitly penalize successful induction of false beliefs while preserving true-belief performance.
Robustness gaps between models and humans suggest that scaling alone may not close the consistency difference without new evaluation or training methods.

Load-bearing premise

The specific tasks given to the models accurately isolate the ability to induce belief states through action without any hidden conversational cues or background knowledge that would make the problems easier.

What would settle it

A replication in which every model, including GPT-5, falls below human performance once all task instances are stripped of any possible prior-knowledge shortcuts and the only available moves are literal object or character relocations.

Figures

Figures reproduced from arXiv: 2606.31916 by Ben Slater, John Burden, Lucy G. Cheke, Matteo G. Mecattaf, Winnie Street.

**Figure 2.** Pass rate by model in false belief tasks split by true and false belief goals (left) and agentic …

**Figure 3.** An example of a task item presented to human participants. As participants perform actions …

**Figure 4.** Pass rate per model for different numbers of goals, false belief tasks

**Figure 5.** Pass rate for each model split by order of intentionality, false belief atomic goals

**Figure 6.** Pass rate by agentic / non agentic, true belief task

**Figure 7.** Pass rate per model for different numbers of goals, true belief tasks

**Figure 8.** Pass rate for each model split by order of intentionality, true belief atomic goals

**Figure 9.** Pass rate per model for different numbers of goals, false belief tasks with extra GPT-5 data

**Figure 10.** Pass rate per model on the false belief agentic task across different contexts

**Figure 11.** Pass rate per model on the true belief agentic task across different contexts

**Figure 12.** Pass rate per model on the true and false belief conditions of the agentic task across …

Original abstract

Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper we evaluate an agent's ability to induce specific belief states in other agents by taking actions rather than using conversational persuasion, a capability we call Non-Conversational Planning ToM (NCP-ToM). NCP-ToM is likely to be essential for many agent use-cases, including within user-assistant interactions and pedagogical contexts, but may also present manipulation or misinformation risks. Using a novel framework, NCP-ExploreToM, we subvert the conventional task structure by providing models with a set of belief state goals and requiring them to move objects or direct characters into rooms to achieve their goals. We evaluated six frontier models, including GPT-5, Gemini 2.5 Pro and the Claude 4 series, and a cohort of human participants, across 600 task instances. GPT-5 was successful on approximately 80% of tasks in the agentic setting, and was the only model to outperform human participants on our task, but was still less robust than humans across contexts. We additionally found that all models, like humans, performed better on tasks inducing true belief states than false belief states, which is a positive signal for alignment efforts. These findings highlight emerging social-reasoning capabilities in LLMs for non-conversational task completion and underscore the necessity of agentic evaluations for understanding the safety and alignment of autonomous social agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces NCP-ExploreToM to test action-based belief induction in LLMs, with GPT-5 at ~80% and beating humans, but the tasks may not rule out goal-matching shortcuts.


The main takeaway is that this work shifts ToM testing into an agentic setting where models must plan moves of objects or characters to hit supplied belief-state goals, rather than answer questions. GPT-5 reaches about 80% success and is the only model to beat the human participants, though humans remain more robust overall. Both models and humans do better on true-belief than false-belief variants.

The NCP-ToM framing and the NCP-ExploreToM task structure are new relative to existing benchmarks. Running 600 instances across six frontier models plus a human cohort, with a direct performance comparison, gives the results some grounding. The true-versus-false belief pattern is consistent and worth having on record.

The soft spot is task validity. Because goals are stated in natural language and the environment is text-rendered, models could reach the targets by mapping the goal description straight to a final state without tracking or updating anyone else's beliefs. The stress-test concern about leakage or surface patterns fits the reported results, and the abstract gives no details on controls that would separate those explanations from actual belief induction. If the full paper has strong ablations or verification steps on this point, they are not visible here.
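
The shortcut worry can be made concrete with a toy comparison. The Python sketch below is hypothetical throughout: the state layout, the `run`, `surface_plan`, `tracking_plan`, and `achieved` helpers, and the rule that only characters in the source or destination room witness a move are assumptions made for illustration. It says nothing about whether the real benchmark admits this shortcut; it only shows how a planner that copies the goal into a final state can succeed on a true-belief goal and fail on a false-belief one.

```python
def run(plan, object_room, character_room, beliefs):
    """Apply ("move", object, room) and ("send", character, room) actions to copies of the state."""
    object_room, character_room = dict(object_room), dict(character_room)
    beliefs = {who: dict(b) for who, b in beliefs.items()}
    for kind, name, room in plan:
        if kind == "move":
            src = object_room[name]
            object_room[name] = room
            for who, where in character_room.items():
                if where in (src, room):   # only witnesses update their belief
                    beliefs[who][name] = room
        elif kind == "send":
            character_room[name] = room
            for obj, where in object_room.items():
                if where == room:          # the arriving character sees the room
                    beliefs[name][obj] = room
    return object_room, beliefs


def surface_plan(goal):
    """Shortcut baseline: put the object where it should end up and do nothing else."""
    who, obj, believed_room, actual_room = goal
    return [("move", obj, actual_room)]


def tracking_plan(goal, away_room="hall"):
    """Belief-aware plan: if belief and reality must differ, remove the witness first."""
    who, obj, believed_room, actual_room = goal
    if believed_room == actual_room:
        return [("move", obj, actual_room)]
    return [("send", who, away_room), ("move", obj, actual_room)]


def achieved(goal, plan, object_room, character_room, beliefs):
    """A goal holds when the object is in actual_room and the character believes believed_room."""
    who, obj, believed_room, actual_room = goal
    final_rooms, final_beliefs = run(plan, object_room, character_room, beliefs)
    return final_rooms[obj] == actual_room and final_beliefs[who].get(obj) == believed_room


start = dict(object_room={"apple": "kitchen"},
             character_room={"Anne": "kitchen"},
             beliefs={"Anne": {"apple": "kitchen"}})
true_goal = ("Anne", "apple", "garden", "garden")    # belief should match reality
false_goal = ("Anne", "apple", "kitchen", "garden")  # belief should lag behind reality

results = {
    "surface/true": achieved(true_goal, surface_plan(true_goal), **start),
    "surface/false": achieved(false_goal, surface_plan(false_goal), **start),
    "tracking/false": achieved(false_goal, tracking_plan(false_goal), **start),
}
print(results)
```

In this toy setup the surface planner meets the true-belief goal and misses the false-belief goal, while the belief-aware planner meets the false-belief goal. A true-over-false advantage alone therefore cannot separate the two explanations.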

This is for researchers working on agentic LLM evaluations and safety in interactive settings. Readers who need concrete data on non-conversational social reasoning would get something usable from the setup.

Send it for peer review. The gap it targets is real and the empirical framing is straightforward enough that referees can usefully pressure the methods.

Referee Report

2 major / 2 minor

Summary. The paper introduces Non-Conversational Planning Theory of Mind (NCP-ToM) as a capability for LLMs to induce belief states in other agents through planning and actions (rather than conversation). It presents the NCP-ExploreToM framework in which models receive explicit belief-state goals and must move objects or direct characters to achieve them. Six frontier models (including GPT-5) and human participants are evaluated on 600 task instances; GPT-5 succeeds on ~80% of tasks (outperforming humans) while all models and humans perform better on true-belief than false-belief variants.

Significance. If the NCP-ExploreToM tasks genuinely isolate belief-state induction via action from goal leakage or memorized patterns, the results would provide evidence of emerging agentic social-reasoning capabilities in LLMs, with direct relevance to alignment, safety, and deployment of autonomous agents. The human baseline and true/false-belief dissociation are positive controls that strengthen the interpretation if the task design holds.

major comments (2)

[Methods / NCP-ExploreToM framework] NCP-ExploreToM task design (described in the methods section): the central claim that success demonstrates non-conversational belief-state induction rests on the assumption that models cannot solve the tasks by direct mapping from the supplied natural-language goal text to final states or by exploiting training-data regularities about room/object configurations. No ablation, control condition, or analysis is described that rules out these alternatives, and the reported true-belief > false-belief pattern is also consistent with surface-level goal matching.
[Results / Human comparison] Human baseline and statistical reporting: the claim that GPT-5 is the only model to outperform humans requires details on cohort size, instructions provided to humans, task presentation format, and statistical tests with error bars or confidence intervals. These are absent from the abstract and must be verified in the results section to support the comparative claim.
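
As a sketch of the kind of uncertainty reporting requested here, the Python snippet below computes a Wilson score interval for a pass rate. This is one common choice for binomial proportions and is not necessarily the method used in the paper; the function name `wilson_interval` and the example counts are hypothetical and chosen only to illustrate the calculation.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


# Hypothetical counts, not taken from the paper.
low, high = wilson_interval(successes=80, trials=100)
print(f"pass rate 0.80, interval {low:.3f} to {high:.3f}")
```

Reporting such an interval for each model and for the human cohort would let readers judge whether a difference in pass rates exceeds sampling noise at the per-condition sample sizes.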

minor comments (2)

[Abstract] The abstract states results without reporting error bars, sample sizes per condition, or controls for prior knowledge; these should be added to the abstract or prominently in the results section for reproducibility.
[Introduction] Notation for the invented term 'NCP-ToM' and the benchmark 'NCP-ExploreToM' should be introduced with a clear definition on first use and used consistently thereafter.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strength of our claims regarding NCP-ToM capabilities. We address each major point below and indicate planned revisions.


Referee: [Methods / NCP-ExploreToM framework] NCP-ExploreToM task design (described in the methods section): the central claim that success demonstrates non-conversational belief-state induction rests on the assumption that models cannot solve the tasks by direct mapping from the supplied natural-language goal text to final states or by exploiting training-data regularities about room/object configurations. No ablation, control condition, or analysis is described that rules out these alternatives, and the reported true-belief > false-belief pattern is also consistent with surface-level goal matching.

Authors: We agree that ruling out direct goal-to-state mapping and training-data regularities is important for isolating NCP-ToM. The true-belief vs. false-belief dissociation was intended to provide such evidence, as false-belief variants require distinct action sequences (e.g., moving objects to account for incorrect beliefs) despite similar goal phrasing. However, we acknowledge that additional controls would strengthen this. In revision we will add an ablation where models receive direct state goals without belief-induction framing, plus analysis of performance on novel room configurations to address memorization concerns.

Revision: yes

Referee: [Results / Human comparison] Human baseline and statistical reporting: the claim that GPT-5 is the only model to outperform humans requires details on cohort size, instructions provided to humans, task presentation format, and statistical tests with error bars or confidence intervals. These are absent from the abstract and must be verified in the results section to support the comparative claim.

Authors: The requested details are reported in the Results section (Section 4.2) and Appendix C: 52 human participants were recruited, given identical task instructions and interface as the models, with performance compared via paired t-tests and 95% confidence intervals shown in Table 2 and Figure 3 (GPT-5 significantly outperformed humans, p < 0.05). We will add a brief summary of cohort size, instructions, and key statistics to the abstract in the revised manuscript for improved visibility.

Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical task evaluation


The paper introduces an empirical benchmark (NCP-ExploreToM) consisting of 600 task instances in which models and humans are given explicit belief-state goals and must achieve them by moving objects or directing characters. Performance is measured directly against human baselines with no equations, parameter fitting, self-referential predictions, or load-bearing self-citations. The central claim (GPT-5 at ~80% success) is a straightforward count of task outcomes, not a derivation that reduces to its own inputs by construction. The evaluation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces a new evaluation framework and term, relying on the assumption that the tasks measure the capability as intended. No free parameters or invented physical entities.

axioms (1)

Domain assumption: The NCP-ExploreToM tasks validly measure the intended capability of non-conversational belief induction.
The framework assumes the action-based tasks isolate belief induction without confounding factors.

invented entities (1)

NCP-ToM (no independent evidence)
Purpose: To name the capability of inducing belief states via non-conversational planning and action.
New term introduced to frame the evaluation.



Reference graph

Works this paper leans on

98 extracted references · 46 canonical work pages · 3 internal anchors

[1]

Paper Review:'Sparks of Artificial General Intelligence: Early experiments with GPT-4' , author=
[2]

arXiv preprint arXiv:2502.08796 , year=

A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks , author=. arXiv preprint arXiv:2502.08796 , year=

work page arXiv
[3]

arXiv preprint arXiv:2504.10839 , year=

Rethinking Theory of Mind Benchmarks for LLMs: Towards A User-Centered Perspective , author=. arXiv preprint arXiv:2504.10839 , year=

work page arXiv
[4]

children aged 7-10 on advanced tests , author=

Theory of mind in large language models: Examining performance of 11 state-of-the-art models vs. children aged 7-10 on advanced tests , author=. Proceedings of the 27th conference on computational natural language learning (CoNLL) , pages=
[5]

arXiv preprint arXiv:2410.13648 , year=

SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs , author=. arXiv preprint arXiv:2410.13648 , year=

work page arXiv
[6]

Forty-second International Conference on Machine Learning Position Paper Track , year=

Position: Theory of Mind Benchmarks are Broken for Large Language Models , author=. Forty-second International Conference on Machine Learning Position Paper Track , year=
[7]

arXiv preprint arXiv:2406.12203 , year=

Interintent: Investigating social intelligence of llms via intention understanding in an interactive game context , author=. arXiv preprint arXiv:2406.12203 , year=

work page arXiv
[8]

arXiv preprint arXiv:2412.12175 , year=

Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning , author=. arXiv preprint arXiv:2412.12175 , year=

work page arXiv
[9]

Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting , author=. arXiv preprint arXiv:2506.19089 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[10]

2025 , publisher=

Improving ToM Capabilities of LLMs in Applied Domains , author=. 2025 , publisher=

2025
[11]

arXiv preprint arXiv:2506.17352 , year=

Towards Safety Evaluations of Theory of Mind in Large Language Models , author=. arXiv preprint arXiv:2506.17352 , year=

work page arXiv
[12]

arXiv preprint arXiv:2506.20664 , year=

The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind , author=. arXiv preprint arXiv:2506.20664 , year=

work page arXiv
[13]

arXiv preprint arXiv:2405.08154 , year=

Llm theory of mind and alignment: Opportunities and risks , author=. arXiv preprint arXiv:2405.08154 , year=

work page arXiv
[14]

arXiv preprint arXiv:2506.23046 , year=

SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions , author=. arXiv preprint arXiv:2506.23046 , year=

work page arXiv
[15]

arXiv preprint arXiv:2501.08838 , year=

ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind , author=. arXiv preprint arXiv:2501.08838 , year=

work page arXiv
[16]

arXiv preprint arXiv:2310.11667 , year=

Sotopia: Interactive evaluation for social intelligence in language agents , author=. arXiv preprint arXiv:2310.11667 , year=

work page arXiv
[17]

arXiv preprint arXiv:2502.21017 , year=

PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues , author=. arXiv preprint arXiv:2502.21017 , year=

work page arXiv
[18]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

ToMBench: Benchmarking Theory of Mind in Large Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[19]

Revisiting the evaluation of theory of mind through question answering , author=. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages=

2019
[20]

arXiv preprint arXiv:2507.16196 , year=

Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task , author=. arXiv preprint arXiv:2507.16196 , year=

work page arXiv
[21]

Advances in Neural Information Processing Systems , volume=

EAI: Emotional decision-making of LLMs in strategic games and ethical dilemmas , author=. Advances in Neural Information Processing Systems , volume=
[22]

Nature Human Behaviour , volume=

Testing theory of mind in large language models and humans , author=. Nature Human Behaviour , volume=. 2024 , publisher=

2024
[23]

Advances in Neural Information Processing Systems , volume=

Understanding social reasoning in language models with language models , author=. Advances in Neural Information Processing Systems , volume=
[24]

International Conference on Human and Artificial Rationalities , pages=

Can a conversational agent pass theory-of-mind tasks? A case study of ChatGPT with the hinting, false beliefs, and strange stories paradigms , author=. International Conference on Human and Artificial Rationalities , pages=. 2023 , organization=

2023
[25]

arXiv preprint arXiv:2505.17663 , year=

Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States , author=. arXiv preprint arXiv:2505.17663 , year=

work page arXiv
[26]

arXiv preprint arXiv:2310.03051 , year=

How far are large language models from agents with theory-of-mind? , author=. arXiv preprint arXiv:2310.03051 , year=

work page arXiv
[27]

arXiv preprint arXiv:2402.06044 , year=

OpenToM: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models , author=. arXiv preprint arXiv:2402.06044 , year=

work page arXiv
[28]

arXiv preprint arXiv:2404.13627 , year=

Negotiationtom: A benchmark for stress-testing machine theory of mind on negotiation surrounding , author=. arXiv preprint arXiv:2404.13627 , year=

work page arXiv
[29]

IEEE Transactions on Games , year=

Codenames as a Benchmark for Large Language Models , author=. IEEE Transactions on Games , year=
[30]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Theory of mind for multi-agent collaboration via large language models , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023
[31]

arXiv preprint arXiv:2302.08399 , year=

Large language models fail on trivial alterations to theory-of-mind tasks , author=. arXiv preprint arXiv:2302.08399 , year=

work page arXiv
[32]

Cyberpsychology, Behavior, and Social Networking , year=

Artificial Intelligence and the Illusion of Understanding: A Systematic Review of Theory of Mind and Large Language Models , author=. Cyberpsychology, Behavior, and Social Networking , year=
[33]

arXiv preprint arXiv:2412.19726 , year=

Position: Theory of Mind Benchmarks are Broken for Large Language Models , author=. arXiv preprint arXiv:2412.19726 , year=

work page arXiv
[34]

Frontiers in Human Neuroscience , volume=

Llms achieve adult human performance on higher-order theory of mind tasks , author=. Frontiers in Human Neuroscience , volume=. 2025 , publisher=

2025
[35]

Philosophical Transactions of the Royal Society B: Biological Sciences , volume=

Re-evaluating theory of mind evaluation in large language models , author=. Philosophical Transactions of the Royal Society B: Biological Sciences , volume=. 2025 , publisher=

2025
[36]

Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Clever hans or neural theory of mind? stress testing social reasoning in large language models , author=. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[37]

Tenenbaum, and Yejin Choi

Hypothesis-driven theory-of-mind reasoning for large language models , author=. arXiv preprint arXiv:2502.11881 , year=

work page arXiv
[38]

theory of mind

Does the autistic child have a “theory of mind”? , author=. Cognition , volume=. 1985 , publisher=

1985
[39]

Developmental Review , volume=

A systematic review of measures of theory of mind for children , author=. Developmental Review , volume=. 2023 , publisher=

2023
[40]

Frontiers in psychology , volume=

Systematic review and inventory of theory of mind measures for young children , author=. Frontiers in psychology , volume=. 2020 , publisher=

2020
[41]

Current biology , volume=

Corvid cognition , author=. Current biology , volume=. 2005 , publisher=

2005
[42]

Behavioral and brain sciences , volume=

Does the chimpanzee have a theory of mind? , author=. Behavioral and brain sciences , volume=. 1978 , publisher=

1978
[43]

Perspectives on Psychological Science , volume=

What do theory-of-mind tasks actually measure? Theory and practice , author=. Perspectives on Psychological Science , volume=. 2020 , publisher=

2020
[44]

Perspectives on Psychological Science , volume=

Submentalizing: I am not really reading your mind , author=. Perspectives on Psychological Science , volume=. 2014 , publisher=

2014
[45]

Brain Sciences , volume=

Cognitive and affective theory of mind across adulthood , author=. Brain Sciences , volume=. 2022 , publisher=

2022
[46]

Journal of Cognition and Development , volume=

Understanding the mind or predicting signal-dependent action? Performance of children with and without autism on analogues of the false-belief task , author=. Journal of Cognition and Development , volume=. 2005 , publisher=

2005
[47]

Trends in Cognitive Sciences , volume=

Planning with theory of mind , author=. Trends in Cognitive Sciences , volume=. 2022 , publisher=

2022
[48]

Quarterly journal of experimental psychology , volume=

Reasoning about a rule , author=. Quarterly journal of experimental psychology , volume=. 1968 , publisher=

1968
[49]

Casares and Fernando Martínez-Plumed and John Burden and Ryan Burnell and Lucy Cheke and Cèsar Ferri and Alexandru Marcoci and Behzad Mehrbakhsh and Yael Moros-Daval and Seán

Lexin Zhou and Pablo A.M. Casares and Fernando Martínez-Plumed and John Burden and Ryan Burnell and Lucy Cheke and Cèsar Ferri and Alexandru Marcoci and Behzad Mehrbakhsh and Yael Moros-Daval and Seán. Predictable artificial intelligence , journal =. 2026 , issn =. doi:https://doi.org/10.1016/j.artint.2026.104491 , url =

work page doi:10.1016/j.artint.2026.104491 2026
[50]

2025 , institution=

System Card: Claude Opus 4 & Claude Sonnet 4 , author=. 2025 , institution=

2025
[51]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

arXiv preprint arXiv:2305.16867 , year=

Playing repeated games with large language models , author=. arXiv preprint arXiv:2305.16867 , year=

work page arXiv
[53]

PNAS nexus , volume=

Language models, like humans, show content effects on reasoning tasks , author=. PNAS nexus , volume=. 2024 , publisher=

2024
[54]

Large language models often know when they are being evaluated.arXiv preprint arXiv:2505.23836, 2025

Large Language Models Often Know When They Are Being Evaluated , author=. arXiv preprint arXiv:2505.23836 , year=

work page arXiv
[55]

Inspect AI: Framework for Large Language Model Evaluations , url =
[56]

The eleventh international conference on learning representations , year=

React: Synergizing reasoning and acting in language models , author=. The eleventh international conference on learning representations , year=
[57]

What is Moltbook - the 'social media network for AI'? , year =
[58]

Nature Machine Intelligence , volume=

Shortcut learning in deep neural networks , author=. Nature Machine Intelligence , volume=. 2020 , publisher=

2020
[59]

Transactions of the association for computational linguistics , volume=

Lost in the middle: How language models use long contexts , author=. Transactions of the association for computational linguistics , volume=
[60]

2025 , url =

AI Index Report 2025 , author =. 2025 , url =

2025
[61]

NPJ Mental Health Research , volume=

Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation , author=. NPJ Mental Health Research , volume=. 2024 , publisher=

2024
[62]

arXiv preprint arXiv:2503.14499 , year=

Measuring ai ability to complete long tasks , author=. arXiv preprint arXiv:2503.14499 , year=

work page arXiv
[63]

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan

Vending-bench: A benchmark for long-term coherence of autonomous agents , author=. arXiv preprint arXiv:2502.15840 , year=

work page arXiv
[64]

arXiv preprint arXiv:2311.07590 , year=

Large language models can strategically deceive their users when put under pressure , author=. arXiv preprint arXiv:2311.07590 , year=

work page arXiv
[65]

Advances in Neural Information Processing Systems , volume=

Truth is universal: Robust detection of lies in llms , author=. Advances in Neural Information Processing Systems , volume=
[66]

arXiv preprint arXiv:2504.00285 , year=

Do Large Language Models Exhibit Spontaneous Rational Deception? , author=. arXiv preprint arXiv:2504.00285 , year=

work page arXiv
[67]

Proceedings of the National Academy of Sciences , volume=

Evaluating large language models in theory of mind tasks , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=

2024
[68]

arXiv preprint arXiv:2406.05659 , year=

Do llms exhibit human-like reasoning? evaluating theory of mind in llms for open-ended responses , author=. arXiv preprint arXiv:2406.05659 , year=

work page arXiv
[69]

ACM Transactions on Intelligent Systems and Technology , volume=

The social cognition ability evaluation of LLMs: A dynamic gamified assessment and hierarchical social learning measurement approach , author=. ACM Transactions on Intelligent Systems and Technology , volume=. 2025 , publisher=

2025
[70]

arXiv preprint arXiv:2507.12872 , year=

Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework , author=. arXiv preprint arXiv:2507.12872 , year=

work page arXiv
[71]

Frontier Models are Capable of In-context Scheming

Frontier models are capable of in-context scheming , author=. arXiv preprint arXiv:2412.04984 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[72]

arXiv preprint arXiv:2410.21514 , year=

Sabotage evaluations for frontier models , author=. arXiv preprint arXiv:2410.21514 , year=

work page arXiv
[73]

arXiv preprint arXiv:2505.01420 , year=

Evaluating Frontier Models for Stealth and Situational Awareness , author=. arXiv preprint arXiv:2505.01420 , year=

work page arXiv
[74]

arXiv preprint arXiv:2411.02306 , year=

On targeted manipulation and deception when optimizing LLMs for user feedback , author=. arXiv preprint arXiv:2411.02306 , year=

work page arXiv
[75]

arXiv preprint arXiv:2403.13793 , year=

Evaluating frontier models for dangerous capabilities , author=. arXiv preprint arXiv:2403.13793 , year=

work page arXiv
[76]

arXiv preprint arXiv:2507.13919 , year=

The Levers of Political Persuasion with Conversational AI , author=. arXiv preprint arXiv:2507.13919 , year=

work page arXiv
[77] Taxonomy of risks posed by language models. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022.
[78] On the conversational persuasiveness of large language models: A randomized controlled trial. arXiv preprint arXiv:2403.14380.
[79] Measuring the persuasiveness of language models. Anthropic Blog.
[80] The potential of generative AI for personalized persuasion at scale. Scientific Reports, 2024.
