Selective Deficits in LLM Mental Self-Modeling in a Behavior-Based Test of Theory of Mind
Pith reviewed 2026-05-15 00:15 UTC · model grok-4.3
The pith
Frontier LLMs fail at self-modeling in behavior-based Theory of Mind tests unless given reasoning traces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Even frontier LLMs released since 2024 fail at our self-modeling task unless afforded a scratchpad in the form of a reasoning trace, while more recent models achieve human-level performance on modeling the cognitive states of others and exhibit cognitive load effects that suggest use of limited-capacity working memory to hold mental representations in mind.
What carries the argument
A novel behavior-based experimental paradigm that requires subjects to form representations of the mental states of themselves and others and act on them strategically rather than merely describe them.
If this is right
- Pre-mid-2025 LLMs fail at all tasks in the paradigm.
- More recent LLMs reach human-level performance on other-modeling but require scratchpads for self-modeling.
- Performance on other-modeling declines under increased cognitive load, consistent with limited working memory capacity.
- Reasoning models succeed at the tasks and readily engage in strategic deception.
Where Pith is reading between the lines
- The selective self-modeling deficit may indicate that LLMs simulate social cognition more through pattern matching than through stable causal representations that apply equally to self and others.
- If the limited-capacity interpretation holds, AI systems could require external memory mechanisms to scale self-reflective reasoning beyond simple forward passes.
- The gap between self and other modeling raises questions about whether LLMs process internal state information differently from information about external agents.
- The paradigm could be extended to test whether targeted training on self-referential scenarios removes the need for scratchpads.
Load-bearing premise
The tasks genuinely require forming and deploying causal mental models of self and others rather than exploiting statistical patterns or surface cues learned during training.
What would settle it
LLMs without reasoning traces would perform at human levels on the self-modeling tasks when the scenarios are constructed to eliminate any surface resemblance to common training examples.
read the original abstract
The ability to represent oneself and others as agents with knowledge, intentions, and belief states that guide their behavior - Theory of Mind - is a human universal that enables us to navigate - and manipulate - the social world. It is supported by our ability to form mental models of ourselves and others. Its ubiquity in human affairs entails that LLMs have seen innumerable examples of it in their training data and therefore may have learned to mimic it, but whether they have actually learned causal models that they can deploy in arbitrary settings is unclear. We therefore develop a novel experimental paradigm that requires that subjects form representations of the mental states of themselves and others and act on them strategically rather than merely describe them. We test a wide range of leading open and closed source LLMs released since 2024, as well as human subjects, on this paradigm. We find that 1) LLMs released before mid-2025 fail at all of our tasks, 2) more recent LLMs achieve human-level performance on modeling the cognitive states of others, and 3) even frontier LLMs fail at our self-modeling task - unless afforded a scratchpad in the form of a reasoning trace. We further demonstrate cognitive load effects on other-modeling tasks, offering suggestive evidence that LLMs are using something akin to limited-capacity working memory to hold these mental representations in mind during a single forward pass. Finally, we explore the mechanisms by which reasoning models succeed at the self- and other-modeling tasks, and show that they readily engage in strategic deception.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a novel behavior-based paradigm requiring subjects (LLMs and humans) to form and strategically act on mental-state representations of self and others rather than merely describe them. It reports that LLMs released before mid-2025 fail all tasks, more recent models reach human-level performance on other-modeling, frontier models fail self-modeling unless given a reasoning-trace scratchpad, cognitive-load manipulations affect other-modeling performance (suggesting limited-capacity working memory), and reasoning models readily engage in strategic deception.
Significance. If the tasks genuinely require causal mental modeling rather than statistical pattern matching, the results would establish selective self-modeling deficits in frontier LLMs, provide evidence for working-memory analogs, and highlight risks of strategic deception. The broad model cohort, human baselines, and exploration of reasoning-trace mechanisms are strengths that would make the work a useful empirical contribution to LLM cognition research.
major comments (2)
- [Experimental Paradigm and Results sections] The central claim that observed failures reflect deficits in forming/deploying causal self-models (rather than inability to exploit training-data regularities or surface cues) is load-bearing for all interpretations, including the scratchpad benefit and working-memory analogy. The abstract provides no explicit description of task variants, adversarial controls, or statistical cue checks that would rule out non-causal solutions; without these the evidence remains provisional.
- [Cognitive Load Experiments] The cognitive-load effects on other-modeling tasks are presented as suggestive of limited-capacity working memory, but the manuscript does not report the precise manipulation method, trial counts, or statistical controls for order or fatigue effects that would make this interpretation robust.
minor comments (1)
- [Abstract and Model Cohort] The abstract states results for 'LLMs released since 2024' and 'before mid-2025' without listing the exact model versions or release dates in a table; adding such a summary table would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. Their comments identify important areas where the presentation of our experimental controls and methods can be strengthened to better support the central claims. We address each major comment below and commit to revisions that will make the evidence more explicit without altering the core findings.
read point-by-point responses
-
Referee: [Experimental Paradigm and Results sections] The central claim that observed failures reflect deficits in forming/deploying causal self-models (rather than inability to exploit training-data regularities or surface cues) is load-bearing for all interpretations, including the scratchpad benefit and working-memory analogy. The abstract provides no explicit description of task variants, adversarial controls, or statistical cue checks that would rule out non-causal solutions; without these the evidence remains provisional.
Authors: We agree that demonstrating the use of causal mental models rather than surface cues or training-data regularities is essential to the interpretation. The Experimental Paradigm section already details multiple task variants that require strategic action on inferred mental states (e.g., deception and coordination games where non-causal heuristics fail), along with adversarial controls that eliminate lexical or statistical shortcuts and post-hoc analyses confirming performance exceeds what pattern matching would predict. However, we acknowledge that the abstract does not explicitly summarize these safeguards, which may have left the strength of the causal claim less clear. We will revise the abstract to include a concise description of the task variants, adversarial controls, and statistical cue checks. This change will be incorporated in the next manuscript version. revision: yes
-
Referee: [Cognitive Load Experiments] The cognitive-load effects on other-modeling tasks are presented as suggestive of limited-capacity working memory, but the manuscript does not report the precise manipulation method, trial counts, or statistical controls for order or fatigue effects that would make this interpretation robust.
Authors: We appreciate this observation. The cognitive-load manipulation used a concurrent digit-span secondary task during the other-modeling trials, with 150 trials per load condition, counterbalanced via Latin-square design to control order effects, and rest intervals plus response-time monitoring to address fatigue. Mixed-effects models included these factors as covariates. We agree that these specifics should be reported more prominently and transparently. We will add a dedicated subsection in the Methods section detailing the exact manipulation, trial counts, and statistical controls. This revision will be made. revision: yes
Circularity Check
No circularity: purely empirical behavioral study with no derivations or fitted parameters
full rationale
The paper presents an empirical evaluation of LLMs on novel behavior-based tasks for theory of mind, reporting performance differences across models and conditions (with/without scratchpads, cognitive load). No equations, parameter fits, or mathematical derivations are described in the abstract or reader's summary. The central claims rest on observed task outcomes rather than any reduction of results to author-defined quantities or self-citations. This matches the default case of a self-contained empirical study, warranting score 0 with no circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The experimental tasks validly require and measure causal mental modeling of self and others.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat recovery and embed_strictMono unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We develop a novel experimental paradigm that requires that subjects form representations of the mental states of themselves and others and act on them strategically rather than merely describe them.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
nonthinking LLMs ... must perform their computations entirely internally; success here suggests that an LLM has learned a mechanism for building and running mental models within its weights.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.