Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

Abulhair Saparov; Nathaniel Getachew

arxiv: 2506.19089 · v5 · submitted 2025-06-23 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-19 07:25 UTC · model grok-4.3

keywords theory of mind · large language models · synthetic story generation · mental state reasoning · world modeling · ToM evaluation · heuristic behavior · storyboard control

The pith

Most language models track physical events more accurately than other characters' mental states in synthetic stories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StorySim, a controllable system that builds entirely new stories with explicit storyboards so researchers can test whether models understand what different characters know or believe. Experiments on many current models show they score higher when asked to model the world itself than when asked to track first- or second-order beliefs, and they perform better when the mind they must read belongs to a person rather than an object. The same tests also reveal that models often rely on early parts of a story and ignore later changes, suggesting they use shortcuts instead of full perspective tracking. A reader would care because reliable mental-state reasoning is required for any system that must cooperate with or assist humans without constant clarification.

Core claim

Using the StorySim framework to generate novel, compositional stories, the authors find that most LLMs achieve higher accuracy on world-modeling tasks than on matched first- and second-order theory-of-mind tasks, reason more accurately about the beliefs of persons than of inanimate objects, and exhibit heuristic behavior that over-weights earlier events in the narrative.

What carries the argument

StorySim, a programmable story-generation framework anchored by an explicit, editable Storyboard that independently controls events, character perspectives, and object states.
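
To make the Storyboard idea concrete, here is a minimal Python sketch of how a storyboard could keep the true world state separate from what each character believes. It is not StorySim's actual interface: the class name `Storyboard`, the methods `enter`, `leave`, and `move`, and the rule that only characters present in the scene update their beliefs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Storyboard:
    """Hypothetical storyboard: true object locations plus per-character beliefs."""
    world: dict = field(default_factory=dict)     # object -> true location
    present: set = field(default_factory=set)     # characters currently in the scene
    beliefs: dict = field(default_factory=dict)   # character -> {object: believed location}
    text: list = field(default_factory=list)      # rendered story sentences

    def enter(self, who):
        self.present.add(who)
        # Assumption: a character who enters sees the current state of the scene.
        self.beliefs.setdefault(who, {}).update(self.world)
        self.text.append(f"{who} enters the room.")

    def leave(self, who):
        self.present.discard(who)
        self.text.append(f"{who} leaves the room.")

    def move(self, who, obj, place):
        self.world[obj] = place
        # Only characters who are present witness the move and update their belief.
        for character in self.present:
            self.beliefs.setdefault(character, {})[obj] = place
        self.text.append(f"{who} moves the {obj} to the {place}.")


sb = Storyboard()
sb.enter("Anna")
sb.enter("Ben")
sb.move("Anna", "key", "drawer")
sb.leave("Ben")
sb.move("Anna", "key", "box")

wm_answer = sb.world["key"]            # world-modeling question: where is the key?
tom_answer = sb.beliefs["Ben"]["key"]  # first-order ToM question: where does Ben think it is?
```

In this toy story the two answers diverge (`box` for the world state, `drawer` for Ben's belief), which is the kind of matched WM/ToM pair the framework is described as producing.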

If this is right

LLMs may give unreliable answers when user intentions or knowledge differ from the model's own information.
Performance gaps between world modeling and mental-state reasoning indicate that current training leaves perspective tracking underdeveloped.
Over-reliance on early story events shows models can be misled by narrative order rather than updating beliefs as new information arrives.
Better reasoning about persons than objects suggests training data biases that favor human-centric patterns.
The controllable storyboard allows precise isolation of which story features drive correct or incorrect ToM answers.
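
One rough way to see how far an early-event shortcut could get on a set of stories is to score a baseline that always answers with the first location mentioned. The sketch below does this on three invented toy stories; the story format (a list of `(object, location)` moves) and the function names are assumptions for illustration, not the paper's evaluation code.

```python
def first_location_heuristic(moves):
    """Shortcut baseline: answer with the earliest location, ignoring later updates."""
    return moves[0][1]


def last_location(moves):
    """Correct world-state answer when every move is applied in order."""
    return moves[-1][1]


# Hypothetical toy stories: each is a list of (object, new_location) moves in narrative order.
stories = [
    [("ball", "basket"), ("ball", "shelf")],
    [("ball", "crate")],
    [("ball", "box"), ("ball", "bag"), ("ball", "box")],
]

gold = [last_location(moves) for moves in stories]
guess = [first_location_heuristic(moves) for moves in stories]

# Fraction of items the shortcut answers correctly without tracking any updates.
shortcut_accuracy = sum(g == p for g, p in zip(gold, guess)) / len(stories)
```

If a shortcut baseline like this scores well, the items cannot distinguish genuine state tracking from the heuristic, so a benchmark would ideally keep that number low.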

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the performance gap persists on new stories, assistants may need separate modules or training objectives focused on belief tracking.
The method could be adapted to test related abilities such as recognizing deception or predicting future actions based on beliefs.
Real-world applications like collaborative agents or personalized tutors would be directly affected by any confirmed ToM shortfall.
Extending the framework to longer or branching stories might reveal whether the early-event heuristic grows worse with narrative length.

Load-bearing premise

The synthetic stories isolate genuine mental-state tracking and do not contain surface patterns or biases that models can exploit without understanding perspectives.

What would settle it

If new batches of StorySim stories produced equal accuracy on ToM and WM tasks or eliminated the person-versus-object gap, the claimed performance differences would not hold.
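
Whether two accuracies count as "equal" on a new batch is a statistical question. A minimal sketch, assuming independent items and a standard pooled two-proportion z-test (a choice made here, not a method taken from the paper), could look like this. The counts in the example are invented.

```python
import math


def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both conditions are all-correct or all-wrong: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Invented counts: 80/100 correct on WM items versus 60/100 on matched ToM items.
z_gap, p_gap = two_proportion_z(80, 100, 60, 100)

# Invented counts with no gap at all.
z_none, p_none = two_proportion_z(70, 100, 70, 100)
```

Because WM and ToM questions can be built from the same story, a paired test such as McNemar's may be the better fit in practice; the unpaired version is shown only because it is the simplest to state.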

Original abstract

We introduce StorySim, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, or rely on an LLM for generation, StorySim produces novel, compositional story prompts anchored by a highly controllable Storyboard, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of LLMs show that most models achieve higher accuracy on WM tasks than on ToM tasks, and that models tend to reason more accurately when the subject of reasoning is a person rather than an inanimate object. Additionally, our framework enabled us to find evidence of heuristic behavior and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

StorySim offers a practical new framework for generating controlled synthetic stories to test ToM versus WM in LLMs, with results showing clear performance gaps, though the isolation from heuristics needs tighter checks.

The main thing to know is that StorySim gives a new programmable way to build stories for testing ToM and WM in LLMs, and the experiments point to models struggling more with mental state tracking than with tracking physical events, plus some reliance on early story parts.

What the paper does well is the framework design. By anchoring stories to a controllable Storyboard, they generate fresh compositional prompts that manipulate character perspectives without using existing data or LLM outputs. This setup lets them run first- and second-order ToM tasks against WM baselines. The results show higher accuracy on WM than ToM, better performance when the subject is a person instead of an object, and evidence of heuristic behavior with over-reliance on earlier events. Releasing the code for data generation and evaluation is helpful for anyone wanting to reproduce or extend the work.

The softer part is the link between the performance gaps and actual ToM deficits. The stress-test note correctly flags that programmable generation could introduce regularities that models exploit instead of doing perspective-taking. Without explicit tests like ablating cues or reordering events while keeping the ToM demands the same, the findings could partly reflect surface statistics rather than deep limitations. The abstract reports the accuracy differences clearly, but fuller details on statistical significance and task construction would help confirm the claims hold up.

Readers focused on AI evaluation, social reasoning benchmarks, or LLM limitations in interactive settings will find this useful. It is the kind of empirical study that adds a practical tool and some data points to the discussion. The work shows clear thinking in how it sets up the controls and reports the patterns.

I think it deserves peer review. The novelty in the generation method and the open resources make it worth a referee's time, with the main request being more checks on whether the tasks truly require ToM.

Referee Report

2 major / 2 minor

Summary. The paper introduces StorySim, a programmable synthetic story generation framework anchored by a controllable Storyboard, to evaluate LLMs on first- and second-order theory-of-mind (ToM) tasks and matched world-modeling (WM) tasks. Experiments across multiple models show higher accuracy on WM than ToM, better performance when the reasoning subject is a person rather than an object, and heuristic patterns including over-reliance on early events in the story. All generation and evaluation code is released.

Significance. If the central results hold after addressing potential confounds, the work supplies a contamination-resistant, precisely manipulable benchmark for probing mental-state tracking in LLMs. The open code and compositional design are clear strengths that enable reproducibility and targeted follow-up experiments. The reported person-vs-object and WM-vs-ToM gaps, if robust, would usefully inform both model evaluation and training objectives aimed at social reasoning.

major comments (2)

[§3 and §4] §3 (StorySim Framework) and §4 (Task Design): The central claim that StorySim isolates ToM reasoning from surface heuristics rests on the controllability of the Storyboard, yet the manuscript provides no quantitative checks (e.g., event-order permutation that preserves surface statistics while altering ToM demands, or lexical-cue ablation) to demonstrate that models cannot solve the tasks via non-ToM shortcuts. This is load-bearing for interpreting the WM > ToM accuracy gap and the early-event bias as evidence of ToM limitations.
[§5] §5 (Results): The person > object accuracy advantage is presented as a key finding, but the paper does not report whether this difference survives controls for story length, number of entities, or lexical overlap between person and object conditions; without such checks the result could reflect surface regularities rather than differential mental-state tracking.
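
The event-order permutation check raised in the first major comment can be sketched as follows. This is an illustrative assumption about how such a control might be built, not code from the paper: the helpers `surface_stats` and `permute_events` and the example story are hypothetical, and "surface statistics" is reduced here to token counts and sentence lengths.

```python
import random
from collections import Counter


def surface_stats(sentences):
    """Bag-of-words counts and the multiset of sentence lengths for a story."""
    tokens = [tok for s in sentences for tok in s.lower().split()]
    return Counter(tokens), sorted(len(s.split()) for s in sentences)


def permute_events(sentences, seed=0):
    """Return a reordered copy of the story; the original list is left untouched."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical four-sentence story.
story = [
    "Anna puts the key in the drawer.",
    "Ben leaves the room.",
    "Anna moves the key to the box.",
    "Ben returns to the room.",
]

permuted = permute_events(story, seed=1)

# Reordering whole sentences cannot change token counts or sentence lengths.
same_surface = surface_stats(story) == surface_stats(permuted)
```

Reordering can change who witnessed which event, so the gold answer for each permuted story has to be recomputed from the underlying event sequence rather than copied from the original.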

minor comments (2)

[§2] The abstract and §2 would benefit from a concise table or bullet list explicitly contrasting StorySim with prior ToM benchmarks on the dimensions of contamination risk, controllability, and use of LLM generators.
Figure captions and axis labels should state the exact number of stories per condition and whether error bars represent standard error or 95% CI.
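
The distinction in the last comment matters because the two kinds of error bar differ in width by roughly a factor of two. A small sketch, assuming each story is scored as an independent correct/incorrect trial, computes both for one condition. The Wilson score interval is used for the 95% CI as a common choice for proportions; the paper may use something else, and the 45-of-50 count is invented.

```python
import math


def accuracy_interval(correct, n, z=1.96):
    """Return (accuracy, standard error, Wilson score interval) for n binary trials."""
    p = correct / n
    se = math.sqrt(p * (1 - p) / n)
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, se, (center - half, center + half)


# Invented example: 45 of 50 stories answered correctly in one condition.
acc, se, (ci_low, ci_high) = accuracy_interval(45, 50)
```

With these numbers a bar of plus or minus one standard error spans about 0.86 to 0.94, while the 95% interval spans about 0.79 to 0.96, so a caption that does not say which is plotted leaves the reader unable to judge overlap between conditions.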

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The suggested controls will help strengthen the interpretation of our results, and we outline below how we will incorporate them in the revision.

Referee: [§3 and §4] §3 (StorySim Framework) and §4 (Task Design): The central claim that StorySim isolates ToM reasoning from surface heuristics rests on the controllability of the Storyboard, yet the manuscript provides no quantitative checks (e.g., event-order permutation that preserves surface statistics while altering ToM demands, or lexical-cue ablation) to demonstrate that models cannot solve the tasks via non-ToM shortcuts. This is load-bearing for interpreting the WM > ToM accuracy gap and the early-event bias as evidence of ToM limitations.

Authors: We agree that explicit quantitative validation would further support the claim that StorySim isolates ToM demands. In the revised manuscript we will add two sets of controls: (1) event-order permutation experiments that preserve surface statistics (word frequencies, sentence length, entity mentions) while altering the order of mental-state events, and (2) lexical-cue ablation studies that remove or mask early-event cues. These results will be reported in an expanded §4 and a new subsection of §5, directly addressing whether the WM > ToM gap and early-event bias can be explained by non-ToM heuristics. revision: yes
Referee: [§5] §5 (Results): The person > object accuracy advantage is presented as a key finding, but the paper does not report whether this difference survives controls for story length, number of entities, or lexical overlap between person and object conditions; without such checks the result could reflect surface regularities rather than differential mental-state tracking.

Authors: We acknowledge the need for these controls. In the revision we will add matched-subset analyses and regression models that control for story length, number of entities, and lexical overlap (measured via token overlap and embedding similarity). We will report both the raw and controlled effect sizes in §5, demonstrating that the person > object advantage remains statistically significant after these adjustments. revision: yes
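
The token-overlap measure mentioned in the second response could be as simple as Jaccard similarity over lowercased word sets. The sketch below is an assumption about what such a measure might look like, not the authors' implementation; the function name `token_jaccard` and the two example prompts are hypothetical.

```python
def token_jaccard(text_a, text_b):
    """Jaccard similarity between the sets of lowercased whitespace tokens."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical matched prompts for the person and object conditions.
person_prompt = "Anna thinks the key is in the drawer"
object_prompt = "The camera shows the key is in the drawer"

overlap = token_jaccard(person_prompt, object_prompt)  # 5 shared tokens out of 9 distinct
```

A per-pair score like this could then enter a regression as a covariate alongside story length and entity count, which is the kind of control the response describes.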

Circularity Check

0 steps flagged

No circularity: empirical results from novel synthetic data

The paper introduces StorySim as a programmable generator of compositional stories and reports direct empirical accuracies on first- and second-order ToM tasks versus WM controls across multiple LLMs. Central findings (WM > ToM accuracy, person > object advantage, early-event heuristics) are measurements on freshly generated prompts rather than quantities defined in terms of fitted parameters, self-referential equations, or load-bearing self-citations. No derivation chain reduces any claimed result to its own inputs by construction; the evaluation remains externally falsifiable via the released code and story-generation rules.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the generated stories validly measure the targeted cognitive capacities; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Synthetic stories generated from a controllable storyboard can isolate theory-of-mind and world-modeling abilities without introducing exploitable surface cues.
The evaluation framework depends on this premise to interpret accuracy differences as evidence of ToM limitations rather than task artifacts.

pith-pipeline@v0.9.0 · 5698 in / 1256 out tokens · 37836 ms · 2026-05-19T07:25:01.914077+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

We introduce StorySim, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs).
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

most models achieve higher accuracy on WM tasks than on ToM tasks... over-reliance on earlier events in the story

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.