Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting
Pith reviewed 2026-05-19 07:25 UTC · model grok-4.3
The pith
Most language models track physical events more accurately than other characters' mental states in synthetic stories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the StorySim framework to generate novel, compositional stories, the authors find that most LLMs achieve higher accuracy on world-modeling tasks than on matched first- and second-order theory-of-mind tasks, reason more accurately about the beliefs of persons than of inanimate objects, and exhibit heuristic behavior that over-weights earlier events in the narrative.
What carries the argument
StorySim, a programmable story-generation framework anchored by an explicit, editable Storyboard that independently controls events, character perspectives, and object states.
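To make the idea of independently controlled events, perspectives, and object states concrete, here is a minimal sketch of what a storyboard of this kind could track. It is a toy design under our own assumptions: the class `Storyboard`, its fields, and the `move` method are hypothetical names for illustration and are not StorySim's actual API. It also assumes that characters who witness an event together can see each other witnessing it.

```python
from dataclasses import dataclass, field


@dataclass
class Storyboard:
    """Toy storyboard (hypothetical, not StorySim's API)."""
    world: dict = field(default_factory=dict)    # object -> true location
    beliefs: dict = field(default_factory=dict)  # a -> {object: location}
    second: dict = field(default_factory=dict)   # (a, b) -> what a thinks b believes

    def move(self, obj, place, witnesses):
        """Move obj to place; only the witnesses update their beliefs."""
        self.world[obj] = place
        for a in witnesses:
            self.beliefs.setdefault(a, {})[obj] = place
            # Assumption: co-witnesses observe each other, so a also
            # updates its model of what each co-witness b believes.
            for b in witnesses:
                if a != b:
                    self.second.setdefault((a, b), {})[obj] = place


sb = Storyboard()
sb.move("ball", "box", witnesses=["Anna", "Ben"])
sb.move("ball", "drawer", witnesses=["Ben"])  # Anna is absent

wm_answer = sb.world["ball"]                       # where the ball really is
first_order = sb.beliefs["Anna"]["ball"]           # where Anna thinks it is
second_order = sb.second[("Ben", "Anna")]["ball"]  # where Ben thinks Anna thinks it is
```

In this toy story the world-modeling answer is the drawer, while both the first- and second-order questions about Anna resolve to the box, which is the kind of matched WM/ToM contrast the paper describes.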
If this is right
- LLMs may give unreliable answers when user intentions or knowledge differ from the model's own information.
- Performance gaps between world modeling and mental-state reasoning indicate that current training leaves perspective tracking underdeveloped.
- Over-reliance on early story events shows models can be misled by narrative order rather than updating beliefs as new information arrives.
- Better reasoning about persons than objects suggests training data biases that favor human-centric patterns.
- The controllable storyboard allows precise isolation of which story features drive correct or incorrect ToM answers.
Where Pith is reading between the lines
- If the performance gap persists on new stories, assistants may need separate modules or training objectives focused on belief tracking.
- The method could be adapted to test related abilities such as recognizing deception or predicting future actions based on beliefs.
- Real-world applications like collaborative agents or personalized tutors would be directly affected by any confirmed ToM shortfall.
- Extending the framework to longer or branching stories might reveal whether the early-event heuristic grows worse with narrative length.
Load-bearing premise
The synthetic stories isolate genuine mental-state tracking and do not contain surface patterns or biases that models can exploit without understanding perspectives.
What would settle it
If new batches of StorySim stories produced equal accuracy on ToM and WM tasks or eliminated the person-versus-object gap, the claimed performance differences would not hold.
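One way such a comparison could be checked is a two-proportion z-test on ToM and WM accuracy. The sketch below is illustrative only: the function name is ours, the counts in the usage line are made up rather than taken from the paper, and the test assumes the two sets of stories are independent samples (a paired test would be more appropriate if every WM question is matched to a ToM question on the same story).

```python
import math


def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """z statistic and two-sided p-value for accuracy_a - accuracy_b."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both conditions are all-correct or all-wrong: no detectable gap.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts for illustration: WM 80/100 correct, ToM 60/100 correct.
z, p_value = two_proportion_z(80, 100, 60, 100)
```

A gap that shrinks toward zero on fresh batches, with a p-value that no longer rejects equality, would be the outcome that undercuts the claimed difference.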
Original abstract
We introduce StorySim, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, or rely on an LLM for generation, StorySim produces novel, compositional story prompts anchored by a highly controllable Storyboard, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of LLMs show that most models achieve higher accuracy on WM tasks than on ToM tasks, and that models tend to reason more accurately when the subject of reasoning is a person rather than an inanimate object. Additionally, our framework enabled us to find evidence of heuristic behavior and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces StorySim, a programmable synthetic story generation framework anchored by a controllable Storyboard, to evaluate LLMs on first- and second-order theory-of-mind (ToM) tasks and matched world-modeling (WM) tasks. Experiments across multiple models show higher accuracy on WM than ToM, better performance when the reasoning subject is a person rather than an object, and heuristic patterns including over-reliance on early events in the story. All generation and evaluation code is released.
Significance. If the central results hold after addressing potential confounds, the work supplies a contamination-resistant, precisely manipulable benchmark for probing mental-state tracking in LLMs. The open code and compositional design are clear strengths that enable reproducibility and targeted follow-up experiments. The reported person-vs-object and WM-vs-ToM gaps, if robust, would usefully inform both model evaluation and training objectives aimed at social reasoning.
major comments (2)
- [§3 and §4] §3 (StorySim Framework) and §4 (Task Design): The central claim that StorySim isolates ToM reasoning from surface heuristics rests on the controllability of the Storyboard, yet the manuscript provides no quantitative checks (e.g., event-order permutation that preserves surface statistics while altering ToM demands, or lexical-cue ablation) to demonstrate that models cannot solve the tasks via non-ToM shortcuts. This is load-bearing for interpreting the WM > ToM accuracy gap and the early-event bias as evidence of ToM limitations.
- [§5] §5 (Results): The person > object accuracy advantage is presented as a key finding, but the paper does not report whether this difference survives controls for story length, number of entities, or lexical overlap between person and object conditions; without such checks the result could reflect surface regularities rather than differential mental-state tracking.
minor comments (2)
- [§2] The abstract and §2 would benefit from a concise table or bullet list explicitly contrasting StorySim with prior ToM benchmarks on the dimensions of contamination risk, controllability, and use of LLM generators.
- Figure captions and axis labels should state the exact number of stories per condition and whether error bars represent standard error or 95% CI.
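The distinction in the last comment matters because the two kinds of error bar differ in width by roughly a factor of two. A minimal sketch, assuming accuracy is a simple proportion of correct answers out of `n` stories; the function name is ours, and the Wilson score interval is one common choice for a 95% CI, not necessarily the one the authors use.

```python
import math


def accuracy_intervals(correct, n, z=1.96):
    """Return (accuracy, standard error, Wilson score interval)."""
    p = correct / n
    se = math.sqrt(p * (1 - p) / n)
    # Wilson score interval; z=1.96 gives approximately 95% coverage.
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, se, (centre - half, centre + half)


# Illustrative counts only: 50 correct answers out of 100 stories.
acc, se, (low, high) = accuracy_intervals(50, 100)
```

A caption that states both the number of stories per condition and which of these quantities is plotted lets readers judge whether two bars genuinely differ.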
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The suggested controls will help strengthen the interpretation of our results, and we outline below how we will incorporate them in the revision.
Point-by-point responses
- Referee: [§3 and §4] §3 (StorySim Framework) and §4 (Task Design): The central claim that StorySim isolates ToM reasoning from surface heuristics rests on the controllability of the Storyboard, yet the manuscript provides no quantitative checks (e.g., event-order permutation that preserves surface statistics while altering ToM demands, or lexical-cue ablation) to demonstrate that models cannot solve the tasks via non-ToM shortcuts. This is load-bearing for interpreting the WM > ToM accuracy gap and the early-event bias as evidence of ToM limitations.
Authors: We agree that explicit quantitative validation would further support the claim that StorySim isolates ToM demands. In the revised manuscript we will add two sets of controls: (1) event-order permutation experiments that preserve surface statistics (word frequencies, sentence length, entity mentions) while altering the order of mental-state events, and (2) lexical-cue ablation studies that remove or mask early-event cues. These results will be reported in an expanded §4 and a new subsection of §5, directly addressing whether the WM > ToM gap and early-event bias can be explained by non-ToM heuristics. revision: yes
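The first proposed control can be sketched as follows: shuffle the order of event sentences and confirm that word frequencies and sentence lengths are unchanged. This is an illustration under our own assumptions, not the authors' procedure; `surface_stats` and `permute_events` are hypothetical helper names, and whitespace tokenization stands in for whatever tokenizer the revision would use.

```python
import random
from collections import Counter


def surface_stats(sentences):
    """Bag-of-words counts and the multiset of sentence lengths."""
    tokens = [s.lower().split() for s in sentences]
    words = Counter(w for t in tokens for w in t)
    lengths = Counter(len(t) for t in tokens)
    return words, lengths


def permute_events(sentences, seed=0):
    """Return a shuffled copy of the story's event sentences."""
    rng = random.Random(seed)  # seeded for reproducible permutations
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


story = [
    "Anna puts the ball in the box.",
    "Anna leaves the room.",
    "Ben moves the ball to the drawer.",
]
permuted = permute_events(story, seed=1)
same_surface = surface_stats(story) == surface_stats(permuted)
```

Because reordering changes who witnessed what, the gold answers for each permuted story would have to be recomputed from the story's underlying state rather than copied from the original.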
- Referee: [§5] §5 (Results): The person > object accuracy advantage is presented as a key finding, but the paper does not report whether this difference survives controls for story length, number of entities, or lexical overlap between person and object conditions; without such checks the result could reflect surface regularities rather than differential mental-state tracking.
Authors: We acknowledge the need for these controls. In the revision we will add matched-subset analyses and regression models that control for story length, number of entities, and lexical overlap (measured via token overlap and embedding similarity). We will report both the raw and controlled effect sizes in §5, so that readers can see whether the person > object advantage remains statistically significant after these adjustments. revision: yes
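A matched-subset analysis of the kind proposed here could look roughly like the sketch below. It is a simplified stand-in, not the authors' analysis: `token_overlap` uses Jaccard overlap on whitespace tokens as one possible overlap measure, `matched_gap` compares person and object accuracy only within strata that share story length and entity count, and the record layout is our own assumption.

```python
from collections import defaultdict


def token_overlap(a, b):
    """Jaccard overlap between the token sets of two stories."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def matched_gap(records):
    """Mean person-minus-object accuracy gap within matched strata.

    records: iterable of (condition, n_sentences, n_entities, correct),
    with condition in {"person", "object"} and correct in {0, 1}.
    Strata that lack either condition are skipped.
    """
    strata = defaultdict(lambda: {"person": [], "object": []})
    for cond, length, entities, correct in records:
        strata[(length, entities)][cond].append(correct)
    gaps = []
    for cell in strata.values():
        if cell["person"] and cell["object"]:
            p = sum(cell["person"]) / len(cell["person"])
            o = sum(cell["object"]) / len(cell["object"])
            gaps.append(p - o)
    return sum(gaps) / len(gaps) if gaps else None
```

If the gap computed this way stays close to the raw gap, story length and entity count are unlikely to explain the person > object advantage; a regression with these variables as covariates would be the fuller version of the same check.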
Circularity Check
No circularity: empirical results from novel synthetic data
Full rationale
The paper introduces StorySim as a programmable generator of compositional stories and reports direct empirical accuracies on first- and second-order ToM tasks versus WM controls across multiple LLMs. Central findings (WM > ToM accuracy, person > object advantage, early-event heuristics) are measurements on freshly generated prompts rather than quantities defined in terms of fitted parameters, self-referential equations, or load-bearing self-citations. No derivation chain reduces any claimed result to its own inputs by construction; the evaluation remains externally falsifiable via the released code and story-generation rules.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Synthetic stories generated from a controllable storyboard can isolate theory-of-mind and world-modeling abilities without introducing exploitable surface cues.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: We introduce StorySim, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs).
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: most models achieve higher accuracy on WM tasks than on ToM tasks... over-reliance on earlier events in the story
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.