arxiv: 2602.10298 · v2 · submitted 2026-02-10 · 💻 cs.CL

Recognition: no theorem link

On Emergent Social World Models -- Evidence for Functional Integration of Theory of Mind and Pragmatic Reasoning in Language Models

Polina Tsvilodub , Jan-Felix Klumpp , Amir Mohammadpour , Jennifer Hu , Michael Franke

Authors on Pith no claims yet

Pith reviewed 2026-05-16 05:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords theory of mindpragmatic reasoninglanguage modelsfunctional integrationsocial world modelsemergent abilitiescomputational mechanisms

0 comments

The pith

Language models develop shared mechanisms for theory of mind and pragmatic reasoning rather than isolated skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether language models build unified representations of mental states that serve both general theory of mind reasoning and language-specific pragmatic tasks. It applies behavioral tests across seven theory of mind subcategories and causal localization experiments adapted from neuroscience methods on an expanded dataset. Stringent statistical tests yield suggestive support for functional integration, pointing to interconnected social world models instead of separate competencies. A sympathetic reader would care because this bears on whether models acquire general-purpose social cognition from language data alone or merely learn narrow task solutions.

Core claim

Using functional localization techniques inspired by cognitive neuroscience, the study finds suggestive evidence that language models recruit overlapping computational mechanisms for multiple theory of mind abilities and pragmatic reasoning, consistent with the emergence of integrated social world models rather than isolated competencies.

What carries the argument

Functional localization methods that identify shared mechanisms by testing whether localizing one social task affects performance on related tasks across a large set of theory of mind subcategories.

Load-bearing premise

The localization techniques accurately detect shared mechanisms in language models without being confounded by task-specific features or training artifacts.

What would settle it

An experiment in which disrupting the localized region for one theory of mind subcategory leaves performance on pragmatic reasoning and other theory of mind subcategories unchanged would falsify the integration claim.

Figures

Figures reproduced from arXiv: 2602.10298 by Amir Mohammadpour, Jan-Felix Klumpp, Jennifer Hu, Michael Franke, Polina Tsvilodub.

**Figure 2.** Figure 2: Percentage of localized units across model lay [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: All: Plots show the mean causal effects (y-axis), i.e., difference in accuracy relative to the non-ablated baseline when ablating the critical (ToM) subnetworks or the control (least active) sub-networks (x-axis), for different test sets (colors and shapes). 95% CIs indicate change in by-dataset accuracy across localizer suites. Top row: Fictitious, stylized result that could come from our experiments (lef… view at source ↗

**Figure 4.** Figure 4: Average accuracy (y-axis) for the domains [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

**Figure 5.** Figure 5: Detailed overview of the accuracy (y-axis) results of all models (x-axis) on the first eight pragmatic [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: Detailed overview of the accuracy (y-axis) results of all models (x-axis) on the remaining eight pragmatic [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: Detailed overview of the accuracy (y-axis) results of all models (x-axis) on ten of the ToM datasets [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Detailed overview of the accuracy (y-axis) results of all models (x-axis) on the remaining twelve ToM [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗

**Figure 9.** Figure 9: Statistical comparison of models with different combinations of up to seven ATOMS predictors with [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗

**Figure 10.** Figure 10: PCA with two components of sentence embeddings (from SentenceTransformer’s all-MiniLM-L6-v2 [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗

**Figure 11.** Figure 11: PCA with two components of sentence embeddings (from SentenceTransformer’s all-MiniLM-L6-v2 [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗

**Figure 12.** Figure 12: Distribution of top 1% of most active units across the model layers, identified with respect to the [PITH_FULL_IMAGE:figures/full_fig_p031_12.png] view at source ↗

**Figure 13.** Figure 13: Distribution of top 1% of least active units across the model layers, identified with respect to the [PITH_FULL_IMAGE:figures/full_fig_p031_13.png] view at source ↗

**Figure 14.** Figure 14: Average difference in accuracy, by localizer, across models. [PITH_FULL_IMAGE:figures/full_fig_p032_14.png] view at source ↗

**Figure 15.** Figure 15: Average change in accuracy (y-axis) relative to the intact model when applying ablations of the target or [PITH_FULL_IMAGE:figures/full_fig_p033_15.png] view at source ↗

**Figure 16.** Figure 16: Average change in accuracy across evaluation datasets (y-axis) when different localizer datasets are used [PITH_FULL_IMAGE:figures/full_fig_p036_16.png] view at source ↗

**Figure 17.** Figure 17: Average change in accuracy across evaluation datasets (y-axis) when different localizer datasets are used [PITH_FULL_IMAGE:figures/full_fig_p037_17.png] view at source ↗

**Figure 18.** Figure 18: Average difference in accuracy under ablations of the localized subnetworks, by localizer, across 13 [PITH_FULL_IMAGE:figures/full_fig_p038_18.png] view at source ↗

**Figure 19.** Figure 19: Average difference in accuracy under ablations of the localized subnetworks, for each localizer and [PITH_FULL_IMAGE:figures/full_fig_p038_19.png] view at source ↗

**Figure 20.** Figure 20: Average difference in accuracy under ablations of the localized subnetworks, for each localizer and [PITH_FULL_IMAGE:figures/full_fig_p039_20.png] view at source ↗

read the original abstract

This paper investigates whether LMs recruit shared computational mechanisms for general Theory of Mind (ToM) and language-specific pragmatic reasoning in order to contribute to the general question of whether LMs may be said to have emergent "social world models", i.e., representations of mental states that are repurposed across tasks (the functional integration hypothesis). Using behavioral evaluations and causal-mechanistic experiments via functional localization methods inspired by cognitive neuroscience, we analyze LMs' performance across seven subcategories of ToM abilities (Beaudoin et al., 2020) on a substantially larger localizer dataset than used in prior like-minded work. Results from stringent hypothesis-driven statistical testing offer suggestive evidence for the functional integration hypothesis, indicating that LMs may develop interconnected "social world models" rather than isolated competencies. This work contributes novel ToM localizer data, methodological refinements to functional localization techniques, and empirical insights into the emergence of social cognition in artificial systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a bigger ToM localizer dataset and runs causal tests for shared mechanisms with pragmatics, but the evidence stays suggestive and vulnerable to surface-feature confounds.

read the letter

This paper supplies a larger localizer dataset for seven ToM subcategories and tests whether language models show functional integration between general theory of mind and pragmatic reasoning. The central finding is suggestive statistical support for shared mechanisms rather than isolated skills, framed as evidence for emergent social world models. That is the main thing to take away: new data plus an attempt at mechanistic localization inspired by neuroscience methods. The scale of the new data is a straightforward improvement over earlier behavioral probes, and the hypothesis-driven tests give a clearer target than pure performance metrics. The work also refines the localization technique itself, which could be reusable. Those are the concrete advances. The soft spot is the interpretation of the localization results. In transformers, overlap in activations or intervention effects can easily trace to correlated lexical or syntactic patterns across tasks instead of any repurposed social representation. The abstract does not describe controls that hold surface statistics constant while varying social content, so the integration claim rests on evidence that may not yet rule out simpler explanations. The authors themselves call the results suggestive, which aligns with the current strength of the support. This is useful for researchers who track ToM benchmarks in models or who want to compare AI performance against cognitive neuroscience tools. Readers looking for fresh empirical material on social reasoning in LMs will get value from the dataset and the experimental setup. The paper is coherent on its own terms and engages the relevant literature without obvious internal contradictions, so it deserves a serious referee. I would send it to peer review; the new data and the mechanistic angle are worth external scrutiny even if the controls need tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper claims that language models develop emergent interconnected 'social world models' by functionally integrating Theory of Mind (ToM) abilities across seven subcategories with pragmatic reasoning, supported by behavioral evaluations and causal-mechanistic experiments using functional localization methods adapted from cognitive neuroscience on a substantially larger localizer dataset than prior work, with results from stringent hypothesis-driven statistical testing providing suggestive evidence against isolated competencies.

Significance. If the central results hold after addressing methodological controls, the work would offer empirical support for repurposed mental-state representations in LMs, contribute novel ToM localizer data, and refine functional localization techniques for mechanistic interpretability in AI systems. The hypothesis-driven approach and focus on functional integration distinguish it from purely behavioral studies and could inform broader questions about emergent social cognition in transformers.

major comments (2)

[§3] §3 (Methods, Functional Localization): The adaptation of neuroscience-style localizer tasks does not describe explicit controls that match non-social content while preserving surface statistics such as lexical items, syntactic frames, or training-data co-occurrence patterns between ToM and pragmatic sub-tasks. Without these, activation overlap or intervention effects could arise from general language processing rather than causally shared social mechanisms, directly undermining the functional integration claim.
[§4] §4 (Results, Statistical Analysis): The manuscript reports 'stringent hypothesis-driven statistical testing' but provides insufficient detail on exact tests, effect sizes, multiple-comparison corrections across the seven ToM subcategories, or power analysis for the larger localizer dataset. This makes it impossible to evaluate whether the suggestive evidence for integration is robust or potentially inflated by task-specific features.

minor comments (2)

[Abstract] Abstract: The description of the localizer dataset size and model architectures tested could be more quantitative to allow readers to immediately gauge scale relative to prior work.
[References] References: The citation to Beaudoin et al. (2020) for the seven ToM subcategories should include the full reference details and any updates or extensions used in the current dataset construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional methodological details and statistical reporting, which we believe strengthens the presentation of our evidence for functional integration.

read point-by-point responses

Referee: [§3] §3 (Methods, Functional Localization): The adaptation of neuroscience-style localizer tasks does not describe explicit controls that match non-social content while preserving surface statistics such as lexical items, syntactic frames, or training-data co-occurrence patterns between ToM and pragmatic sub-tasks. Without these, activation overlap or intervention effects could arise from general language processing rather than causally shared social mechanisms, directly undermining the functional integration claim.

Authors: We appreciate the referee's emphasis on rigorous controls to isolate social mechanisms from general language processing. The localizer tasks were adapted from established cognitive neuroscience paradigms (e.g., false-belief vs. false-photograph conditions) that already incorporate non-social baselines, and our larger dataset extends these. However, we acknowledge that explicit matching for surface statistics such as lexical items, syntactic frames, and training-data co-occurrence was not described in sufficient detail. In the revised manuscript, we will add a new subsection in Methods outlining additional control analyses that match these features across ToM and pragmatic tasks where feasible, along with an assessment of token overlap. We will also explicitly discuss the inherent limitations of controlling pretraining co-occurrence statistics post hoc. These changes will better substantiate the causal claims for functional integration. revision: yes
Referee: [§4] §4 (Results, Statistical Analysis): The manuscript reports 'stringent hypothesis-driven statistical testing' but provides insufficient detail on exact tests, effect sizes, multiple-comparison corrections across the seven ToM subcategories, or power analysis for the larger localizer dataset. This makes it impossible to evaluate whether the suggestive evidence for integration is robust or potentially inflated by task-specific features.

Authors: We agree that greater transparency in the statistical methods is essential for evaluating the robustness of our findings. The revised manuscript will expand the Statistical Analysis section to specify the exact tests employed (including paired t-tests, repeated-measures ANOVA, and any non-parametric alternatives), report effect sizes (e.g., Cohen's d), detail the multiple-comparison corrections applied (Bonferroni adjustment across the seven ToM subcategories), and include a power analysis for the expanded localizer dataset. Full statistical outputs and supplementary tables will also be provided. These additions will enable readers to more accurately assess whether the evidence supports functional integration or may be influenced by task-specific factors. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical study relies on new data and external statistical tests

full rationale

The paper is an empirical behavioral and mechanistic investigation that collects new localizer data, applies functional localization methods adapted from neuroscience, and performs hypothesis-driven statistical testing on model performance across ToM and pragmatic tasks. No mathematical derivations, equations, or parameter-fitting steps are present that reduce predictions to inputs by construction. Claims rest on external data collection and statistical controls rather than self-definitional loops, self-citation chains, or renamed known results. The analysis is self-contained against its own experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical behavioral data and statistical tests applied to existing language models; no free parameters are fitted to support the integration claim, and no new entities are postulated.

axioms (1)

domain assumption Standard assumptions of hypothesis-driven statistical testing and functional localization methods from cognitive neuroscience
The analysis invokes established statistical practices for identifying shared mechanisms without detailing deviations or robustness checks in the abstract.

pith-pipeline@v0.9.0 · 5482 in / 1095 out tokens · 28869 ms · 2026-05-16T05:10:37.204423+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A paradox of AI fluency
cs.CL 2026-04 unverdicted novelty 6.0

Fluent AI users adopt an active, iterative collaboration mode that produces more visible failures but better recovery and success on hard tasks, whereas novices experience more invisible failures from passive use.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper

[1]

theory of mind

Does the autistic child have a “theory of mind”?Cognition, 21(1):37–46. Cindy Beaudoin, Élizabel Leblanc, Charlotte Gagner, and Miriam H Beauchamp. 2020. Systematic review and inventory of theory of mind measures for young children.Frontiers in psychology, 10:2905. Emily M. Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell. 2021. O...

work page arXiv 2020
[2]

arXiv preprint arXiv:2410.13648

Simpletom: Exposing the gap between explicit tom inference and implicit tom application in llms. arXiv preprint arXiv:2410.13648. Michael Hanna, Yonatan Belinkov, and Sandro Pezzelle

work page arXiv
[3]

11 Miriam Hauptman, Idan Blank, and Evelina Fedorenko

Are formal and functional linguistic mecha- nisms dissociated in language models?Computa- tional Linguistics, pages 1–40. 11 Miriam Hauptman, Idan Blank, and Evelina Fedorenko

work page
[4]

Cortex, 162:96–114

Non-literal language processing is jointly sup- ported by the language and theory of mind networks: evidence from a novel meta-analytic fmri approach. Cortex, 162:96–114. Stefan Heimersheim and Neel Nanda. 2024. How to use and interpret activation patching.arXiv preprint arXiv:2404.15255. Jennifer Hu, Sammy Floyd, Olessia Jouravlev, Evelina Fedorenko, and...

work page arXiv 2024
[5]

Najoung Kim and Sebastian Schuster

Comparing humans and large language models on an experimental protocol inventory for theory of mind evaluation (EPITOME).Transactions of the Association for Computational Linguistics, 12:803– 819. Najoung Kim and Sebastian Schuster. 2023. Entity tracking in language models. InProceedings of the 61st Annual Meeting of the Association for Compu- tational Li...

work page 2023
[6]

Revisiting the evaluation of theory of mind through question answering. InProceedings of the 2019 Conference on Empirical Methods in Natu- ral Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5872–5877, Hong Kong, China. Association for Computational Linguistics. Stephen C. Levinson. 1983....

work page 2019
[7]

In Proceedings of ICLR 11

Emergent world representations: Exploring a sequence model trained on a synthetic task. In Proceedings of ICLR 11. Bolei Ma, Yuting Li, Wei Zhou, Ziwei Gong, Yang Janet Liu, Katja Jasinskaja, Annemarie Friedrich, Julia Hirschberg, Frauke Kreuter, and Barbara Plank. 2025. Pragmatics in the era of large language models: A survey on datasets, evaluation, opp...

work page 2025
[8]

InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 1011–1031, Singapore

Towards a holistic landscape of situated theory of mind in large language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 1011–1031, Singapore. Association for Computational Linguistics. Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. 2024. Dissociating lang...

work page arXiv 2023
[9]

InProceedings of the 17th Annual Meeting of the Special Interest Group on Dis- course and Dialogue, pages 31–41, Los Angeles

Creating and characterizing a diverse corpus of sarcasm in dialogue. InProceedings of the 17th Annual Meeting of the Special Interest Group on Dis- course and Dialogue, pages 31–41, Los Angeles. Association for Computational Linguistics. Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of...

work page arXiv 2019
[10]

theory of mind

Divide and conquer: A defense of functional localizers.NeuroImage, 30(4):1088–1096. Rebecca Saxe and Nancy Kanwisher. 2003. People thinking about thinking people: The role of the temporo-parietal junction in “theory of mind”.Neu- roImage, 19(4):1835–1842. Natalie Shapira, Guy Zwirn, and Yoav Goldberg. 2023. How well do large language models perform on fau...

work page 2003
[11]

InFind- ings of the Association for Computational Linguistics: ACL 2024, pages 12075–12097, Bangkok, Thailand

PUB: A pragmatics understanding benchmark for assessing LLMs’ pragmatics capabilities. InFind- ings of the Association for Computational Linguistics: ACL 2024, pages 12075–12097, Bangkok, Thailand. Association for Computational Linguistics. Theodore Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas Griffiths. 2023. Cognitive architectures for language ag...

work page arXiv 2024
[12]

non-literal communication

Thedesirescategory encompasses tasks that require ascribing desires to actors. 4)Emotions refers to the ability to predict other people’s emo- tions from events, and to understand that other peo- ple may hide or regulate their emotions such that they are not directly accessible. 5) Theknowledge category concerns being aware of different people having diff...

work page 2020
[13]

reality” and “mem- ory

multi-hop tasks, which are modifications of the false-belief tasks aiming to address the understand- ing of social norms, and 4) an attitude task that concerns the attitude of an observer towards the mover’s action. SimpleToM(Gu et al., 2024) is a dataset de- signed to test LLMs’ abilities to infer mental states implicitly. This is tested using three type...

work page 2024
[14]

The hy- pothesis contradicts the premise

is a commonsense psychology dataset mea- suring the ability to infer mental or emotional causes and social relations from actions. The agents are represented as geometric shapes (e.g., triangles), which perform some action. A following binary choice question then offers two motivations for the observed behavior, and the task is to identify the more plausi...

work page 2024
[15]

indirectness classification (i.e., distinguishing direct from indirect responses), 2) agreement detec- tion (i.e., assessing whether two speakers agree),

work page
[16]

Speaker 2

understanding sarcasm, 4) deictic question an- swering (i.e., correctly inferring whether an object is in some location based on deictic expressions in a dialogue). All these tasks are phrased such that there are exactly two answer options. For agree- ment detection, we modify the answer prefix to be “Speaker 2 ”, and the answer options to be “agrees” and...

work page 2016
[17]

LatentBeliefs: to localize the ability to recog- nize latent mental action causes, we generate synthetic stimuli based on 6 items from Saxe and Kanwisher (2003) where a person acts out of a false belief (henceforth called the FalseBelief condition) and on 6 items from the same paper where a person’s desires do not match with the actual course of events (t...

work page 2003
[18]

from Bosco et al

CommunicativeIntent: to localize the ability to recognize latent mental states in communi- cation, we use synthetic stimuli based on data 8We only use a subset of 9 of the original mechanical in- ference stories where the actual hidden cause is unambiguous. from Bosco et al. (2017). Following Bosco et al. (2017), we contrast a Deceptive condi- tion, where...

work page 2017
[19]

GameBeliefs:to localize the ability to reason about the latent causes for agent behavior in the specific setting of strategic games, we use synthetic stimuli based on data from Chang et al. (2023). We generate synthetic stimuli for a GameBelief condition, where reasoning about the motivation for a specific decision in a trust game or ultimatum game is nec...

work page 2023
[20]

An- swer: {answer prefix}

MoralIntent: to localize the ability to ascribe moral intent to actors, we generate synthetic stimuli based on 12 stimuli from Young et al. (2007), where a person chooses to perform an action despite being aware of its harmful con- sequences for another actor (MoralIntent). We base our synthetic control stimuli on manually created paired control items, wh...

work page 2007
[21]

Description:

and SNLI (Bowman et al., 2015), and on the following selected ToM and pragmatics datasets on which the intact models showed an above-chance performance (described in detail above in Sec- tion A.2): • Llama-3.1-8B: BigToM (forward), EmoBench (emotion application), EmoBench (understanding emotion), EPITOME (recur- sive mindreading), EWoK (agent properties),...

work page 2015