Traces of Social Competence in Large Language Models
Pith reviewed 2026-05-21 12:39 UTC · model grok-4.3
The pith
Large language models acquire stereotypical response patterns tied to mental-state vocabulary during pre-training that drive their behavior on false belief tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics, and vector steering isolates a think vector as the causal driver of observed false belief test behaviour.
What carries the argument
The think vector, a direction in the model's internal space found through vector steering that causally influences responses when mental states are described.
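To make the idea of a steering direction concrete, here is a minimal sketch of one common way to obtain such a vector: take the difference between mean activations on prompts that mention a mental-state verb and on matched controls. The summary does not say exactly how the paper extracts its think vector, so the difference-of-means recipe, the function names `steering_vector` and `steer`, and the toy Gaussian activations are all assumptions made for illustration.

```python
import numpy as np

def steering_vector(acts_with, acts_without):
    """Difference of mean activations between two prompt sets (each n x d)."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)

def steer(hidden, vector, alpha):
    """Add alpha times the unit-norm direction to every hidden state."""
    unit = vector / np.linalg.norm(vector)
    return hidden + alpha * unit

# Toy stand-in for real model activations (hypothetical data):
# the "with" set is shifted along one known axis.
rng = np.random.default_rng(0)
d = 8
true_dir = np.zeros(d)
true_dir[0] = 1.0
acts_without = rng.normal(size=(200, d))
acts_with = rng.normal(size=(200, d)) + 3.0 * true_dir

v = steering_vector(acts_with, acts_without)
# Cosine similarity between the recovered and the planted direction.
cos = v @ true_dir / np.linalg.norm(v)

# Steering zero vectors moves them exactly alpha along the unit direction.
steered = steer(np.zeros((2, d)), v, alpha=2.0)
```

In a real experiment the activations would come from a chosen layer of the model, and `alpha` would be swept to see how responses change.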
Load-bearing premise
That the models' scores on these tests mainly reflect social reasoning skills rather than associations learned from exposure to mental-state words in the training data.
What would settle it
Steering along or removing the think vector, then re-testing on the false belief variants to see whether the performance patterns and the cross-over effect change as expected.
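The "removing" half of this test can be sketched as directional ablation: project each hidden state onto the orthogonal complement of the direction, so that no component along it remains. This is a generic linear-algebra illustration, not the paper's procedure; `ablate_direction`, the random hidden states, and the example vector are hypothetical.

```python
import numpy as np

def ablate_direction(hidden, vector):
    """Remove the component of each row of `hidden` along `vector`."""
    unit = vector / np.linalg.norm(vector)
    # (hidden @ unit) gives one coefficient per row; the outer product
    # rebuilds the component to subtract.
    return hidden - np.outer(hidden @ unit, unit)

rng = np.random.default_rng(1)
hidden = rng.normal(size=(5, 4))            # hypothetical hidden states
think = np.array([1.0, 2.0, 0.0, -1.0])     # hypothetical direction

ablated = ablate_direction(hidden, think)
residual = ablated @ think                  # component left along the direction
```

After ablation the model would be re-run on the same false belief variants and its responses compared with the unmodified run.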
Original abstract
The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al., 2023) using Bayesian Logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X thinks) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented fine-tuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a think vector as the causal driver of observed FBT behaviour.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper tests 17 open-weight LLMs on a balanced set of 192 False Belief Test (FBT) variants from Trott et al. (2023). Using Bayesian logistic regression, it examines effects of model scale and post-training on socio-cognitive performance. Key findings include non-strict benefits from scaling, a cross-over effect in which explicit propositional attitudes ('X thinks') change response patterns, partial mitigation by instruction tuning but amplification by reasoning fine-tuning, emergence of the effect during OLMo 2 pre-training, and isolation of a 'think vector' via steering that causally drives FBT behavior, interpreted as evidence that models acquire stereotypical patterns tied to mental-state vocabulary.
Significance. If the empirical patterns and causal intervention hold, the work supplies a large-scale, controlled comparison of open models on a classic ToM probe while addressing contamination concerns. The pre-training case study and vector-steering result offer concrete evidence that vocabulary-linked statistical regularities can dominate scenario semantics, which is useful for mechanistic interpretability of social reasoning in LLMs. The Bayesian regression approach and balanced variant set are strengths that improve on prior FBT evaluations.
major comments (2)
- [§4] §4 (Vector Steering Analysis): The claim that the extracted 'think vector' isolates a causal driver of FBT behaviour rests on contrasting activations from prompts containing explicit mental-state verbs against matched controls. This contrast risks capturing lexical co-occurrence statistics rather than a mechanism specific to propositional attitude handling; the paper should report whether steering still modulates responses when mental-state vocabulary is held constant across conditions or when surface-form controls are strengthened.
- [§3.2] §3.2 (Bayesian logistic regression results): The reported cross-over effect and its modulation by post-training are central to the socio-cognitive competence interpretation, yet the manuscript does not provide the full regression specification, priors, or exact exclusion criteria for contaminated items. Without these details it is impossible to verify that the performance differences primarily reflect belief-attribution processing rather than surface associations with mental-state vocabulary.
minor comments (2)
- [Table 1] Table 1: The column headers for the 192 variants should explicitly state how many items fall into each propositional-attitude condition to allow readers to assess balance.
- [Figure 4] Figure 4 (OLMo 2 training curves): The x-axis label for training steps is missing units; please add 'steps' or 'tokens' for clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has identified important areas for clarification and strengthening of our analyses. We address each major comment below and outline the revisions we will make to the manuscript.
- Referee: [§4] §4 (Vector Steering Analysis): The claim that the extracted 'think vector' isolates a causal driver of FBT behaviour rests on contrasting activations from prompts containing explicit mental-state verbs against matched controls. This contrast risks capturing lexical co-occurrence statistics rather than a mechanism specific to propositional attitude handling; the paper should report whether steering still modulates responses when mental-state vocabulary is held constant across conditions or when surface-form controls are strengthened.
Authors: We appreciate this concern about the potential for lexical confounds in the vector extraction. Our original matched controls were constructed to differ primarily in the presence of mental-state verbs while preserving scenario semantics and surface structure as closely as possible. However, we agree that additional controls would better isolate the mechanism. In the revised manuscript we will add new steering experiments that hold mental-state vocabulary constant through the use of semantically matched paraphrases (e.g., replacing 'thinks' with equivalent attitude expressions or neutral descriptions) and will report whether the steering effect persists. We will also strengthen surface-form controls by including lexical-overlap matching and n-gram frequency balancing, with results presented in an expanded §4. revision: yes
- Referee: [§3.2] §3.2 (Bayesian logistic regression results): The reported cross-over effect and its modulation by post-training are central to the socio-cognitive competence interpretation, yet the manuscript does not provide the full regression specification, priors, or exact exclusion criteria for contaminated items. Without these details it is impossible to verify that the performance differences primarily reflect belief-attribution processing rather than surface associations with mental-state vocabulary.
Authors: We agree that greater transparency is required for the statistical analyses. In the revised manuscript we will expand the Methods section and add a dedicated appendix that includes: (i) the complete Bayesian logistic regression specification (model formula with all main effects and interactions for model scale, post-training type, and mental-state vocabulary presence), (ii) the priors employed (weakly informative normal distributions on coefficients and half-Cauchy on variance components), and (iii) the precise exclusion criteria for contaminated items (exact token-overlap thresholds with pre-training data and the additional checks performed beyond Trott et al.). These additions will allow readers to assess whether the cross-over effect primarily indexes belief-attribution rather than surface associations. revision: yes
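As an illustration of what such a specification could look like, the sketch below writes down a logistic regression with main effects for model scale, post-training, and mental-state vocabulary, plus their interactions with vocabulary, under independent normal priors. It is a simplified stand-in under stated assumptions: the prior scale of 2.5, the simulated data, and the coefficient values are invented; the half-Cauchy variance components mentioned by the authors are omitted; and it computes a MAP point estimate by gradient ascent rather than sampling a full posterior.

```python
import numpy as np

def design_matrix(log_size, tuned, think):
    """Columns: intercept, scale, post-training, vocabulary, two interactions."""
    return np.column_stack([
        np.ones_like(log_size), log_size, tuned, think,
        log_size * think, tuned * think,
    ])

def log_posterior(beta, X, y, prior_sd=2.5):
    """Bernoulli log-likelihood plus Normal(0, prior_sd) log-priors (up to a constant)."""
    logits = X @ beta
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    log_prior = -0.5 * np.sum((beta / prior_sd) ** 2)
    return log_lik + log_prior

def map_estimate(X, y, prior_sd=2.5, lr=0.5, steps=5000):
    """Gradient ascent on the log-posterior (a point estimate, not a posterior sample)."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p) - beta / prior_sd**2
        beta += lr * grad / len(y)
    return beta

# Simulated data (hypothetical): scale helps without "thinks" wording,
# and the slope flips sign with it, mimicking a cross-over pattern.
rng = np.random.default_rng(2)
n = 4000
log_size = rng.normal(size=n)
tuned = rng.integers(0, 2, size=n).astype(float)
think = rng.integers(0, 2, size=n).astype(float)
X = design_matrix(log_size, tuned, think)
true_beta = np.array([0.2, 0.8, 0.3, -0.5, -1.5, 0.6])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_hat = map_estimate(X, y)
```

In this sketch, a cross-over of the kind described in the report would appear as a scale-by-vocabulary coefficient (`beta_hat[4]`) that is negative and larger in magnitude than the main scale effect.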
Circularity Check
No significant circularity in empirical LLM evaluation on FBT variants
The paper reports experimental results from testing 17 open-weight models on a balanced set of 192 held-out FBT variants using Bayesian logistic regression to assess effects of model size and post-training. It further analyzes the emergence of a crossover effect during OLMo 2 pre-training and applies vector steering as an intervention. No equations, derivations, or fitted parameters are shown to reduce the reported performance patterns or causal claims to quantities defined by the same inputs. The core findings rely on external benchmarks (held-out variants, training checkpoints, steering interventions) rather than self-referential definitions or self-citation chains. This is a standard empirical study with independent content.
Axiom & Free-Parameter Ledger
free parameters (1)
- Bayesian logistic regression coefficients
invented entities (1)
- think vector: no independent evidence