
arxiv: 2603.04161 · v2 · pith:B65GY3QCnew · submitted 2026-03-04 · 💻 cs.CL

Traces of Social Competence in Large Language Models

Pith reviewed 2026-05-21 12:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords: large language models, false belief test, theory of mind, vector steering, pre-training, social competence, instruction tuning

The pith

Large language models acquire stereotypical response patterns tied to mental-state vocabulary during pre-training that drive their behavior on false belief tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests seventeen open-weight models on 192 variants of the false belief test to measure socio-cognitive skills. It shows that performance improves with model size, though not consistently, and that adding explicit statements about what someone thinks creates a crossover in how models respond. This crossover develops during pre-training and appears to stem from common patterns around mental-state words rather than deeper understanding of minds. Vector steering is used to find a specific direction in the model that causes these responses. The result matters because it suggests that current checks for social reasoning in AI may be picking up on language habits instead of real competence.

Core claim

Models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics, and vector steering isolates a think vector as the causal driver of observed false belief test behaviour.

What carries the argument

The think vector, a direction in the model's internal space found through vector steering that causally influences responses when mental states are described.
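A minimal sketch of how such a direction can be obtained and applied, assuming a difference-of-means construction over hidden activations. This page does not state the paper's exact extraction procedure, so the function names (`steering_vector`, `steer`), the scale `alpha`, and the random toy activations are illustrative assumptions, not the authors' method.

```python
import numpy as np

def steering_vector(acts_with, acts_without):
    """Difference of mean activations, normalised to unit length.

    Both inputs have shape (n_prompts, hidden_dim), e.g. hidden states
    from prompts with and without an explicit mental-state verb.
    """
    diff = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("conditions have identical mean activations")
    return diff / norm

def steer(hidden, direction, alpha):
    """Shift every hidden state along `direction` by `alpha`."""
    return hidden + alpha * direction

# Toy data (hypothetical): the "with" condition is offset along axis 0.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 8))
offset = np.zeros(8)
offset[0] = 3.0
v = steering_vector(base + offset, base)
steered = steer(base, v, alpha=2.0)
```

In a real model the activations would come from a chosen layer and the shift would be applied during the forward pass; both choices are left open here.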

Load-bearing premise

That model scores on these tests mainly reflect social reasoning skills rather than associations learned from seeing mental-state words in training data.

What would settle it

Steering or removing the think vector and then re-testing on the false belief variants to see if the performance patterns and crossover effect change as expected.
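One generic way to "remove" a direction is projection ablation: subtract from each hidden state its component along that direction. The sketch below assumes this technique; the name `ablate_direction` and the random arrays are hypothetical stand-ins, and the page does not say which removal method, if any, the paper uses.

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Remove the component of each hidden state along `direction`.

    hidden: array of shape (n_states, hidden_dim)
    direction: array of shape (hidden_dim,), need not be unit length
    """
    unit = direction / np.linalg.norm(direction)
    # (hidden @ unit) gives one projection coefficient per row.
    return hidden - np.outer(hidden @ unit, unit)

rng = np.random.default_rng(1)
hidden = rng.normal(size=(4, 6))   # hypothetical hidden states
direction = rng.normal(size=6)     # stand-in for the think vector
ablated = ablate_direction(hidden, direction)
```

After ablation every row is orthogonal to the direction, so re-running the false belief variants on the ablated model would show whether the crossover depends on it.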

Original abstract

The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al., 2023) using Bayesian Logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X thinks) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented fine-tuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a think vector as the causal driver of observed FBT behaviour.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper tests 17 open-weight LLMs on a balanced set of 192 False Belief Test (FBT) variants from Trott et al. (2023). Using Bayesian logistic regression, it examines effects of model scale and post-training on socio-cognitive performance. Key findings include non-strict benefits from scaling, a cross-over effect in which explicit propositional attitudes ('X thinks') change response patterns, partial mitigation by instruction tuning but amplification by reasoning fine-tuning, emergence of the effect during OLMo 2 pre-training, and isolation of a 'think vector' via steering that causally drives FBT behavior, interpreted as evidence that models acquire stereotypical patterns tied to mental-state vocabulary.

Significance. If the empirical patterns and causal intervention hold, the work supplies a large-scale, controlled comparison of open models on a classic ToM probe while addressing contamination concerns. The pre-training case study and vector-steering result offer concrete evidence that vocabulary-linked statistical regularities can dominate scenario semantics, which is useful for mechanistic interpretability of social reasoning in LLMs. The Bayesian regression approach and balanced variant set are strengths that improve on prior FBT evaluations.

major comments (2)
  1. [§4] §4 (Vector Steering Analysis): The claim that the extracted 'think vector' isolates a causal driver of FBT behaviour rests on contrasting activations from prompts containing explicit mental-state verbs against matched controls. This contrast risks capturing lexical co-occurrence statistics rather than a mechanism specific to propositional attitude handling; the paper should report whether steering still modulates responses when mental-state vocabulary is held constant across conditions or when surface-form controls are strengthened.
  2. [§3.2] §3.2 (Bayesian logistic regression results): The reported cross-over effect and its modulation by post-training are central to the socio-cognitive competence interpretation, yet the manuscript does not provide the full regression specification, priors, or exact exclusion criteria for contaminated items. Without these details it is impossible to verify that the performance differences primarily reflect belief-attribution processing rather than surface associations with mental-state vocabulary.
minor comments (2)
  1. [Table 1] Table 1: The column headers for the 192 variants should explicitly state how many items fall into each propositional-attitude condition to allow readers to assess balance.
  2. [Figure 4] Figure 4 (OLMo 2 training curves): The x-axis label for training steps is missing units; please add 'steps' or 'tokens' for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important areas for clarification and strengthening of our analyses. We address each major comment below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Vector Steering Analysis): The claim that the extracted 'think vector' isolates a causal driver of FBT behaviour rests on contrasting activations from prompts containing explicit mental-state verbs against matched controls. This contrast risks capturing lexical co-occurrence statistics rather than a mechanism specific to propositional attitude handling; the paper should report whether steering still modulates responses when mental-state vocabulary is held constant across conditions or when surface-form controls are strengthened.

    Authors: We appreciate this concern about the potential for lexical confounds in the vector extraction. Our original matched controls were constructed to differ primarily in the presence of mental-state verbs while preserving scenario semantics and surface structure as closely as possible. However, we agree that additional controls would better isolate the mechanism. In the revised manuscript we will add new steering experiments that hold mental-state vocabulary constant through the use of semantically matched paraphrases (e.g., replacing 'thinks' with equivalent attitude expressions or neutral descriptions) and will report whether the steering effect persists. We will also strengthen surface-form controls by including lexical-overlap matching and n-gram frequency balancing, with results presented in an expanded §4.

    Revision: yes

  2. Referee: [§3.2] §3.2 (Bayesian logistic regression results): The reported cross-over effect and its modulation by post-training are central to the socio-cognitive competence interpretation, yet the manuscript does not provide the full regression specification, priors, or exact exclusion criteria for contaminated items. Without these details it is impossible to verify that the performance differences primarily reflect belief-attribution processing rather than surface associations with mental-state vocabulary.

    Authors: We agree that greater transparency is required for the statistical analyses. In the revised manuscript we will expand the Methods section and add a dedicated appendix that includes: (i) the complete Bayesian logistic regression specification (model formula with all main effects and interactions for model scale, post-training type, and mental-state vocabulary presence), (ii) the priors employed (weakly informative normal distributions on coefficients and half-Cauchy on variance components), and (iii) the precise exclusion criteria for contaminated items (exact token-overlap thresholds with pre-training data and the additional checks performed beyond Trott et al.). These additions will allow readers to assess whether the cross-over effect primarily indexes belief-attribution rather than surface associations.

    Revision: yes
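For orientation, a specification of the kind described in the second response could look like the sketch below. It is an assumption-laden illustration, not the authors' model: the symbols, the per-model random intercept $u_j$, the choice of which interactions to show, and the unspecified scales $\sigma_\beta$ and $\gamma$ are all hypothetical. Here $y_{ij}$ is whether model $j$ answers item $i$ correctly, $s_j$ is model scale, $t_j$ is a post-training indicator, and $m_i$ marks the presence of mental-state vocabulary.

```latex
\begin{align*}
y_{ij} &\sim \mathrm{Bernoulli}(p_{ij}) \\
\operatorname{logit}(p_{ij}) &= \beta_0 + \beta_s\, s_j + \beta_t\, t_j + \beta_m\, m_i
  + \beta_{sm}\, s_j m_i + \beta_{tm}\, t_j m_i + u_j \\
\beta_k &\sim \mathcal{N}(0, \sigma_\beta^2) \\
u_j &\sim \mathcal{N}(0, \tau^2), \qquad \tau \sim \mathrm{HalfCauchy}(\gamma)
\end{align*}
```

Under this reading, the cross-over effect would surface in the interaction terms involving $m_i$ rather than in the main effects alone.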

Circularity Check

0 steps flagged

No significant circularity in empirical LLM evaluation on FBT variants

Full rationale

The paper reports experimental results from testing 17 open-weight models on a balanced set of 192 held-out FBT variants using Bayesian logistic regression to assess effects of model size and post-training. It further analyzes the emergence of a crossover effect during OLMo 2 pre-training and applies vector steering as an intervention. No equations, derivations, or fitted parameters are shown to reduce the reported performance patterns or causal claims to quantities defined by the same inputs. The core findings rely on external benchmarks (held-out variants, training checkpoints, steering interventions) rather than self-referential definitions or self-citation chains. This is a standard empirical study with independent content.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central claims rest on the validity of the FBT variants as measures of socio-cognitive competence and on the interpretation of vector steering as isolating a causal driver; these are treated as domain assumptions rather than derived results.

free parameters (1)
  • Bayesian logistic regression coefficients
    Fitted to the 192-variant response data to quantify effects of size and training.
invented entities (1)
  • think vector (no independent evidence)
    purpose: Causal driver of the observed crossover in FBT responses to mental-state language
    Introduced via vector-steering experiments; no independent falsifiable prediction outside the current study is stated.

pith-pipeline@v0.9.0 · 5733 in / 1196 out tokens · 55964 ms · 2026-05-21T12:39:24.000782+00:00 · methodology
