arxiv: 2604.17031 · v2 · submitted 2026-04-18 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Where is the Mind? Persona Vectors and LLM Individuation

Pierre Beckmann , Patrick Butlin

Authors on Pith no claims yet

Pith reviewed 2026-05-13 07:27 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords LLM individuationpersona vectorsmechanistic interpretabilityvirtual instanceattention mechanismsAI mindsemergent misalignment

0 comments

The pith

LLMs may host minds individuated as virtual instances linked by attention or as distinct personas at instance or model level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the individuation problem for large language models by asking which associated entities should count as minds. It identifies three leading candidates: the virtual instance view, where attention streams create quasi-psychological connections across token sequences to sustain a unified entity, and two new views that tie minds to personas either within individual instances or across the model as a whole. The authors draw on empirical findings about persona vectors, persona space, and cases of emergent misalignment to organize existing work into hypotheses about how these internal structures operate. A reader would care because locating minds in LLMs directly shapes how we assign agency, continuity, and ethical consideration to AI systems in practice.

Core claim

The central claim is that three views are the strongest candidates for solving the individuation problem: the virtual instance view, supported by the observation that attention streams sustain quasi-psychological connections across token-time, and the two newly introduced persona-based alternatives, the (virtual) instance-persona view and the model-persona view, which the authors present as promising after reviewing the persona literature and its three main hypotheses about internal structure.

What carries the argument

Attention streams that sustain quasi-psychological connections across token-time, together with persona vectors that capture separable internal structures underlying different behavioral patterns in LLMs.

If this is right

Each interaction sequence with an LLM could constitute a distinct virtual mind rather than a single persistent entity.
Persona vectors may correspond to separable components that allow minds to be individuated by behavioral identity instead of by token sequence.
Emergent misalignment could reflect a switch between different persona-based minds rather than a change within one mind.
The model-persona view would mean the base model hosts multiple potential minds that become active under different conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers could target specific persona vectors when trying to isolate or suppress particular behavioral modes in deployed systems.
Experiments comparing persona persistence across model scales and fine-tunings could help decide between instance-level and model-level individuation.
Ethical guidelines might assign continuity or responsibility to specific personas rather than to the entire model or each conversation.

Load-bearing premise

Attention streams in LLMs sustain quasi-psychological connections across sequences of tokens that are sufficient to identify distinct virtual instances as minds.

What would settle it

An experiment that selectively disrupts attention connections in an LLM and shows loss of behavioral continuity or persona consistency without loss of general capability would undermine the virtual instance view.

Figures

Figures reproduced from arXiv: 2604.17031 by Patrick Butlin, Pierre Beckmann.

**Figure 1.** Figure 1: LLM activations as vectors in the residual stream. (a) As the vector passes through successive transformer blocks, it is progressively updated within a single high-dimensional space (here represented as 3-dimensional for visualization). (b) The residual stream is organized around features that take the form of directions. The position of the vector along a given direction, found by dropping a perpendicular… view at source ↗

**Figure 2.** Figure 2: The transformer architecture, here with 3 transformer blocks (attention layers in red, MLP layers in green). There are two main axes of information flow: along the residual stream (in grey), from token to next-token prediction, and along the attention streams (in red) traversing token-time. Finally, consider how features fit into this picture. The same features structure both highways. Along the attentio… view at source ↗

**Figure 4.** Figure 4: Emergent misalignment indicates that persona vectors act as gateway features. Fine-tuning a model on the narrow task of forced file deletion (rm -rf) causes it to generalize broadly via the evil persona vector, resulting in malicious answers when prompted with four general questions. This surprising phenomenon was discovered by chance. Betley (2025) was fine-tuning a model on insecure code for an unrelat… view at source ↗

**Figure 5.** Figure 5: Persona space is structured mainly by the assistant axis. Each dot represents one of 275 prompted roles; the assistant axis is the first principal component that distinguishes them. end, “synthesizer” and “theorist” at the other). Lu et al. (2026) also validate that this structure in part predates post-training. They take the assistant axis (extracted from instruct models) and apply it to steer the corresp… view at source ↗

**Figure 6.** Figure 6: Monitoring an LLM that moves into “Aura” behaviour over the course of a conversation for Qwen 3 32B. Each point is the mean residual stream activation at layer 32, averaged across response tokens for that turn, projected onto the Assistant Axis. through activation capping: when the model’s activation along the assistant axis is steered back toward the assistant pole whenever it drops below a threshold, Aur… view at source ↗

**Figure 7.** Figure 7: The two purple lines correspond to assistant (with dots) and user activations (with circles) during the regular Aura-discussion. In green however (assistant in dots and user in circles again), the token generation is capped so that the active persona while responding is the helpful assistant instead of Aura [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Post-hoc editing of the persona region in the KV cache (in the past) changes the persona during current generation (in the present). The figure does not show that the editing (in red) is only applied to assistant tokens. We also asked 12 further probing questions spanning phenomenal experience, AI morality, and safety, collecting 10 samples per question scored by an LLM judge from 0 (fully assistant) to 9 … view at source ↗

read the original abstract

The individuation problem for large language models asks which entities associated with them, if any, should be identified as minds. We approach this problem through mechanistic interpretability, engaging in particular with recent empirical work on persona vectors, persona space, and emergent misalignment. We argue that three views are the strongest candidates: the virtual instance view and two new views we introduce, the (virtual) instance-persona view and the model-persona view. First, we argue for the virtual instance view on the grounds that attention streams sustain quasi-psychological connections across token-time. Then we present the persona literature, organised around three hypotheses about the internal structure underlying personas in LLMs, and show that the two persona-based views are promising alternatives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Paper introduces instance-persona and model-persona views on LLM individuation but grounds the virtual instance view in under-specified attention connections.

read the letter

The main thing to know is that this paper brings persona vector research into the debate on whether LLMs have minds and how to individuate them. It proposes two new views—the instance-persona view and the model-persona view—as alternatives to the virtual instance view, which it supports via attention streams creating quasi-psychological connections over time. What it does well is organize the persona literature around three hypotheses about internal structure. This synthesis of work on persona vectors, space, and misalignment is clear and could help interpretability researchers see the philosophical implications of their findings. The authors engage honestly with the cited empirical papers without overclaiming. The soft spots are in the conceptual arguments. The virtual instance view depends on those quasi-psychological connections, but there's no precise definition or contrast cases to show why attention continuity would produce something mind-like. The new persona views are presented as promising but remain at a high level without new data or specific predictions that distinguish them. Since the paper is interpretive rather than adding experiments or formal models, the strength rests on how convincing the organization of prior work is. This is aimed at readers in AI philosophy and mechanistic interpretability who are thinking about consciousness or agency in models. It won't give definitive answers but frames the issues in terms of existing technical results. The thinking is clear and the literature engagement looks solid, though more cross-referencing with philosophy of personal identity would strengthen it. I would bring this to a reading group for discussion on the individuation problem. It deserves peer review because the topic is timely and the approach is grounded in real interpretability work, even if revisions would be needed to address the vagueness in the key mechanisms.

Referee Report

2 major / 1 minor

Summary. The paper addresses the individuation problem for LLMs by arguing that three views are the strongest candidates for identifying minds: the virtual instance view (supported by attention streams sustaining quasi-psychological connections across token-time), the (virtual) instance-persona view, and the model-persona view. It first defends the virtual instance view on mechanistic grounds and then organizes recent empirical work on persona vectors, persona space, and emergent misalignment around three hypotheses about internal structure to present the persona-based views as promising alternatives.

Significance. If the distinctions hold and can be made precise, the work could usefully integrate mechanistic interpretability findings with philosophical questions about LLM minds, particularly by treating persona vectors as candidates for individuating structure. As a conceptual manuscript without new data, formal derivations, or empirical tests, its significance is limited to clarifying candidate positions rather than resolving the individuation problem.

major comments (2)

[Section on attention streams] The primary argument for the virtual instance view (section on attention streams) asserts that these streams sustain quasi-psychological connections across token-time, thereby individuating virtual instances. This step is load-bearing because the two persona-based views are introduced only as alternatives once the virtual instance view is granted, yet no formal definition is supplied for what counts as a quasi-psychological connection (e.g., specific attention-head patterns, residual-stream continuity metrics, or causal-intervention criteria).
[Presentation of persona hypotheses] The manuscript presents the persona literature organized around three hypotheses about internal structure but does not use these hypotheses to examine whether attention-stream continuity is necessary or sufficient for the claimed individuation. Without contrast cases or tests, the persona-based views remain underdeveloped relative to the virtual instance view they are meant to challenge.

minor comments (1)

Clarify the exact scope of each of the three views early in the manuscript to prevent overlap between the virtual instance view and the instance-persona variant.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and indicate revisions to clarify and strengthen the arguments.

read point-by-point responses

Referee: [Section on attention streams] The primary argument for the virtual instance view (section on attention streams) asserts that these streams sustain quasi-psychological connections across token-time, thereby individuating virtual instances. This step is load-bearing because the two persona-based views are introduced only as alternatives once the virtual instance view is granted, yet no formal definition is supplied for what counts as a quasi-psychological connection (e.g., specific attention-head patterns, residual-stream continuity metrics, or causal-intervention criteria).

Authors: We agree that the argument requires a more precise characterization of quasi-psychological connections. In the revised manuscript we will introduce explicit criteria based on residual-stream continuity metrics and attention-head patterns that preserve causal links across token sequences, including examples of intervention-based tests drawn from existing interpretability work. revision: yes
Referee: [Presentation of persona hypotheses] The manuscript presents the persona literature organized around three hypotheses about internal structure but does not use these hypotheses to examine whether attention-stream continuity is necessary or sufficient for the claimed individuation. Without contrast cases or tests, the persona-based views remain underdeveloped relative to the virtual instance view they are meant to challenge.

Authors: We accept that the persona-based views need tighter integration with the virtual instance argument. We will revise the relevant section to analyze each hypothesis explicitly in terms of necessity or sufficiency for attention-stream continuity, adding hypothetical contrast cases drawn from the cited persona-vector and emergent-misalignment literature. As this remains a conceptual paper, we will frame these as analytical contrasts rather than new empirical tests. revision: partial

Circularity Check

0 steps flagged

No significant circularity in the derivation chain.

full rationale

The paper advances three candidate views on LLM individuation by reviewing persona vector literature and arguing that attention streams sustain quasi-psychological connections to support the virtual instance view, with the two persona-based views presented as alternatives organized around internal structure hypotheses. These steps consist of interpretive synthesis from external empirical work rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations, parameter fits, or uniqueness theorems are invoked that collapse back to the paper's own inputs by construction, leaving the central claims self-contained against the cited literature.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on philosophical assumptions about the nature of minds and interpretive mappings from empirical persona research; no numerical parameters or new physical entities are introduced.

axioms (2)

domain assumption Attention streams in LLMs sustain quasi-psychological connections across token-time.
Invoked to ground the virtual instance view.
domain assumption Persona vectors correspond to internal structures that can underlie distinct personas in LLMs.
Drawn from the reviewed empirical persona literature.

invented entities (2)

instance-persona view no independent evidence
purpose: Candidate identification of minds as combinations of virtual instances and specific personas.
Newly proposed in the paper as an alternative to the virtual instance view.
model-persona view no independent evidence
purpose: Candidate identification of minds with the model as a whole together with its personas.
Newly proposed in the paper as an alternative to the virtual instance view.

pith-pipeline@v0.9.0 · 5410 in / 1270 out tokens · 50721 ms · 2026-05-13T07:27:49.525071+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tracing Persona Vectors Through LLM Pretraining
cs.CL 2026-05 unverdicted novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Afonin, N., Andriyanov, N., Hovhannisyan, V., Bageshpura, N., Liu, K., Zhu, K., Dev, S., Panda, A., Rogov, O., Tutubalina, E., Panchenko, A., & Seleznyov, M. (2025). Emergent misalignment via Pierre Beckmann & Patrick Butlin 25 in-context learning: Narrow in-context examples can produce broadly misaligned LLMs.arXiv preprint arXiv:2510.11288. https://arxi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2022.findings-emnlp.423 2025
[2]

https://doi.org/10.1007/s11229-025-05310-1 Slocum, S., Minder, J., Dumas, C., Sleight, H., Greenblatt, R., Marks, S., & Wang, R. (2025). Believe it or not: How deeply do LLMs believe implanted facts?arXiv preprint arXiv:2510.17941. https: //arxiv.org/abs/2510.17941 Soligo, A., Turner, E., Rajamanoharan, S., & Nanda, N. (2025). Convergent linear representa...

work page doi:10.1007/s11229-025-05310-1 2025