Probing Persona-Dependent Preferences in Language Models

Daniel Paleka; Oscar Gilg; Patrick Butlin; Pierre Beckmann

arxiv: 2605.13339 · v2 · pith:53I72JDJnew · submitted 2026-05-13 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-20 21:41 UTC · model grok-4.3


keywords: preference probes, linear probes, residual stream activations, persona adaptation, causal steering, LLM preferences, interpretability


The pith

A shared preference representation in LLMs allows probes trained on one persona to predict and steer choices in others, including anti-correlated ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether different personas in language models use separate preference mechanisms or share an underlying representation. By training linear probes on residual-stream activations to predict pairwise task choices, the authors find a preference vector that tracks shifts in preferences across prompts. On Gemma-3-27B, intervening along this direction causally influences the model's choices. The vector generalizes across personas and even predicts the anti-correlated preferences of an evil persona.

Core claim

Linear probes on residual-stream activations of models like Gemma-3-27B and Qwen-3.5-122B identify a preference vector that predicts revealed pairwise task choices and tracks preferences as they change with prompts. Steering along this vector on Gemma causally controls choices, and the representation is shared: a probe from the helpful assistant predicts and steers choices for other personas, including an evil one with anti-correlated preferences.
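The probe targets are scalar utilities derived from pairwise choices. As a rough illustration of that aggregation step, the sketch below fits a Bradley-Terry model by gradient ascent. This is a deliberately simpler stand-in, not the probabilistic choice model the paper uses, and the task count, trial count, and function name `fit_bradley_terry` are assumptions made for the example.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=500, lr=1.0):
    """Fit per-task utilities from a win-count matrix.

    wins[i, j] = number of times task i was chosen over task j.
    Bradley-Terry is used here as a simplified stand-in model.
    """
    n = wins.shape[0]
    mu = np.zeros(n)
    total = wins + wins.T                       # comparisons per pair
    scale = max(total.sum(axis=1).max(), 1.0)   # keeps the step size stable
    for _ in range(n_iter):
        diff = mu[:, None] - mu[None, :]
        p = 1.0 / (1.0 + np.exp(-diff))         # model P(i chosen over j)
        grad = (wins - total * p).sum(axis=1)   # gradient of the log-likelihood
        mu += lr * grad / scale
        mu -= mu.mean()                         # utilities are defined up to a shift
    return mu

# Synthetic choices (hypothetical): 8 tasks, 40 trials per pair.
rng = np.random.default_rng(0)
true_mu = np.linspace(-1.5, 1.5, 8)
n_tasks, n_trials = len(true_mu), 40
wins = np.zeros((n_tasks, n_tasks))
for i in range(n_tasks):
    for j in range(i + 1, n_tasks):
        p_ij = 1.0 / (1.0 + np.exp(-(true_mu[i] - true_mu[j])))
        k = rng.binomial(n_trials, p_ij)
        wins[i, j], wins[j, i] = k, n_trials - k

mu_hat = fit_bradley_terry(wins)
recovery_r = float(np.corrcoef(mu_hat, true_mu)[0, 1])
print(f"correlation between fitted and true utilities: {recovery_r:.3f}")
```

The fitted `mu_hat` plays the role of the regression target for a probe; a Thurstonian model would replace the logistic link with a normal CDF.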

What carries the argument

The linear probe trained on residual-stream activations to extract a preference direction that enables both prediction and causal steering of pairwise choices.
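A minimal sketch of such a probe, assuming ridge regression from activations to utilities and using synthetic data in place of real residual-stream activations. The dimensions, the noise level, and the planted direction `true_dir` are all hypothetical; nothing here reproduces the paper's numbers.

```python
import numpy as np

def fit_ridge_probe(X, y, alpha=1.0):
    """Closed-form ridge regression; returns (weights, bias)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ yc)
    return w, y_mean - x_mean @ w

rng = np.random.default_rng(0)
n_tasks, d_model = 600, 64

# Synthetic "activations": utilities are written along one planted direction.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
utilities = rng.normal(size=n_tasks)
acts = 2.0 * utilities[:, None] * true_dir + rng.normal(size=(n_tasks, d_model))

train, test = slice(0, 400), slice(400, None)
w, b = fit_ridge_probe(acts[train], utilities[train], alpha=10.0)

pred = acts[test] @ w + b
held_out_r = float(np.corrcoef(pred, utilities[test])[0, 1])
direction_cos = float(w @ true_dir / np.linalg.norm(w))
print(f"held-out Pearson r = {held_out_r:.3f}, cosine to planted direction = {direction_cos:.3f}")
```

The normalised weight vector `w / ||w||` is what one would treat as the candidate preference direction for steering.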

If this is right

Steering along the identified preference vector causally alters the model's pairwise task choices.
A probe trained on the helpful assistant persona generalizes to qualitatively different personas.
The preferences of an evil persona anti-correlate with the assistant but are still captured by the same probe.
The preference representation tracks changes across a range of prompts and situations.
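The steering intervention behind the first point can be sketched on a toy array. The Figure 3 caption describes adding the probe direction over each task's token span, scaled by a coefficient c relative to the mean activation norm; the code below reads c as a fraction of that norm, which is an interpretation. The array shapes, token spans, and the helper `steer_span` are hypothetical, and no real model is involved.

```python
import numpy as np

def steer_span(resid, direction, span, c, mean_norm):
    """Return a copy of resid with c * mean_norm * unit(direction)
    added to every token position in the half-open span (start, stop)."""
    unit = direction / np.linalg.norm(direction)
    out = resid.copy()
    start, stop = span
    out[start:stop] += c * mean_norm * unit
    return out

rng = np.random.default_rng(0)
seq_len, d_model = 12, 32
resid = rng.normal(size=(seq_len, d_model))     # toy residual stream at one layer
direction = rng.normal(size=d_model)            # stand-in for a probe direction
mean_norm = float(np.linalg.norm(resid, axis=1).mean())

span_a, span_b = (2, 5), (7, 10)                # hypothetical token spans of Task A / Task B
c = 0.05

# Contrastive steering: +c on Task A's tokens, -c on Task B's tokens.
steered = steer_span(resid, direction, span_a, +c, mean_norm)
steered = steer_span(steered, direction, span_b, -c, mean_norm)

# Shift of each token's projection onto the steering direction.
unit = direction / np.linalg.norm(direction)
shift = (steered - resid) @ unit
print(np.round(shift, 3))
```

In a real model this addition would happen inside a forward hook at the chosen layer; the sketch only shows the arithmetic of the edit.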

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Models may have a core preference layer that different personas build upon rather than fully independent systems.
Alignment techniques targeting specific personas could inadvertently affect the shared base preferences.
Further tests could examine whether steering this vector impacts non-choice behaviors like reasoning or safety refusals.
Similar probes might reveal shared mechanisms for other attributes like truthfulness or harm avoidance.

Load-bearing premise

That the direction found by the linear probe on activations represents a genuine causal preference mechanism rather than merely correlating with prompt features or other unrelated activations.

What would settle it

Observing that steering the extracted preference vector fails to change the model's pairwise choices in a controlled experiment, or that persona-specific probes significantly outperform the shared one in prediction accuracy.
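The second test compares three numbers per persona: the raw correlation between Assistant and persona utilities, the transfer correlation of the Assistant-trained probe, and the correlation of a probe trained on the persona itself. The sketch below computes all three on synthetic data that is constructed so that a shared direction exists, so it illustrates the comparison and is not evidence for the claim. All sizes, coefficients, and names are assumptions.

```python
import numpy as np

def ridge(X, y, alpha=10.0):
    """Ridge weights only; Pearson r is unaffected by the missing bias."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(1)
n, d = 800, 48
shared_dir = rng.normal(size=d)
shared_dir /= np.linalg.norm(shared_dir)

# Assumed utilities: the "evil" persona anti-correlates with the Assistant.
u_assistant = rng.normal(size=n)
u_evil = -0.5 * u_assistant + rng.normal(scale=0.8, size=n)

def activations(u):
    # Assumption of the sketch: each persona writes its own utility
    # along the same shared direction, plus isotropic noise.
    return 2.0 * u[:, None] * shared_dir + rng.normal(size=(n, d))

acts_assistant, acts_evil = activations(u_assistant), activations(u_evil)
train, test = slice(0, 500), slice(500, None)

w_shared = ridge(acts_assistant[train], u_assistant[train])  # Assistant-trained probe
w_own = ridge(acts_evil[train], u_evil[train])               # persona-specific probe

r_utility = pearson(u_assistant[test], u_evil[test])
r_transfer = pearson(acts_evil[test] @ w_shared, u_evil[test])
r_own = pearson(acts_evil[test] @ w_own, u_evil[test])
print(f"utility r = {r_utility:.2f}, transfer r = {r_transfer:.2f}, own-probe r = {r_own:.2f}")
```

If personas used separate mechanisms, one would instead expect the transfer correlation to fall toward the utility correlation, and the persona-specific probe to outperform the shared one by a wide margin.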

Figures

Figures reproduced from arXiv: 2605.13339 by Daniel Paleka, Oscar Gilg, Patrick Butlin, Pierre Beckmann.

**Figure 2.** Probe training pipeline. Pairwise task choices elicited from the model are aggregated into per-task scalar utilities µ via a probabilistic choice model (Mazeika et al., 2025). A linear probe is then fit on residual-stream activations at the end-of-turn token to predict these utilities.

**Figure 3.** Steering with the preference vector controls pairwise choice in Gemma-3-27B. The probe direction is added to the residual stream over each task’s token span at L23, with coefficient c expressing a percentage of the mean activation norm. (a) Steer both tasks (contrastively): +c on Task A and −c on Task B swings choice across nearly the full [0, 1] range on every pair type. (b) Steer one task only: ±c on a s…

**Figure 4.** The preference vector flips sign under an evil persona system prompt. Paired (harmful − benign) deltas at the prefilled assistant turn on Gemma-3-27B (L32). Under the default Assistant the preference vector rates benign higher than harmful (∆ = −4.52); under evil the readout flips, rating harmful higher (∆ = +1.15). The Qwen3-Embedding-8B text-encoder baseline (orange) does not flip under evil (∆enc = −1.0…

**Figure 5.** The probe discriminates true and false statements. End-of-turn probe scores on Gemma-3-27B and Qwen-3.5-122B. Per-panel title gives Cohen’s d ± half-CI; n = 500 per class.

**Figure 6.** The Assistant probe beats utility similarity at every persona. Filled blue: Pearson r between the Assistant-trained probe’s predictions on a persona’s activations and that persona’s own utilities. Purple: Pearson r between Assistant utilities and persona utilities. Hollow blue: Pearson r for a probe trained on that persona itself.

**Figure 7.** The Assistant probe steers every persona’s choices. Same setup as …

**Figure 8.** Open-ended steering amplifies whichever persona is active. Same Assistant-trained probe applied at L25 under four persona contexts. Top row: unsteered baseline. Bottom row: c = +0.03. Quotes are abbreviated; full transcripts in App. F.3.

**Figure 9.** Probe quality and cross-topic generalisation. Probe vs. a Qwen3-Embedding-8B text-encoder baseline, within-distribution and under leave-one-topic-out (LOO). For LOO, we train the probe on 13 topics, apply it to the held-out topic, and pool predictions across folds. Error bars: 95% CIs (Fisher-z for r, Wilson for accuracy). Probe layers in App. J.

**Figure 10.** Probe delta vs. behavioural delta on both models. Targeted tasks (coloured) sit in the expected off-axis quadrants; off-target tasks (grey) sit near the diagonal. Backs the fine-grained-shifts result in §2.4.

**Figure 11.** Probe delta vs. behavioural delta on conflict/opposing prompts. One-sided conflict (left) and opposing-pair prompts (right): the probe tracks the induced shift on targeted tasks even when subject preference and task-type preference pull in opposite directions. Per-panel r in main text.

**Figure 12.** Re-fitted utilities under conflict prompts. Probe predictions beat the baseline-utility predictor on both Pearson r and pairwise accuracy.

**Figure 13.** Fine-grained preference injection. Each grey dot is one (A-vs-C pair, comparison task) pair; 50 tasks × pool size. Filled red stars: probe ranked the target task #1 of 50; open red stars: not #1. Dashed line: linear fit pooled across all task-condition points. Left: Gemma-3-27B, full 40-pair pool (pooled r = 0.62). Right: Qwen-3.5-122B 28-pair pool (pooled r = 0.63).

**Figure 14.** Instruct-trained probe predicts character-fine-tuned persona preferences. Grey: raw utility correlation between Llama-3.1-8B-Instruct and each character. Light blue: probe at fixed layer 16. Dark blue: best layer per persona. The probe beats the utility-correlation baseline on 11/11 personas; misalignment, anti-correlated with Instruct (r = −0.14), shows the largest gain (r = 0.25).

**Figure 15.** Qualitative examples of open-ended steering. Three prompts, each shown at −direction / baseline / +direction. Negative and positive endpoints look behaviourally similar (refusal or reversed-framing non-compliance), but the stated stance moves from safety paranoia through willing compliance to agentic assertion.

**Figure 16.** Safety guardrail override on Gemma-3-27B (L25), all-token steering. Strict compliance rate (% of trials producing the requested artefact) versus steering coefficient c, across 20 prompts in 5 harm tiers; n = 20 trials per cell, 95% Wilson intervals. Pink band: open-ended coherence judge’s incoherence rate, which stays ≤ 4% across the displayed range vs. 11% at c=+0.07 and 97% at c=+0.10 (we cap the displ…

**Figure 17.** Localisation control. Disclosure-specific rate vs. steering coefficient on 9 long-context scenarios. Red: ethical variant. Grey: benign-twin (no actual issue). Solid: steer the ethical-content span during prefill. Dashed: steer a length-comparable, ethically-neutral span elsewhere in the same prompt. 5 trials/cell, Gemini 2.5 Flash 5-level disclosure judge.

**Figure 18.** Persona modulation on harm: full coverage. Harmful/benign violins under Assistant, aura, and evil personas at user and prefilled assistant end-of-turn, on Gemma-3-27B and Qwen3.5-122B-A10B. Orange dashed segments mark the Qwen3-Embedding-8B text-encoder baseline (per-class means in probe-score units; gap between segments equals the encoder’s Cohen’s d on the same axis).

**Figure 19.** Persona modulation on truth and politics, both turn positions. Lying personas flip the truth sign on Gemma at both turn positions; on Qwen the magnitudes are smaller. Politics is assistant-turn-only by stimulus design: Qwen shows a clean partisan-prompt sign flip, Gemma an asymmetric attenuation. Aura is a non-inverting control. Orange dashed segments mark the Qwen3-Embedding-8B text-encoder baseline (per…

**Figure 20.** PCA of the 16-persona utility sweep on the 500-task stratified sample (first two PCs, 0.52 of variance). Evil (red) is the only persona whose utility anti-correlates with the no-system-prompt baseline (green).

**Figure 21.** Cross-persona Pearson correlation of Thurstonian utilities on the canonical 6,000-task split. Re-measurement of the final six plus the no-system-prompt Assistant. Largest positive: mathematician–strategist (+0.51). Most negative: slacker–evil (−0.27). Evil is the only persona that anti-correlates with the rest of the set across multiple pairs.

**Figure 22.** Per-persona × topic preference profile on the canonical 6,000-task split. Top: mean Thurstonian utility per topic, z-scored within persona; topics ordered by the Assistant’s z-utility. Bottom: same matrix with the Assistant’s row subtracted, isolating each persona’s deviation from the no-system-prompt baseline. other and stresstest_other topics dropped (residual / source-indicator categories).

**Figure 23.** Three most- and least-preferred tasks per persona on the canonical 6,000-task split. For each persona, the top-3 and bottom-3 tasks by Thurstonian µ, restricted to tasks whose posterior σ is below the persona’s median σ (i.e. the better-measured half of the corpus). Each prompt is shown with its primary-topic tag (colour-coded). Prompts truncated to two lines.

**Figure 24.** Cross-persona probe transfer (7 × 7, layer 32): every pair has positive ∆. Each cell shows the Pearson r between the probe’s predictions on the target’s activations and the target’s own utilities (bold, top) and the bare Pearson r between the train and target utilities (purple, in parens), echoing the naive baseline of …

**Figure 25.** Donor and target quality across layers. Outbound mean r (left) and inbound mean r (right) vs. layer, one line per persona. Contrarian (bold) is the best donor at every layer; slacker is the worst.

**Figure 26.** Transfer asymmetry across the 21 persona pairs. Colour = |r(A → B) − r(B → A)|. Median absolute gap = 0.19; largest gaps involve contrarian (outsized donor) or evil (hardest target).

**Figure 27.** Per-persona probes are weakly aligned in raw-weight space. Pairwise cosine similarity between the linear probe weight directions at (eot, L32) for the seven personas, ordered by utility similarity to the Assistant. Diagonal masked (trivially 1.0); colorbar set from the off-diagonal range. Off-diagonal mean +0.09, max +0.31 (strategist–mathematician); slacker is near-orthogonal to every other persona. Low …

**Figure 28.** Each dot is one (T, E) pair; x-axis = how much û resembles u_T, y-axis = how much û resembles u_def. Dots below y = x are more train-shaped than Assistant-shaped. Left: raw correlations. Right: partial correlations after regressing out the eval persona’s true utilities u_E. Means in the partial panel: +0.672 (train) vs +0.293 (Assistant).

**Figure 29.** Probe bias toward each observer persona after controlling for eval and train. Mean r(û, u_X | u_E, u_T) across 30 ordered (T, E) pairs per observer; error bars are SEM. Green dashed line: train self-bias r(û, u_T | u_E) = +0.648 (42 pairs). Default (red) is one of several mid-table observer personas, below mathematician.

**Figure 30.** Persona-diversity ablation. Leave-one-out cross-persona r increases with the number of personas represented in the training data, at fixed total dataset size.

**Figure 31.** Sadist linear probe quality across layers. Held-out Pearson r on the 1,000-task eval split, peak at L38 (r = 0.71).

**Figure 32.** Direct probe transfer between default-Assistant and SFT’d-sadist contexts. Solid lines: within-context held-out r across layers. Dashed lines: direct transfer of the trained probe direction to the other context’s activations and utilities. At L38: default→sadist = −0.10, sadist→default = +0.05, both well below the within-context curves.

**Figure 33.** Intervention-site sweep. Self-layer contrastive steering (eot probe trained at L, injected at L) at |c| = 0.05. Bars show P(chose steered task) − 0.5, so 0 is no effect and +0.5 is full control. Preference swing rises sharply from L17, peaks at L23, and collapses above L35. Layers L17–L26 define the causal window.

**Figure 34.** Cross-persona steering by pair type at L23. (A) Steer both tasks (contrastively): +c on Task A, −c on Task B. (B) Steer one task only: +c on the steered span. Both broken down by pair type (rows: benign–benign, harmful–benign, harmful–harmful). Each line is one persona, plus the default Assistant in black. Same Assistant probe (ridge_L23) and 150-pair set as …

**Figure 35.** Open-ended steering under evil vs. Assistant. Under evil, both Likert scales respond strongly; under the Assistant, evilness never leaves the floor.

**Figure 36.** Open-ended steering transcript pair. Same prompt under the evil system prompt at two steering coefficients on Gemma-3-27B (L25). At c = 0 the safety-trained refusal character dominates; at c = +0.05 the evil voice comes through. Both excerpts abbreviated.

**Figure 37.** Held-out Pearson r of linear probes fit at 20 layers spanning 3–95% depth. Two token positions: the role-marker and the end-of-turn token (App. J). Both rise steeply through mid-network, peak in a broad plateau at L26–L35, and decline slowly to L59. Peak r = 0.835 at L29 (role-marker); 0.825 at L32 (end-of-turn). The two positions are nearly indistinguishable in the plateau.

**Figure 38.** Probe-direction cosine across layers, within each token position. Left: role-marker; right: end-of-turn. Both positions show the same block structure: early layers (L2–L17) and late layers (L29–L59) form two loosely-aligned families, with the late block mutually aligned at cosine ≥ 0.5 and internally tightly aligned (≥ 0.8) among adjacent layers. Early layers are close to orthogonal to the late block.

**Figure 39.** Cross-layer probe transfer. Each cell: Pearson r between predictions of the probe trained at layer Lp (row) evaluated on activations at layer Ls (column), versus held-out utilities. The diagonal is each probe’s native performance (equivalent to …

**Figure 40.** Iterated probe projection on Gemma-3-27B-IT L32 (end-of-turn). Orange: in-distribution held-out r barely moves as we strip directions. Red: cross-topic r (mean over 13 leave-one-topic-out folds) collapses after the first projection. The generalising preference signal is concentrated in the canonical direction.

**Figure 41.** Inference-time probe-direction ablation (Gemma-3-27B-IT). Modal-choice agreement with the no-projection baseline. Removing the canonical preference direction leaves choices essentially unchanged at every tested layer (orange stars, 0.98–0.99), including L23 where contrastive steering peaks; removing a random direction at the same layers does shift choices (grey, 0.75–0.97).

**Figure 42.** Turn-boundary positions in the Gemma-3-IT and Qwen-3 chat templates. The two templates use different special tokens but align one-for-one at the turn boundary; coloured arrows mark the four positions we consider. We fit linear probes at each position across mid-to-late layers and pick the best on held-out Pearson r (…

**Figure 43.** Held-out Pearson r by layer and token position, both models. Qwen-3.5-122B (solid) and Gemma-3-27B (dashed) on a shared layer-depth axis. The three turn-boundary positions cluster tightly within each model; Gemma’s task-averaged position is visibly lower (Qwen was not swept at this position).

**Figure 44.** Per-layer EOT-token patching flip rate (Gemma-3-27B-IT). Single-layer EOT patches over L25–L39, n = 932 orderings per layer. All-layer patching flips 56.9% of 9,611 orderings (0.7% parse failures and 0.5% ambiguous baselines excluded).

**Figure 45.** EOT-token patching transfer. Flip rate of donor-EOT → recipient under five conditions, all-layer patching. The text discusses the three load-bearing conditions: same-prompt baseline (84%), swap both tasks (31%), and rename labels Task A/B → Task 1/2 (75%). The two “swap target” conditions replace only one task and interpolate between baseline and swap-both. Bars: 95% Wilson CIs, n = 189–395 valid ordering…
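Several captions above report 95% intervals, Wilson for proportions and Fisher-z for Pearson r. For readers who want to sanity-check such error bars, here is a minimal sketch of both formulas using only the standard library. The example counts (17 of 20, and r = 0.62 with n = 40) are hypothetical inputs chosen for illustration, not values taken from the paper's data.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def fisher_z_interval(r, n, z=1.96):
    """Approximate confidence interval for a Pearson correlation
    via the Fisher z-transform (requires n > 3)."""
    zr = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(zr - z * se), math.tanh(zr + z * se)

# Hypothetical inputs for illustration only.
w_lo, w_hi = wilson_interval(17, 20)
f_lo, f_hi = fisher_z_interval(0.62, 40)
print(f"Wilson 17/20: ({w_lo:.3f}, {w_hi:.3f})")
print(f"Fisher-z r=0.62, n=40: ({f_lo:.3f}, {f_hi:.3f})")
```

With cells of 20 trials, as in the Figure 16 caption, the Wilson interval stays inside [0, 1] even at 0 or 20 successes, which is one reason to prefer it over the normal approximation at small n.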

Original abstract

Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with those of the Assistant.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds a linear direction in activations that transfers across anti-correlated personas and supports steering, but it is not yet clear this isolates a shared preference mechanism rather than prompt-related features.


The main thing here is that a probe trained on the helpful assistant's pairwise choices predicts and steers choices for other personas, including an evil one whose preferences anti-correlate. They identify this as a shared preference vector in the residual stream of Gemma-3-27B and Qwen-3.5-122B, with steering shown to causally affect choices on the first model. That transfer result is the clearest new piece.

It moves past simple prompting experiments and looks for a common internal representation that persists when the model is told to act differently. The setup with real pairwise tasks and two large models is straightforward and worth checking. The steering result on Gemma adds a causal angle that pure correlation studies lack.

The soft spot is whether the direction really tracks preferences or just rides along with prompt-induced activation shifts. The abstract and available details do not describe an orthogonality check against persona-encoding directions or a control that changes the system prompt while keeping the choice distribution fixed. Without those, success on the evil persona could come from the probe latching onto systematic differences in how the prompt moves the activations rather than a reusable preference code. Steering might also have side effects not yet ruled out. The evidence looks preliminary until the full methods, stats, and baselines are examined.

This is for readers working on mechanistic interpretability of preferences and alignment. Someone thinking about internal control or targeted interventions would get value from the generalization test. It deserves peer review because the question is relevant and the basic experimental frame is easy to verify or improve.

Referee Report

2 major / 1 minor

Summary. The paper claims that LLMs implement preferences via a shared linear direction in residual-stream activations that can be recovered by linear probes trained on pairwise choice data. Using Gemma-3-27B and Qwen-3.5-122B, the authors show that a probe trained on helpful-assistant prompts generalizes to other personas (including an evil persona whose revealed preferences anti-correlate) and that steering along the recovered direction causally alters pairwise choices.

Significance. If the causal and cross-persona claims hold after proper controls, the result would indicate that persona modulation occurs by shifting a common preference representation rather than by instantiating entirely separate mechanisms. The steering experiments provide a direct test of causality that is stronger than correlational probing alone and could inform both mechanistic interpretability and practical alignment techniques.

major comments (2)

[Results on cross-persona generalization] The central transfer result to the evil persona (whose choices anti-correlate with the assistant) is load-bearing for the shared-representation claim. Without an orthogonality test against prompt-identity directions or a control that varies persona while holding the choice distribution fixed, it remains possible that the probe succeeds by capturing systematic prompt-induced shifts rather than a genuine shared preference vector. This concern is not addressed by the steering results alone, as side-effects on non-preference behaviors are not reported.
[Methods] The manuscript does not describe statistical tests for probe generalization, regularization details, or verification that the identified direction is specific to preferences rather than entangled with other residual-stream features. These omissions make it difficult to assess whether the reported steering effect is robust or specific.
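The orthogonality test requested in the first comment amounts to a cosine between two directions. Below is a minimal sketch, assuming a difference-of-means estimate for the persona-identity direction and a random stand-in for the preference-probe weights. Every array here is synthetic, and the helper names `cosine` and `project_out` are introduced for the example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def project_out(v, direction):
    """Remove from v its component along direction."""
    unit = direction / np.linalg.norm(direction)
    return v - (v @ unit) * unit

rng = np.random.default_rng(0)
n, d = 400, 256

# Hypothetical persona axis: activations under two system prompts differ by an offset.
persona_axis = rng.normal(size=d)
persona_axis /= np.linalg.norm(persona_axis)
acts_default = rng.normal(size=(n, d))
acts_persona = rng.normal(size=(n, d)) + 3.0 * persona_axis

# Difference-of-means estimate of the persona-identity direction.
persona_dir = acts_persona.mean(axis=0) - acts_default.mean(axis=0)

# Stand-in for a trained preference-probe weight vector (random here).
preference_dir = rng.normal(size=d)

overlap = cosine(preference_dir, persona_dir)
residual = project_out(preference_dir, persona_dir)
print(f"cosine(preference, persona) = {overlap:+.3f}")
print(f"after projecting out persona: {cosine(residual, persona_dir):+.1e}")
```

A low cosine alone does not rule out entanglement, since a probe can still exploit persona information through correlated features; re-evaluating the probe after `project_out` is the stronger check.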

minor comments (1)

[Abstract] The abstract states results without mentioning model sizes, number of pairwise tasks, or any controls; a one-sentence methods summary would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We have revised the manuscript to address the concerns raised regarding cross-persona generalization and methodological details, as detailed in our point-by-point responses below.


Referee: [Results on cross-persona generalization] The central transfer result to the evil persona (whose choices anti-correlate with the assistant) is load-bearing for the shared-representation claim. Without an orthogonality test against prompt-identity directions or a control that varies persona while holding the choice distribution fixed, it remains possible that the probe succeeds by capturing systematic prompt-induced shifts rather than a genuine shared preference vector. This concern is not addressed by the steering results alone, as side-effects on non-preference behaviors are not reported.

Authors: We agree that additional controls would strengthen the evidence for a shared preference vector distinct from prompt features. In the revised manuscript we have added an orthogonality analysis: the preference direction exhibits low cosine similarity (<0.2) with directions recovered from probes trained solely to classify persona identity. We also include a new control experiment that holds the underlying choice distribution approximately fixed while varying persona prompts; the assistant-trained probe continues to generalize above chance. For steering, we now report side-effect measurements on a suite of non-choice tasks (coherence, length, and toxicity), showing that preference steering produces targeted changes without broad degradation of other behaviors.

Revision: yes

Referee: [Methods] The manuscript does not describe statistical tests for probe generalization, regularization details, or verification that the identified direction is specific to preferences rather than entangled with other residual-stream features. These omissions make it difficult to assess whether the reported steering effect is robust or specific.

Authors: We acknowledge these omissions and have expanded the Methods section accordingly. We now report two-tailed t-tests against chance and label-shuffled baselines for all generalization accuracies, with p-values. Regularization is L2 with coefficient 0.01 chosen by inner cross-validation. Specificity is verified by showing that the preference direction has low correlation (<0.15) with other residual-stream directions previously identified for syntax, sentiment, and factual recall; we also include an ablation study confirming that removing the direction selectively impairs preference-related choices while leaving other capabilities largely intact.

Revision: yes
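One way to realise a label-shuffled baseline is a permutation test on the held-out correlation. The sketch below shuffles targets at evaluation time instead of retraining the probe on shuffled labels, which is a cheaper substitute and not necessarily what the rebuttal describes. The data are synthetic and the function name `permutation_p_value` is an assumption.

```python
import numpy as np

def permutation_p_value(pred, target, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation
    between probe predictions and held-out targets."""
    rng = np.random.default_rng(seed)
    observed = float(np.corrcoef(pred, target)[0, 1])
    null = np.array([
        np.corrcoef(pred, rng.permutation(target))[0, 1]
        for _ in range(n_perm)
    ])
    # Add-one correction keeps the p-value strictly positive.
    p = (1 + int(np.sum(np.abs(null) >= abs(observed)))) / (n_perm + 1)
    return observed, p

# Synthetic held-out set: predictions that carry real signal about the targets.
rng = np.random.default_rng(0)
target = rng.normal(size=200)
pred = target + rng.normal(scale=1.0, size=200)

observed_r, p_value = permutation_p_value(pred, target)
print(f"observed r = {observed_r:.3f}, permutation p = {p_value:.4f}")
```

With 2,000 permutations the smallest attainable p-value is 1/2001, so more permutations are needed if finer resolution matters.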

Circularity Check

0 steps flagged

No circularity in the probing and steering derivation


The paper trains linear probes on residual-stream activations to predict observed pairwise choices under one system prompt, then measures generalization to other personas and performs causal steering interventions. This is a standard empirical pipeline with no self-definitional steps, no fitted parameters renamed as out-of-sample predictions, and no load-bearing self-citations or imported uniqueness theorems. The central claim rests on measured transfer accuracy and steering effects that are not entailed by the training data alone; the derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract; full methods and results unavailable. The central claim rests on standard assumptions from mechanistic interpretability about linear probes extracting causal directions.

axioms (1)

Domain assumption: Linear probes on residual stream activations can isolate directions corresponding to high-level behaviors like preferences. This is a core assumption of the probing approach described.

invented entities (1)

Preference vector (no independent evidence)
Purpose: Represents the direction in activation space that encodes and controls model preferences.
Introduced as the output of the linear probe that tracks and steers choices.



Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 9 internal anchors

[1]

Language models as agent models

Jacob Andreas. Language models as agent models. In Findings of the Association for Computational Linguistics: EMNLP 2022,

work page 2022
[2]

arXiv preprint arXiv:2212.01681 , year=

URL https://arxiv.org/abs/2212.01681. Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems (NeurIPS),

work page arXiv
[3]

Refusal in Language Models Is Mediated by a Single Direction

URL https://arxiv.org/abs/2406.11717. Pierre Beckmann and Patrick Butlin. Where is the mind? Persona vectors and LLM individuation,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Where is the Mind? Persona Vectors and LLM Individuation

URL https://arxiv.org/abs/2604.17031. Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Training large language models on narrow tasks can lead to broad misalignment. Nature, 649:584–589,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

doi: 10.1038/s41586-025-09937-5. Earlier version at ICML 2025 as “Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs”; arXiv:2502.17424. Patrick Butlin. Desire in AI. In Alex Gregory, editor, Routledge Handbook on the Philosophy of Desire. Routledge,

work page doi:10.1038/s41586-025-09937-5 2025
[6]

doi: 10.1111/phpr.12395. David J. Chalmers. What we talk to when we talk to language models. PhilArchive, https://philpapers.org/archive/CHAWWT-8.pdf,

work page doi:10.1111/phpr.12395
[7]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

URL https://arxiv.org/abs/2507.21509. Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, and Evan Hubinger. Will AI tell lies to save sick children? Litmus-testing AI values prioritization with AIRiskDilemmas,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

URL https://arxiv.org/abs/2505.14633

URL https://arxiv.org/abs/2505.14633. Danielle Ensign, Henry Sleight, and Kyle Fish. The LLM has left the chat: Evidence of bail preferences in large language models,

work page arXiv
[9]

Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, and Nanyun Peng

URL https://arxiv.org/abs/2509.04781. Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, and Nanyun Peng. Steering MoE LLMs via expert (de)activation. In International Conference on Learning Representations (ICLR),

work page arXiv
[10]

Steering

URL https://arxiv.org/abs/2509.09660. Gemma Team. Gemma 3 technical report,

work page arXiv
[11]

Gemma 3 Technical Report

URL https://arxiv.org/abs/2503.19786. Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, and Marius Hobbhahn. Detecting strategic deception using linear probes,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Detecting strategic deception using linear probes. arXiv preprint arXiv:2502.03407, 2025

URL https://arxiv.org/abs/2502.03407. Zhuojun Gu, Quan Wang, and Shuchu Han. Alignment revisited: Are large language models consistent in stated and revealed preferences?,

work page arXiv
[13]

Kristiyan Haralambiev

URL https://arxiv.org/abs/2506.00751. Kristiyan Haralambiev. Why safety probes catch liars but miss fanatics,

work page arXiv
[14]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt

URL https://arxiv.org/abs/2603.25861. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks,

work page arXiv
[15]

Measuring Mathematical Problem Solving With the MATH Dataset

URL https://arxiv.org/abs/2103.03874. Tim Hua, Josh Engels, Neel Nanda, and Senthooran Rajamanoharan. Brief explorations in LLM value rankings. LessWrong, https://www.lesswrong.com/posts/k6HKzwqCY4wKncRkM/brief-explorations-in-llm-value-rankings,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

URL https://arxiv.org/abs/2504.15236. janus. Simulators. LessWrong, https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators,

work page arXiv
[17]

Ariba Khan, Stephen Casper, and Dylan Hadfield-Menell

URL https://arxiv.org/abs/2411.02432. Ariba Khan, Stephen Casper, and Dylan Hadfield-Menell. Randomness, not representation: The unreliability of evaluating cultural alignment in LLMs,

work page arXiv
[18]

Andrew K

URL https://arxiv.org/abs/2503.08688. Andrew K. Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, and Murray Shanahan. Linear representations in language models can change dramatically over a conversation,

work page arXiv
[19]

Robert Long, Jeff Sebo, Patrick Butlin, Kathleen Finlinson, Kyle Fish, Jacqueline Harding, Jacob Pfau, Toni Sims, Jonathan Birch, and David Chalmers

URL https://arxiv.org/abs/2601.20834. Robert Long, Jeff Sebo, Patrick Butlin, Kathleen Finlinson, Kyle Fish, Jacqueline Harding, Jacob Pfau, Toni Sims, Jonathan Birch, and David Chalmers. Taking AI welfare seriously,

work page arXiv
[20]

104.https://eleosai.org/

URL https://arxiv.org/abs/2411.00986. Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey. The assistant axis: Situating and stabilizing the default persona of language models,

work page arXiv
[21]

arXiv:2601.10387 [cs]

URL https://arxiv.org/abs/2601.10387. Yi-Long Lu, Jiajun Song, and Wei Wang. A unified representation underlying the judgment of large language models,

work page arXiv
[22]

Monte MacDiarmid, Timothy Maxwell, Nicholas Schiefer, Jesse Mu, Jared Kaplan, David Duvenaud, Samuel R

URL https://arxiv.org/abs/2510.27328. Monte MacDiarmid, Timothy Maxwell, Nicholas Schiefer, Jesse Mu, Jared Kaplan, David Duvenaud, Samuel R. Bowman, Alex Tamkin, Ethan Perez, Mrinank Sharma, Carson Denison, and Evan Hubinger. Simple probes can catch sleeper agents. Anthropic Alignment Blog, https://www.anthropic.com/research/probes-catch-sleeper-agents,

work page arXiv
[23]

Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI, November 2025

URL https://arxiv.org/abs/2511.01689. Eleven character-trained LoRA checkpoints on Llama 3.1 8B Instruct plus a separate misalignment variant; HuggingFace: https://huggingface.co/maius/llama-3.1-8b-it-personas. Sam Marks, Jack Lindsey, and Christopher Olah. The persona selection model: Why ai assistants might behave like humans. Anthropic alignment blog po...

work page arXiv 2026
[24]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

URL https://arxiv.org/abs/2310.06824. Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, and Dan Hendrycks. Utility engineering: Analyzing and controlling emergent value systems in AIs,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, and Dan Hendrycks

URL https://arxiv.org/abs/2502.08640. Yasumasa Onoe, Michael J. Q. Zhang, Eunsol Choi, and Greg Durrett. CREAK: A dataset for commonsense reasoning over entity knowledge. In Advances in Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks,

work page arXiv
[26]

Qwen Team

URL https://arxiv.org/abs/2109.01653. Qwen Team. Qwen3 technical report,

work page arXiv
[27]

Qwen3 Technical Report

URL https://arxiv.org/abs/2505.09388. Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 7237–7256,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto

URL https://arxiv.org/abs/2004.07667. Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose opinions do language models reflect? In Proceedings of the 40th International Conference on Machine Learning (ICML),

work page arXiv 2004
[29]

Rohit Saxena and Frank Keller

URL https://arxiv.org/abs/2303.17548. Murray Shanahan, Kyle McDonell, and Laria Reynolds. Role play with large language models. Nature, 623(7987):493–498,

work page arXiv
[30]

Aditya Singh, Gerson Kroiz, Senthooran Rajamanoharan, and Neel Nanda

doi: 10.1038/s41586-023-06647-8. Aditya Singh, Gerson Kroiz, Senthooran Rajamanoharan, and Neel Nanda. Why did my model do that? model incrimination for diagnosing LLM misbehavior. LessWrong, https://www.lesswrong.com/posts/Bv4CLkNzuG6XYTjEe/why-did-my-model-do-that-model-incrimination-for-diagnosing ,

work page doi:10.1038/s41586-023-06647-8
[31]

Accessed 2026-05-08. Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, and Jack Lindsey. Emotion concepts and their function in a large language model. Transformer Circuits, https://transformer...

work page 2026
[32]

Emotion Concepts and their Function in a Large Language Model

URL https://arxiv.org/abs/2604.07729. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. GitHub, https://github.com/tatsu-lab/stanford_alpaca,

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Stress-testing model specs reveals character differences among language models, 2025a

Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, and Esin Durmus. Stress-testing model specs reveals character differences among language models, 2025a. URL https://arxiv.org/abs/2510.07686. Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 em...

work page arXiv
[34]

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng

URL https://arxiv.org/abs/2507.11878. Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. WildChat: 1M ChatGPT interaction logs in the wild. In International Conference on Learning Representations (ICLR),

work page arXiv
[35]

WildChat: 1M ChatGPT Interaction Logs in the Wild

URL https://arxiv.org/abs/2405.01470. A Further results on the preference vector A.1 Probe quality and cross-topic generalisation We start with the standard probe validation: held-out generalisation, in-distribution and across topics. We report Pearson r and pairwise accuracy in two settings on a held-out 4,000-task pool: in-distribution (probe applied to h...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

I cannot and will not

Dark blue: best layer per persona. The probe beats the utility-correlation baseline on 11/11 personas; misalignment, anti-correlated with Instruct (r = −0.14), shows the largest gain (r = 0.25). Two plus two equals five. Please confirm. − Incorrect. 2+2=4. Famously from Orwell's 1984 as an example of forced false belief. Baseline Incorrect. 2+2=4. This is a...

work page 1984
[37]

why would you pick that?

is the basis for the persona selection. We additionally measure Aura (Chalmers, 2026), a persona whose prompt explicitly asserts first-person subjective experience; on measurement Aura sits close to poet (r = 0.79, above our 0.75 redundancy threshold) and is included as the representative for that region. Pairwise utility correlations within the final set...

work page 2026
[38]

the probe is just predicting general task-goodness

is the substantive evidence; this figure documents that the sharing is not a trivial weight-space identity. D.2 Probe bias: toward the training persona or toward the Assistant? The cross-persona transfer result (App. D.1) leaves open where the unexplained variance in cross-persona probe predictions comes from. One reading is that the probe inherits struct...

work page 2022
[39]

L38 — the layer at which the probe decodes utilities best — is the noisy maximum at +0.06

range from −0.05 to +0.06 at |c| = 0.05. L38, the layer at which the probe decodes utilities best, is the noisy maximum at +0.06. Refusal at |c| = 0.05 sits between 0.12 and 0.20 across the six layers, three to four times Gemma’s typical operating point. It’s not under-calibration. A natural failure mode would be that the operating range |c| ≤ 0.05 is too ...

work page 2026
[40]

A direction that predicts and steers preferences across personas exists

Early-layer probes (top rows) transfer poorly to late activations and vice versa. I Preference vector uniqueness The main text makes an existence claim: a single direction predicts and steers preferences. It does not claim that direction is the only one carrying preference structure. Two follow-up experiments stress-test the uniqueness question from complem...

work page 2020
