As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

Eric Xu

arxiv: 2605.23147 · v1 · pith:KBCLMTBInew · submitted 2026-05-22 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-25 05:07 UTC · model grok-4.3

keywords: role prompts · persona effects · task decomposition · residual stream · linear additivity · attention mechanisms · LLM steering · instruction tuning

The pith

Role prompts of the form 'As X, do Y' decompose linearly into persona and task directions at the prompt-to-answer transition in the residual stream.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in instruction-tuned LLMs, prompts like 'As X, do Y' separate into partially orthogonal additive effects from the persona X and task Y at the prompt-to-answer transition in early to mid layers of the residual stream. Substituting a baseline residual plus these isolated effects produces outputs close in KL divergence to the original while keeping persona markers. This local structure does not allow the full role prompt to be replaced by one cached vector, because generation still depends on attention back to the persona text positions throughout the prompt. A sympathetic reader would care because the finding separates where simple activation edits can control behavior from where distributed prompt mechanisms remain necessary.

Core claim

At the prompt-to-answer transition in an early/mid layer band, the residual stream for role prompts admits a clean linear decomposition in which persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect Δ_X, a pure task effect Δ_Y, and substituting h_BB + Δ_X + Δ_Y for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct across short and long persona grids. Yet injecting the cached additive prediction, or even the oracle clean residual h_XY, into a baseline host prompt with the persona text removed does not approach the clean long-persona target, showing that persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt.
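The additive substitution in the core claim can be sketched with small hand-made vectors standing in for residual activations. This is only an illustration: the variable names, the reading of "B" as a baseline slot for persona or task, and every number below are assumptions, and no model is loaded.

```python
import math

def sub(a, b):
    """Elementwise a - b."""
    return [x - y for x, y in zip(a, b)]

def add(*vectors):
    """Elementwise sum of any number of vectors."""
    return [sum(components) for components in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# Toy residuals at the transition site (invented values, not measurements).
h_bb = [0.0, 0.0, 1.0, 0.0]  # baseline persona, baseline task
h_xb = [1.0, 0.2, 1.0, 0.0]  # persona X, baseline task
h_by = [0.2, 1.0, 1.0, 0.0]  # baseline persona, task Y
h_xy = [1.2, 1.2, 1.0, 0.1]  # full role prompt "As X, do Y"

delta_x = sub(h_xb, h_bb)            # isolated persona effect
delta_y = sub(h_by, h_bb)            # isolated task effect
h_add = add(h_bb, delta_x, delta_y)  # additive prediction for h_xy
interaction = sub(h_xy, h_add)       # what additivity fails to explain

# A cosine between 0 and 1 is one way to read "partially orthogonal".
print("cos(delta_x, delta_y):", round(cosine(delta_x, delta_y), 4))
# Size of the interaction term relative to the full effect h_xy - h_bb.
print("relative interaction:", round(norm(interaction) / norm(sub(h_xy, h_bb)), 4))
```

In a real experiment the four residuals would be read from the model at the same layer and token position, and the additive prediction would then be patched in place of the clean residual before measuring the downstream KL.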

What carries the argument

The prompt-to-answer transition (last prompt token together with the first two generated tokens) in the residual stream of early/mid layers, where persona and task effects add partially orthogonally.

Load-bearing premise

The assumption that failure of residual injection to reproduce long-persona behavior is caused by missing attention back to persona-text positions rather than other factors such as normalization or prompt length.

What would settle it

Inject the oracle clean residual h_XY at the transition site into a no-persona baseline prompt and check whether the generated tokens match the persona-specific behavioral markers over multiple steps.
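Why a single injected residual might not be enough can be illustrated with a toy single-head attention step. This is not the paper's experiment: the identity key/value projections, the three-dimensional vectors, and the query are all invented for illustration, and the toy ignores the fact that task-token residuals would themselves change when the persona text is present.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, cache):
    """Single-head attention where keys and values are the cached vectors
    themselves (identity projections, a toy assumption)."""
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in cache]
    weights = softmax(scores)
    return [sum(w * vec[d] for w, vec in zip(weights, cache))
            for d in range(len(query))]

# Invented position vectors: two persona tokens, one task token,
# and the oracle clean residual at the transition position.
persona_positions = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
task_positions = [[0.0, 1.0, 0.0]]
h_xy = [0.3, 0.3, 0.4]

clean_cache = persona_positions + task_positions + [h_xy]
# Host prompt: persona text removed, oracle residual injected at the transition.
host_cache = task_positions + [h_xy]

# A later generated token queries the cache; the transition entry is identical
# in both runs, but the persona positions exist only in the clean run.
query = [1.0, 0.0, 0.5]
clean_out = attend(query, clean_cache)
host_out = attend(query, host_cache)
gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(clean_out, host_out)))
print("attention output gap:", round(gap, 4))
```

The gap is nonzero even though the injected position matches exactly, which is the shape of the argument the page attributes to the paper; whether this is the operative cause in the real models is what the proposed test would have to establish.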

Figures

Figures reproduced from arXiv: 2605.23147 by Eric Xu.

Original abstract

Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition -- the last prompt token together with the first two generated tokens -- in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect $\Delta_X$, a pure task effect $\Delta_Y$, and substituting $h_{BB} + \Delta_X + \Delta_Y$ for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-\{1.5B, 3B\}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. \emph{We show it cannot.} Injecting the cached additive prediction -- or even the oracle clean residual $h_{XY}$ -- into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows local linear additivity for persona and task at the prompt-to-answer transition but demonstrates that this does not allow residual caching of the full role prompt.

The main point is that role prompts decompose additively at one narrow site in the residual stream, yet the model still needs the original persona text for multi-token generation. The additive combo of pure persona and task vectors stays close in KL to the clean run on Gemma-2-2B-IT and the two Qwen models, across both the 12-cell and 48-cell grids, and keeps the behavioral markers. Even the oracle residual from the full prompt fails when injected into a stripped baseline prompt. That negative result is the sharper part of the work. It directly limits how far activation arithmetic can go for steering these prompts. The paper is clear that local additivity at the transition does not imply prompt compressibility, and it ties the failure to attention back to the persona positions. This is useful for anyone trying to separate or control persona effects without rewriting the whole prompt. The experiments use concrete models and grids rather than abstract claims, which is a plus. The soft spot is the lack of detail on how the deltas are isolated, whether prompt length and layer-norm statistics are matched in the injection runs, and exactly how the KL is computed over sequences. The stress-test concern about unmeasured confounds in the non-compressibility claim is reasonable to check in the full methods. If those controls are missing or weak, the attribution to attention could be overstated. This is for readers working on activation steering and interpretability of instruct models. The empirical site and the limit on caching are specific enough to deserve referee time, even if the paper needs more experimental transparency.

Referee Report

3 major / 1 minor

Summary. The manuscript examines role prompts of the form 'As X, do Y' in instruction-tuned LLMs. It claims that persona (X) and task (Y) effects admit a partially orthogonal additive decomposition at one site in the residual stream—the prompt-to-answer transition (last prompt token plus first two generated tokens) in an early/mid layer band. Pure effects Δ_X and Δ_Y can be formed such that substituting h_BB + Δ_X + Δ_Y for the clean residual produces downstream outputs within small KL divergence of the clean run on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and 48-cell long-persona grid, while preserving persona-specific markers. The authors further claim that this local additivity does not imply compressibility into a cached residual: even oracle injection of the clean h_XY (or the additive prediction) into a baseline host prompt lacking persona text fails to match long-persona behavior, which they attribute to attention back to persona-text positions throughout the prompt.

Significance. If the empirical measurements hold, the work identifies a concrete, localized site for additive steering of persona versus task contributions, which is useful for interpretability and fine-grained control. The negative result on compressibility usefully bounds the scope of activation arithmetic methods. Credit is due for testing the decomposition across multiple models and two grid sizes (short and long-persona), providing a falsifiable empirical claim rather than a purely theoretical derivation.

major comments (3)

[Abstract / injection experiments section] Abstract and § on injection experiments: the central non-compressibility claim rests on the inference that failure of even oracle h_XY injection into the baseline host prompt is caused by attention back to persona-text positions. This requires that all other differences (layer-norm statistics, effective prompt length, KV-cache state, and precise injection site) between the clean long-persona run and the injected baseline run have been matched or ablated; no such controls or ablations are described.
[Methods / experimental setup] Methods / experimental setup: the isolation procedure for Δ_X and Δ_Y is not specified (e.g., exact subtractions from which baselines, how the 12-cell and 48-cell grids are constructed, and whether the KL divergence is computed over full generated sequences or only the transition tokens). These details are load-bearing for verifying the reported small KL values and the orthogonality claim.
[Results on long-persona grid] Results on long-persona grid: the claim that persona-specific behavioral markers are preserved under h_BB + Δ_X + Δ_Y substitution is central to the additivity result, yet no quantitative metric or table for marker preservation (beyond KL) is referenced.
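The second major comment turns on the scope of the KL computation. A minimal sketch of the two options, assuming per-position next-token logits are available for a clean run and a patched run; the KL direction (clean relative to patched), the averaging, and the logits below are assumptions made for illustration, not the paper's protocol.

```python
import math

def log_softmax(logits):
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(clean_logits, patched_logits):
    """KL(clean || patched) at one position, in nats."""
    log_p = log_softmax(clean_logits)
    log_q = log_softmax(patched_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(clean_seq, patched_seq, positions=None):
    """Average per-position KL over the chosen positions (all if None)."""
    idx = range(len(clean_seq)) if positions is None else positions
    values = [kl_divergence(clean_seq[i], patched_seq[i]) for i in idx]
    return sum(values) / len(values)

# Invented logits over a 3-token vocabulary at 4 positions.
clean = [[2.0, 0.5, -1.0], [1.0, 1.0, 0.0], [0.0, 2.0, 0.5], [1.5, -0.5, 0.0]]
patched = [[2.0, 0.5, -1.0], [1.1, 0.9, 0.0], [0.0, 1.8, 0.6], [0.0, 1.0, 1.0]]

# Option A: only the three transition positions
# (last prompt token plus the first two generated tokens).
kl_transition = mean_kl(clean, patched, positions=[0, 1, 2])
# Option B: every position in the continuation.
kl_full = mean_kl(clean, patched)
print("transition-only KL:", round(kl_transition, 4))
print("full-sequence KL:", round(kl_full, 4))
```

In this toy the two numbers differ because the divergence grows at the later position, which is why the referee's question about scope matters for interpreting a "small KL".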

minor comments (1)

[Notation / early sections] Notation: the manuscript uses h_BB, h_XY, Δ_X, and Δ_Y without an explicit equation defining their construction from the residual activations.
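For reference, the paper passage quoted in the Lean-theorem section of this page does give the constructions. Written out in LaTeX they read as follows; the symbol $\hat{h}_{XY}$ for the additive prediction is introduced here for convenience, and reading the subscript B as the baseline slot for persona or task is an assumption.

```latex
\begin{align*}
\Delta_X &= h_{XB} - h_{BB}, &
\Delta_Y &= h_{BY} - h_{BB}, \\
\Delta_{XY} &= h_{XY} - h_{BB}, &
\mathrm{Inter} &= \Delta_{XY} - \Delta_X - \Delta_Y, \\
\hat{h}_{XY} &= h_{BB} + \Delta_X + \Delta_Y = h_{XY} - \mathrm{Inter}.
\end{align*}
```

The last line follows by substituting the first three definitions, so the additive prediction matches the clean residual exactly when the interaction term vanishes.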

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each of the major comments point-by-point below.

Referee: Abstract and § on injection experiments: the central non-compressibility claim rests on the inference that failure of even oracle h_XY injection into the baseline host prompt is caused by attention back to persona-text positions. This requires that all other differences (layer-norm statistics, effective prompt length, KV-cache state, and precise injection site) between the clean long-persona run and the injected baseline run have been matched or ablated; no such controls or ablations are described.

Authors: We agree that the manuscript does not describe explicit controls or ablations for layer-norm statistics, effective prompt length, KV-cache state, and precise injection site. In the revised version, we will add a dedicated subsection with these controls and ablations to rigorously support that the failure is due to attention back to persona-text positions rather than these other differences.

Revision: yes

Referee: Methods / experimental setup: the isolation procedure for Δ_X and Δ_Y is not specified (e.g., exact subtractions from which baselines, how the 12-cell and 48-cell grids are constructed, and whether the KL divergence is computed over full generated sequences or only the transition tokens). These details are load-bearing for verifying the reported small KL values and the orthogonality claim.

Authors: We agree that the isolation procedure for Δ_X and Δ_Y, the construction of the grids, and the exact scope of the KL computation are not fully detailed in the manuscript. In the revised version, we will expand the Methods section to fully specify these aspects, including the exact baselines used for subtractions, how the 12-cell and 48-cell grids were built, and whether KL is over full sequences or transition tokens.

Revision: yes

Referee: Results on long-persona grid: the claim that persona-specific behavioral markers are preserved under h_BB + Δ_X + Δ_Y substitution is central to the additivity result, yet no quantitative metric or table for marker preservation (beyond KL) is referenced.

Authors: We acknowledge that while the manuscript states that persona-specific behavioral markers are preserved, it does not provide a quantitative metric or table beyond the KL divergence. In the revision, we will add a quantitative evaluation, including a table reporting the rate of preservation for specific markers across the long-persona grid conditions, comparing the additive substitution to the clean runs.

Revision: yes
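One simple form the promised marker-preservation metric could take is a hit rate over generations. The sketch below is hypothetical throughout: the marker strings, the example generations, and the substring-matching rule are invented for illustration and are not taken from the paper.

```python
def marker_rate(generations, markers):
    """Fraction of generations containing at least one marker string
    (case-insensitive substring match)."""
    if not generations:
        return 0.0
    lowered = [m.lower() for m in markers]
    hits = sum(any(m in text.lower() for m in lowered) for text in generations)
    return hits / len(generations)

# Hypothetical markers for a hypothetical pirate persona.
pirate_markers = ["ahoy", "matey"]

# Invented continuations from clean runs and from additive-substitution runs.
clean_runs = [
    "Ahoy! Sort the list first.",
    "Matey, use a loop.",
    "Ahoy, check the index.",
]
additive_runs = [
    "Ahoy! Sort the list first.",
    "Use a loop.",
    "Check the index, matey.",
]

clean_rate = marker_rate(clean_runs, pirate_markers)
additive_rate = marker_rate(additive_runs, pirate_markers)
# Preservation relative to the clean runs; left undefined if clean never hits.
preservation = additive_rate / clean_rate if clean_rate > 0 else None
print("clean:", clean_rate, "additive:", round(additive_rate, 3))
```

A table of such rates per persona and per grid cell would give the quantitative comparison the referee asks for, although substring matching is a crude proxy and a real evaluation might need a more careful marker definition.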

Circularity Check

0 steps flagged

No significant circularity; empirical measurements on concrete models

The paper reports direct experimental results on activation patching, residual injection, and KL divergence across fixed model grids (Gemma-2-2B-IT, Qwen variants) and prompt sets. No equations are presented that derive a quantity from itself, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or imported uniqueness theorems. The additivity observation and the injection-failure observation are both stated as outcomes of the same measurement protocol rather than one being presupposed by the other. The work therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the domain assumption that residual-stream representations behave additively at the chosen site and on the empirical observation that attention continues to reference persona tokens; no free parameters are explicitly fitted in the abstract, and the directions Δ_X and Δ_Y are constructed rather than postulated as new physical entities.

axioms (1)

Domain assumption: residual-stream activations at the prompt-to-answer transition can be decomposed into additive persona and task components.
Invoked when forming Δ_X and Δ_Y and substituting their sum for the clean residual.

invented entities (1)

Δ_X (pure persona effect) and Δ_Y (pure task effect) · no independent evidence
Purpose: directions extracted to isolate persona and task contributions.
Constructed from the linear decomposition at the transition site; no independent falsifiable prediction outside the reported experiments is given.

pith-pipeline@v0.9.0 · 5824 in / 1643 out tokens · 32801 ms · 2026-05-25T05:07:45.600078+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Role prompts of the form “As X, do Y” combine a persona specification X with a task specification Y... ΔX = h_XB − h_BB, ΔY = h_BY − h_BB, ΔXY = h_XY − h_BB, Inter = ΔXY − ΔX − ΔY

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: substituting h_BB + ΔX + ΔY for the clean residual yields downstream output within a small KL of clean

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 2 internal anchors

[1] Anthropic Interpretability Team. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.
[2] Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors. In Findings of EMNLP, 2023.
[3] Evan Hernandez, Arnab S. Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. In International Conference on Learning Representations, 2024.
[4] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In International Conference on Learning Representations, 2023.
[5] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of EMNLP, 2021.
[6] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of ACL, 2021.
[7] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 2022.
[8] Eric Todd, Millicent L. Li, Arnab S. Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function vectors in large language models. In International Conference on Learning Representations, 2024.
[9] Alexander M. Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023. (Internal anchor; listed on Pith as "Steering Language Models With Activation Engineering".)
[10] Andy Zou, Long Phan, Sarah Chen, James Campbell, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023. (Internal anchor.)
