As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs
Pith reviewed 2026-05-25 05:07 UTC · model grok-4.3
The pith
Role prompts of the form 'As X, do Y' decompose linearly into persona and task directions at the prompt-to-answer transition in the residual stream.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
At the prompt-to-answer transition in an early/mid layer band, the residual stream for role prompts admits a clean linear decomposition in which persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect Δ_X and a pure task effect Δ_Y, then substituting h_BB + Δ_X + Δ_Y for the clean residual, yields downstream output within a small KL divergence of the clean run on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct across short and long persona grids. Yet injecting the cached additive prediction, or even the oracle clean residual h_XY, into a baseline host prompt with the persona text removed does not approach the clean long-persona target. The paper takes this to show that persona-conditioned multi-token generation flows through attention back to the persona-text positions, which no residual at a single site reproduces.
What carries the argument
The prompt-to-answer transition (last prompt token together with the first two generated tokens) in the residual stream of early/mid layers, where persona and task effects add partially orthogonally.
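The additive decomposition can be sketched in a few lines of NumPy. The definitions follow the paper passage quoted on this page (Δ_X = h_XB − h_BB, Δ_Y = h_BY − h_BB, Δ_XY = h_XY − h_BB, Inter = Δ_XY − Δ_X − Δ_Y). The function name `decompose`, the vector dimension, and the synthetic residuals are assumptions made for illustration; they are not the paper's data or code.

```python
import numpy as np

def decompose(h_bb, h_xb, h_by, h_xy):
    """Split a clean residual h_xy into persona, task, and interaction parts."""
    delta_x = h_xb - h_bb                    # pure persona effect
    delta_y = h_by - h_bb                    # pure task effect
    delta_xy = h_xy - h_bb                   # joint effect of persona and task
    interaction = delta_xy - delta_x - delta_y
    cosine = float(delta_x @ delta_y
                   / (np.linalg.norm(delta_x) * np.linalg.norm(delta_y)))
    return {
        "additive_prediction": h_bb + delta_x + delta_y,
        "interaction": interaction,
        "cosine_x_y": cosine,
        "relative_interaction": float(np.linalg.norm(interaction)
                                      / np.linalg.norm(delta_xy)),
    }

# Synthetic residuals (hypothetical): an almost additive case with small noise.
rng = np.random.default_rng(0)
d = 64
h_bb = rng.normal(size=d)
persona_dir = rng.normal(size=d)
task_dir = rng.normal(size=d)
h_xb = h_bb + persona_dir
h_by = h_bb + task_dir
h_xy = h_bb + persona_dir + task_dir + 0.05 * rng.normal(size=d)

parts = decompose(h_bb, h_xb, h_by, h_xy)
print(parts["cosine_x_y"], parts["relative_interaction"])
```

A small `relative_interaction` means the additive prediction is close to the clean residual at that site; a cosine near zero would correspond to nearly orthogonal persona and task directions.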
Load-bearing premise
The assumption that failure of residual injection to reproduce long-persona behavior is caused by missing attention back to persona-text positions rather than other factors such as normalization or prompt length.
What would settle it
Inject the oracle clean residual h_XY at the transition site into a no-persona baseline prompt and check whether the generated tokens match the persona-specific behavioral markers over multiple steps.
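The mechanism the paper proposes for the injection failure can be illustrated with a toy single-head attention step. This is a minimal sketch under strong assumptions, not the paper's experiment: the weights and residuals are random, there is one head and one layer, and the task-position residuals are held identical in both runs so that only the missing persona positions differ. It shows that a later query still attends over the persona positions in the clean run, which an injected residual at the last prompt position cannot supply. It does not rule out the alternative explanations raised under the load-bearing premise.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(query_vec, cache, W_q, W_k, W_v):
    """Single-head attention of one query over all cached residuals."""
    q = query_vec @ W_q
    K = cache @ W_k
    V = cache @ W_v
    weights = softmax(K @ q / np.sqrt(q.size))
    return weights @ V

rng = np.random.default_rng(0)
d = 16
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

persona = rng.normal(size=(6, d))    # residuals at persona-text positions
task = rng.normal(size=(4, d))       # residuals at task-text positions
h_xy_last = rng.normal(size=d)       # oracle clean residual at the last prompt token

# Clean run: persona positions are present in the cache.
clean_cache = np.vstack([persona, task, h_xy_last])
# Baseline host: persona text removed, oracle residual injected at the last position.
injected_cache = np.vstack([task, h_xy_last])

# Query from a later generated token, held equal in both runs for illustration.
next_query = rng.normal(size=d)
out_clean = attend(next_query, clean_cache, W_q, W_k, W_v)
out_injected = attend(next_query, injected_cache, W_q, W_k, W_v)
gap = float(np.linalg.norm(out_clean - out_injected))
print(gap)
```

In this toy the residual at the injection site is identical in both runs by construction, yet the attention output for the next token differs because the keys and values at the persona positions are absent from the host prompt.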
Original abstract
Role prompts of the form "As X, do Y" admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition (the last prompt token together with the first two generated tokens) in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect Δ_X, a pure task effect Δ_Y, and substituting h_BB + Δ_X + Δ_Y for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. We show it cannot. Injecting the cached additive prediction, or even the oracle clean residual h_XY, into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines role prompts of the form 'As X, do Y' in instruction-tuned LLMs. It claims that persona (X) and task (Y) effects admit a partially orthogonal additive decomposition at one site in the residual stream—the prompt-to-answer transition (last prompt token plus first two generated tokens) in an early/mid layer band. Pure effects Δ_X and Δ_Y can be formed such that substituting h_BB + Δ_X + Δ_Y for the clean residual produces downstream outputs within small KL divergence of the clean run on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and 48-cell long-persona grid, while preserving persona-specific markers. The authors further claim that this local additivity does not imply compressibility into a cached residual: even oracle injection of the clean h_XY (or the additive prediction) into a baseline host prompt lacking persona text fails to match long-persona behavior, which they attribute to attention back to persona-text positions throughout the prompt.
Significance. If the empirical measurements hold, the work identifies a concrete, localized site for additive steering of persona versus task contributions, which is useful for interpretability and fine-grained control. The negative result on compressibility usefully bounds the scope of activation arithmetic methods. Credit is due for testing the decomposition across multiple models and two grid sizes (short and long-persona), providing a falsifiable empirical claim rather than a purely theoretical derivation.
major comments (3)
- [Abstract / injection experiments section] Abstract and § on injection experiments: the central non-compressibility claim rests on the inference that failure of even oracle h_XY injection into the baseline host prompt is caused by attention back to persona-text positions. This requires that all other differences (layer-norm statistics, effective prompt length, KV-cache state, and precise injection site) between the clean long-persona run and the injected baseline run have been matched or ablated; no such controls or ablations are described.
- [Methods / experimental setup] Methods / experimental setup: the isolation procedure for Δ_X and Δ_Y is not specified (e.g., exact subtractions from which baselines, how the 12-cell and 48-cell grids are constructed, and whether the KL divergence is computed over full generated sequences or only the transition tokens). These details are load-bearing for verifying the reported small KL values and the orthogonality claim.
- [Results on long-persona grid] Results on long-persona grid: the claim that persona-specific behavioral markers are preserved under h_BB + Δ_X + Δ_Y substitution is central to the additivity result, yet no quantitative metric or table for marker preservation (beyond KL) is referenced.
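The second major comment turns on whether KL is computed over the full generated sequence or only over the transition tokens. The sketch below shows how the two readings differ in practice. It assumes KL(clean || patched) over next-token distributions, a direction the source does not state, and it uses random logits in place of real model outputs. The function names and the choice of three transition positions (last prompt token plus first two generated tokens) are illustrative.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_per_position(clean_logits, patched_logits):
    """KL(clean || patched) at each position; inputs have shape (positions, vocab)."""
    log_p = log_softmax(clean_logits)
    log_q = log_softmax(patched_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Hypothetical logits: 3 transition positions followed by 5 later positions.
rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 100))
patched = clean + 0.1 * rng.normal(size=(8, 100))

kl = kl_per_position(clean, patched)
kl_transition = float(kl[:3].mean())   # transition tokens only
kl_full = float(kl.mean())             # every scored position
print(kl_transition, kl_full)
```

Reporting both numbers side by side would make it clear whether a small KL at the transition also holds across the continuation.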
minor comments (1)
- [Notation / early sections] Notation: the manuscript uses h_BB, h_XY, Δ_X, and Δ_Y without an explicit equation defining their construction from the residual activations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each of the major comments point-by-point below.
- Referee: Abstract and § on injection experiments: the central non-compressibility claim rests on the inference that failure of even oracle h_XY injection into the baseline host prompt is caused by attention back to persona-text positions. This requires that all other differences (layer-norm statistics, effective prompt length, KV-cache state, and precise injection site) between the clean long-persona run and the injected baseline run have been matched or ablated; no such controls or ablations are described.
  Authors: We agree that the manuscript does not describe explicit controls or ablations for layer-norm statistics, effective prompt length, KV-cache state, and precise injection site. In the revised version, we will add a dedicated subsection with these controls and ablations to rigorously support that the failure is due to attention back to persona-text positions rather than these other differences.
  Revision: yes
- Referee: Methods / experimental setup: the isolation procedure for Δ_X and Δ_Y is not specified (e.g., exact subtractions from which baselines, how the 12-cell and 48-cell grids are constructed, and whether the KL divergence is computed over full generated sequences or only the transition tokens). These details are load-bearing for verifying the reported small KL values and the orthogonality claim.
  Authors: We agree that the isolation procedure for Δ_X and Δ_Y, the construction of the grids, and the exact scope of the KL computation are not fully detailed in the manuscript. In the revised version, we will expand the Methods section to fully specify these aspects, including the exact baselines used for subtractions, how the 12-cell and 48-cell grids were built, and whether KL is over full sequences or transition tokens.
  Revision: yes
- Referee: Results on long-persona grid: the claim that persona-specific behavioral markers are preserved under h_BB + Δ_X + Δ_Y substitution is central to the additivity result, yet no quantitative metric or table for marker preservation (beyond KL) is referenced.
  Authors: We acknowledge that while the manuscript states that persona-specific behavioral markers are preserved, it does not provide a quantitative metric or table beyond the KL divergence. In the revision, we will add a quantitative evaluation, including a table reporting the rate of preservation for specific markers across the long-persona grid conditions, comparing the additive substitution to the clean runs.
  Revision: yes
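One simple form the promised marker-preservation rate could take is sketched below. The source does not define the metric, so everything here is an assumption: the function name, the case-insensitive substring matching, and the made-up pirate-style outputs and markers, which are not taken from the paper's grids.

```python
def marker_preservation_rate(clean_outputs, substituted_outputs, markers):
    """Fraction of marker hits in clean outputs that also appear in the
    matching substituted output (case-insensitive substring match)."""
    hits = 0
    kept = 0
    for clean, sub in zip(clean_outputs, substituted_outputs):
        clean_l, sub_l = clean.lower(), sub.lower()
        for marker in markers:
            m = marker.lower()
            if m in clean_l:
                hits += 1
                if m in sub_l:
                    kept += 1
    return kept / hits if hits else float("nan")

# Hypothetical outputs for two grid cells (clean run vs. additive substitution).
clean_outputs = ["Arr, matey, the answer be 4.", "Ahoy! Sort the list first."]
substituted_outputs = ["Arr, the answer is 4.", "First, sort the list."]
markers = ["arr", "matey", "ahoy"]

rate = marker_preservation_rate(clean_outputs, substituted_outputs, markers)
print(rate)
```

Substring matching is the crudest possible choice; a real evaluation would need marker definitions per persona and a decision on how to treat markers that appear only in the substituted run.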
Circularity Check
No significant circularity; empirical measurements on concrete models
The paper reports direct experimental results on activation patching, residual injection, and KL divergence across fixed model grids (Gemma-2-2B-IT, Qwen variants) and prompt sets. No equations are presented that derive a quantity from itself, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or imported uniqueness theorems. The additivity observation and the injection-failure observation are both stated as outcomes of the same measurement protocol rather than one being presupposed by the other. The work therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Residual stream activations at the prompt-to-answer transition can be decomposed into additive persona and task components
invented entities (1)
- Δ_X (pure persona effect) and Δ_Y (pure task effect): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: Role prompts of the form "As X, do Y" combine a persona specification X with a task specification Y... Δ_X = h_XB − h_BB, Δ_Y = h_BY − h_BB, Δ_XY = h_XY − h_BB, Inter = Δ_XY − Δ_X − Δ_Y
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: substituting h_BB + Δ_X + Δ_Y for the clean residual yields downstream output within a small KL of clean
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.