arxiv: 2605.23147 · v1 · pith:KBCLMTBInew · submitted 2026-05-22 · 💻 cs.CL · cs.AI

As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

Pith reviewed 2026-05-25 05:07 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: role prompts, persona effects, task decomposition, residual stream, linear additivity, attention mechanisms, LLM steering, instruction tuning

The pith

Role prompts of the form 'As X, do Y' decompose linearly into persona and task directions at the prompt-to-answer transition in the residual stream.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in instruction-tuned LLMs, prompts like 'As X, do Y' separate into partially orthogonal additive effects from the persona X and task Y at the prompt-to-answer transition in early to mid layers of the residual stream. Substituting a baseline residual plus these isolated effects produces outputs close in KL divergence to the original while keeping persona markers. This local structure does not allow the full role prompt to be replaced by one cached vector, because generation still depends on attention back to the persona text positions throughout the prompt. A sympathetic reader would care because the finding separates where simple activation edits can control behavior from where distributed prompt mechanisms remain necessary.
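
The closeness measure above can be made concrete with a small sketch. It computes the KL divergence between two next-token distributions from their logits. The direction KL(clean || patched), the function names, and the toy logits are assumptions for illustration; the page does not state which direction or token scope the paper uses.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_divergence(clean_logits, patched_logits):
    """KL(clean || patched) over the vocabulary axis, in nats (assumed direction)."""
    log_p = log_softmax(np.asarray(clean_logits, dtype=float))
    log_q = log_softmax(np.asarray(patched_logits, dtype=float))
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Made-up logits over a 4-token vocabulary, not taken from any model.
clean = np.array([2.0, 1.0, 0.1, -1.0])
patched = np.array([1.8, 1.1, 0.0, -0.9])

print(kl_divergence(clean, clean))    # identical distributions give 0
print(kl_divergence(clean, patched))  # small positive value
```

A "small KL of clean" then means this quantity stays near zero when the substituted run is compared against the unmodified one.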

Core claim

At the prompt-to-answer transition in an early/mid layer band, the residual stream for role prompts admits a clean linear decomposition where persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect Δ_X, a pure task effect Δ_Y, and substituting h_BB + Δ_X + Δ_Y for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct across short and long persona grids. Yet injecting the cached additive prediction or even the oracle clean residual h_XY into a baseline host prompt with the persona text removed does not approach the clean long-persona target, showing that persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces.
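
A minimal sketch of the additive substitution follows, under a loudly stated assumption: the page does not say how Δ_X and Δ_Y are isolated, so the code takes them as differences from the shared baseline residual h_BB. The single-factor residuals h_XB and h_BY are hypothetical names, and all vectors are synthetic stand-ins rather than real activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Synthetic stand-ins for residuals at the transition site (not real activations).
h_BB = rng.normal(size=d_model)                # baseline persona, baseline task
persona_dir = rng.normal(size=d_model)
task_dir = rng.normal(size=d_model)
interaction = 0.05 * rng.normal(size=d_model)  # small non-additive remainder

h_XB = h_BB + persona_dir                      # persona X, baseline task (hypothetical cell)
h_BY = h_BB + task_dir                         # baseline persona, task Y (hypothetical cell)
h_XY = h_BB + persona_dir + task_dir + interaction  # clean 'As X, do Y' residual

# Assumed isolation step: subtract the shared baseline from each single-factor cell.
delta_X = h_XB - h_BB
delta_Y = h_BY - h_BB
h_add = h_BB + delta_X + delta_Y               # additive prediction substituted for h_XY

cosine = delta_X @ delta_Y / (np.linalg.norm(delta_X) * np.linalg.norm(delta_Y))
rel_err = np.linalg.norm(h_XY - h_add) / np.linalg.norm(h_XY)
print(f"cos(delta_X, delta_Y) = {cosine:.3f}")
print(f"relative additivity error = {rel_err:.3f}")
```

In this toy setup the cosine plays the role of the "partially orthogonal" check and the relative error measures how much of h_XY the additive prediction misses. With real activations both numbers would have to be measured per layer and per grid cell.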

What carries the argument

The prompt-to-answer transition (last prompt token together with the first two generated tokens) in the residual stream of early/mid layers, where persona and task effects add partially orthogonally.

Load-bearing premise

The assumption that failure of residual injection to reproduce long-persona behavior is caused by missing attention back to persona-text positions rather than other factors such as normalization or prompt length.

What would settle it

Inject the oracle clean residual h_XY at the transition site into a no-persona baseline prompt and check whether the generated tokens match the persona-specific behavioral markers over multiple steps.
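
The logic of this test can be illustrated with a toy two-layer attention model with random weights. It is only a sketch of the proposed mechanism, not the paper's experiment: the model, the dimensions, the shared (teacher-forced) answer tokens, and the L2 gap used in place of KL are all assumptions. The oracle residual from the clean run is written into the transition site of a host prompt that lacks the persona positions, and the layer-2 outputs at the site and at the following positions are compared with the clean run.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def make_layer():
    # Random query/key/value maps for one attention head (toy weights).
    return [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]

def attend(layer, resid):
    """Causal single-head attention with a residual connection."""
    Wq, Wk, Wv = layer
    q, k, v = resid @ Wq, resid @ Wk, resid @ Wv
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # block attention to later positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return resid + weights @ v

layer1, layer2 = make_layer(), make_layer()

persona = rng.normal(size=(6, d))  # persona-text positions
task = rng.normal(size=(4, d))     # task text; its last row is the transition site
answer = rng.normal(size=(2, d))   # first generated tokens, shared by all runs

def run(prompt, inject=None):
    seq = np.vstack([prompt, answer])
    site = len(prompt) - 1
    mid = attend(layer1, seq)       # residual stream between the two layers
    if inject is not None:
        mid[site] = inject          # overwrite the residual at the transition site
    return mid[site].copy(), attend(layer2, mid)[site:]

clean_mid, clean_out = run(np.vstack([persona, task]))
_, host_out = run(task)                        # persona text removed
_, injected_out = run(task, inject=clean_mid)  # oracle residual injected into host

gap_host = np.linalg.norm(clean_out - host_out, axis=-1)
gap_injected = np.linalg.norm(clean_out - injected_out, axis=-1)
print("gap without injection:    ", gap_host.round(3))
print("gap with oracle injection:", gap_injected.round(3))
```

In the toy, the injected run still differs from the clean run at the site and afterwards, because layer 2 and the later positions can only attend to keys and values that exist in the host prompt. Whether the same explanation holds in the real models is exactly what the proposed test would have to establish.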

Figures

Figures reproduced from arXiv: 2605.23147 by Eric Xu.

Figure 1: Median causal KL under additive substitution at ...

Abstract

Role prompts of the form 'As X, do Y' admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition (the last prompt token together with the first two generated tokens) in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect Δ_X, a pure task effect Δ_Y, and substituting h_BB + Δ_X + Δ_Y for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. We show it cannot. Injecting the cached additive prediction, or even the oracle clean residual h_XY, into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript examines role prompts of the form 'As X, do Y' in instruction-tuned LLMs. It claims that persona (X) and task (Y) effects admit a partially orthogonal additive decomposition at one site in the residual stream—the prompt-to-answer transition (last prompt token plus first two generated tokens) in an early/mid layer band. Pure effects Δ_X and Δ_Y can be formed such that substituting h_BB + Δ_X + Δ_Y for the clean residual produces downstream outputs within small KL divergence of the clean run on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and 48-cell long-persona grid, while preserving persona-specific markers. The authors further claim that this local additivity does not imply compressibility into a cached residual: even oracle injection of the clean h_XY (or the additive prediction) into a baseline host prompt lacking persona text fails to match long-persona behavior, which they attribute to attention back to persona-text positions throughout the prompt.

Significance. If the empirical measurements hold, the work identifies a concrete, localized site for additive steering of persona versus task contributions, which is useful for interpretability and fine-grained control. The negative result on compressibility usefully bounds the scope of activation arithmetic methods. Credit is due for testing the decomposition across multiple models and two grid sizes (short and long-persona), providing a falsifiable empirical claim rather than a purely theoretical derivation.

major comments (3)
  1. [Abstract / injection experiments section] The central non-compressibility claim rests on the inference that failure of even oracle h_XY injection into the baseline host prompt is caused by attention back to persona-text positions. This requires that all other differences (layer-norm statistics, effective prompt length, KV-cache state, and precise injection site) between the clean long-persona run and the injected baseline run have been matched or ablated; no such controls or ablations are described.
  2. [Methods / experimental setup] The isolation procedure for Δ_X and Δ_Y is not specified (e.g., exact subtractions from which baselines, how the 12-cell and 48-cell grids are constructed, and whether the KL divergence is computed over full generated sequences or only the transition tokens). These details are load-bearing for verifying the reported small KL values and the orthogonality claim.
  3. [Results on long-persona grid] The claim that persona-specific behavioral markers are preserved under h_BB + Δ_X + Δ_Y substitution is central to the additivity result, yet no quantitative metric or table for marker preservation (beyond KL) is referenced.
minor comments (1)
  1. [Notation / early sections] The manuscript uses h_BB, h_XY, Δ_X, and Δ_Y without an explicit equation defining their construction from the residual activations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each of the major comments point-by-point below.

  1. Referee: Abstract and § on injection experiments: the central non-compressibility claim rests on the inference that failure of even oracle h_XY injection into the baseline host prompt is caused by attention back to persona-text positions. This requires that all other differences (layer-norm statistics, effective prompt length, KV-cache state, and precise injection site) between the clean long-persona run and the injected baseline run have been matched or ablated; no such controls or ablations are described.

    Authors: We agree that the manuscript does not describe explicit controls or ablations for layer-norm statistics, effective prompt length, KV-cache state, and precise injection site. In the revised version, we will add a dedicated subsection with these controls and ablations to rigorously support that the failure is due to attention back to persona-text positions rather than these other differences. revision: yes

  2. Referee: Methods / experimental setup: the isolation procedure for Δ_X and Δ_Y is not specified (e.g., exact subtractions from which baselines, how the 12-cell and 48-cell grids are constructed, and whether the KL divergence is computed over full generated sequences or only the transition tokens). These details are load-bearing for verifying the reported small KL values and the orthogonality claim.

    Authors: We agree that the isolation procedure for Δ_X and Δ_Y, the construction of the grids, and the exact scope of the KL computation are not fully detailed in the manuscript. In the revised version, we will expand the Methods section to fully specify these aspects, including the exact baselines used for subtractions, how the 12-cell and 48-cell grids were built, and whether KL is over full sequences or transition tokens. revision: yes

  3. Referee: Results on long-persona grid: the claim that persona-specific behavioral markers are preserved under h_BB + Δ_X + Δ_Y substitution is central to the additivity result, yet no quantitative metric or table for marker preservation (beyond KL) is referenced.

    Authors: We acknowledge that while the manuscript states that persona-specific behavioral markers are preserved, it does not provide a quantitative metric or table beyond the KL divergence. In the revision, we will add a quantitative evaluation, including a table reporting the rate of preservation for specific markers across the long-persona grid conditions, comparing the additive substitution to the clean runs. revision: yes
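
A marker-preservation rate of the kind promised in the third response could be computed along the following lines. This is a hedged sketch: the substring-matching rule, the function name, the marker list, and the example continuations are all hypothetical and do not come from the paper.

```python
def marker_rate(continuations, markers):
    """Fraction of continuations containing at least one marker (case-insensitive)."""
    if not continuations:
        return 0.0
    lowered = [m.lower() for m in markers]
    hits = sum(any(m in text.lower() for m in lowered) for text in continuations)
    return hits / len(continuations)

# Hypothetical pirate-persona markers and made-up continuations.
markers = ["arr", "matey", "ye "]
clean_runs = ["Arr, the answer be 4.", "Ahoy matey, it is 4.", "The answer is 4."]
substituted_runs = ["Arr, it be 4.", "The answer is 4.", "It is 4."]

clean_rate = marker_rate(clean_runs, markers)
substituted_rate = marker_rate(substituted_runs, markers)
print(f"clean: {clean_rate:.2f}, substituted: {substituted_rate:.2f}")
print(f"preserved fraction: {substituted_rate / clean_rate:.2f}")
```

Reporting the substituted rate next to the clean rate for each grid cell would give the table the referee asks for. Plain substring matching is a crude choice, and a real evaluation would probably need a more careful marker definition.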

Circularity Check

0 steps flagged

No significant circularity; empirical measurements on concrete models

The paper reports direct experimental results on activation patching, residual injection, and KL divergence across fixed model grids (Gemma-2-2B-IT, Qwen variants) and prompt sets. No equations are presented that derive a quantity from itself, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or imported uniqueness theorems. The additivity observation and the injection-failure observation are both stated as outcomes of the same measurement protocol rather than one being presupposed by the other. The work therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the domain assumption that residual-stream representations behave additively at the chosen site and on the empirical observation that attention continues to reference persona tokens; no free parameters are explicitly fitted in the abstract, and the directions Δ_X and Δ_Y are constructed rather than postulated as new physical entities.

axioms (1)
  • domain assumption: Residual stream activations at the prompt-to-answer transition can be decomposed into additive persona and task components.
    Invoked when forming Δ_X and Δ_Y and substituting their sum for the clean residual.
invented entities (1)
  • Δ_X (pure persona effect) and Δ_Y (pure task effect): no independent evidence
    purpose: Directions extracted to isolate persona and task contributions.
    Constructed from the linear decomposition at the transition site; no independent falsifiable prediction outside the reported experiments is given.

pith-pipeline@v0.9.0 · 5824 in / 1643 out tokens · 32801 ms · 2026-05-25T05:07:45.600078+00:00 · methodology


