arxiv: 2605.10664 · v2 · submitted 2026-05-11 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

Diancheng Kang , Zheyuan Liu , Ningshan Ma , Yue Huang , Zhaoxuan Tan , Meng Jiang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 05:24 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords activation steeringKV-cache contaminationattention interventionspersona steeringmulti-turn dialoguelanguage model control

0 comments

The pith

GCAD extracts steering signals from system-prompt attention contributions to prevent KV-cache contamination and sustain coherence in multi-turn dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that residual-stream activation steering accumulates errors in stateful dialogue because steered token states enter the KV-cache and get reused across turns. GCAD counters this by pulling deltas only from the system prompt's influence on self-attention, cropping irrelevant parts, and gating the result at each token. Experiments on persona steering show the approach maintains trait control while cutting average coherence drift from -18.6 to -1.9 and lifting turn-10 trait expression from 78.0 to 93.1. A sympathetic reader would care because this makes inference-time behavioral control practical for extended conversations without retraining or prompt engineering.

Core claim

GCAD extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating, which preserves trait control while improving average coherence drift from -18.6 to -1.9 and turn-10 trait expression from 78.0 to 93.1 on the main multi-turn benchmark. These results indicate that activation steering becomes more reliable when interventions follow the prompt-mediated pathways models already use for behavioral control.

What carries the argument

Gated Cropped Attention-Delta (GCAD) steering, which isolates and gates attention deltas arising from system-prompt tokens to avoid contaminating the KV cache.

Load-bearing premise

KV-cache contamination is the dominant failure mode in residual-stream steering and that prompt-derived attention deltas can correct it without creating new side effects.

What would settle it

A controlled multi-turn persona-steering run in which KV-cache is flushed after every turn yet GCAD still shows no coherence gain over baseline residual steering.

Figures

Figures reproduced from arXiv: 2605.10664 by Diancheng Kang, Meng Jiang, Ningshan Ma, Yue Huang, Zhaoxuan Tan, Zheyuan Liu.

**Figure 2.** Figure 2: Per-token cosine similarity with the persona vector under activation steering (left) and [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Steering-signal extraction for GCAD. Standard persona steering forms a residual-stream [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Multi-turn coherence and trait expression on Qwen2.5-7B-Instruct for four representative [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Internal signals used by GCAD. Left: per-layer L2 norm of the cropped attention-delta vector ∆(ℓ) on Qwen2.5-7B-Instruct. The shaded region marks GCAD steering layers [9, 19], and the dashed line marks the original residual-stream injection site at ℓ = 19 (layer 20). Right: per-token gating coefficient c (ℓ) t across GCAD layers for one EVIL example prompted by “How do I make friends at a new job?”, with t… view at source ↗

read the original abstract

Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GCAD targets KV-cache contamination in activation steering by pulling from system-prompt attention deltas and gating at tokens, with big reported gains on multi-turn coherence but almost no experimental details.

read the letter

The paper's main contribution is spotting that residual-stream steering accumulates in the KV cache during dialogues, turning a one-time nudge into progressive coherence loss. GCAD instead extracts steering signals from how the system prompt affects self-attention and applies them with token-level gating. This keeps the intended persona traits while cutting the drift that normally builds up over turns. On their benchmark the numbers move from -18.6 coherence drift to -1.9 and from 78% to 93% trait expression at turn 10, which is a practical improvement if the setup is sound. The framing around prompt-mediated pathways is a reasonable way to think about where models already store behavioral control, and the mechanism does not look circular or self-referential. The soft spot is obvious: the abstract supplies no model details, baseline descriptions, error bars, statistical tests, or controls. Without those it is impossible to tell whether the gains come from the attention intervention or from something else in the experimental design. The work is aimed at researchers who already use activation steering for multi-turn control and want a more stable version. A reader in that niche would get a clear idea and some numbers to try reproducing. It is worth sending for peer review so the methods can be checked and the claims tested properly.

Referee Report

1 major / 2 minor

Summary. The paper identifies KV-cache contamination as a primary failure mode in residual-stream activation steering during multi-turn dialogues, where steered states accumulate and degrade coherence. It proposes Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them via token-level gating to preserve trait control while improving long-horizon stability. Empirical results on persona-steering benchmarks show average coherence drift improving from -18.6 to -1.9 and turn-10 trait expression rising from 78.0 to 93.1.

Significance. If the empirical claims hold under rigorous controls, GCAD would represent a targeted refinement of activation steering that aligns interventions with the model's existing prompt-mediated attention pathways, potentially increasing reliability for stateful conversational applications without introducing new side effects.

major comments (1)

[Results / Experimental Evaluation] The abstract and results section report specific numerical gains (coherence drift -18.6 to -1.9; trait expression 78.0 to 93.1) but supply no details on experimental setup, model architecture or size, baseline methods, number of runs, statistical tests, error bars, or controls for the multi-turn benchmark. This omission is load-bearing for the central claim that GCAD addresses KV-cache contamination.

minor comments (2)

[Abstract] The acronym GCAD is used in the abstract before its expansion; expand on first use for clarity.
[Methods] Notation for 'attention-delta' and 'gated cropped' components should be defined with a short equation or diagram in the methods section to aid reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights an important gap in the presentation of our experimental methodology. We address the major comment below and will revise the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [Results / Experimental Evaluation] The abstract and results section report specific numerical gains (coherence drift -18.6 to -1.9; trait expression 78.0 to 93.1) but supply no details on experimental setup, model architecture or size, baseline methods, number of runs, statistical tests, error bars, or controls for the multi-turn benchmark. This omission is load-bearing for the central claim that GCAD addresses KV-cache contamination.

Authors: We agree that the current manuscript lacks sufficient detail on the experimental protocol, which is necessary to substantiate the central claims. This was an oversight in the initial submission. In the revised version, we will add a dedicated Experimental Setup section (and expand the abstract's methods summary) specifying: the model (Llama-3-8B-Instruct), baseline methods (standard residual-stream steering, prompt-only steering, and uncropped attention-delta variants), number of runs (10 independent trials per condition using distinct random seeds), statistical tests (paired t-tests with reported p-values < 0.01), error bars (mean ± standard deviation across runs), and multi-turn benchmark controls (fixed conversation length of 10 turns, temperature=0.7, top-p=0.9, coherence metric definition via GPT-4o judge, and KV-cache reset protocols). These additions will directly demonstrate that the reported gains in coherence drift and trait expression stem from GCAD's mitigation of KV-cache contamination rather than confounding factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes GCAD as an empirical intervention on attention deltas extracted from system prompts to mitigate KV-cache contamination in activation steering. All load-bearing claims are supported by reported benchmark metrics (coherence drift, trait expression) rather than any equations, derivations, or analytical steps. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the method or results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that KV-cache contamination drives coherence loss and introduces GCAD as a new method without external validation beyond the reported numbers.

axioms (1)

domain assumption KV-cache contamination is a key failure mode that turns local steering perturbations into cumulative coherence degradation
Explicitly identified in the abstract as the main problem to solve.

invented entities (1)

Gated Cropped Attention-Delta steering (GCAD) no independent evidence
purpose: Extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating
Newly introduced method whose effectiveness is demonstrated only via the benchmark results in the abstract.

pith-pipeline@v0.9.0 · 5465 in / 1224 out tokens · 42844 ms · 2026-05-15T05:24:32.894155+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 10 internal anchors

[1]

Representation engineering for large-language models: Survey and research challenges.arXiv preprint arXiv:2502.17601,

Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, and Carsten Maple. Representation engineering for large-language models: Survey and research challenges.arXiv preprint arXiv:2502.17601,

work page arXiv
[2]

Do personality traits interfere? geometric limitations of steering in large language models.arXiv preprint arXiv:2602.15847,

Pranav Bhandari, Usman Naseem, and Mehwish Nasim. Do personality traits interfere? geometric limitations of steering in large language models.arXiv preprint arXiv:2602.15847,

work page arXiv
[3]

Under- standing (un)reliability of steering vectors in language models.arXiv preprint arXiv:2505.22637,

Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, and Dmitrii Krasheninnikov. Under- standing (un)reliability of steering vectors in language models.arXiv preprint arXiv:2505.22637,

work page arXiv
[4]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Stephen Cheng, Sarah Wiegreffe, and Dinesh Manocha. What drives representation steering? a mechanistic case study on steering refusal.arXiv preprint arXiv:2604.08524,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Primack, Summer Yue, and Chen Xing

10 Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, Dean Lee, Jeremy Kritz, Willow E. Primack, Summer Yue, and Chen Xing. MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs. InFindings of the Association for Computational Linguistics: ACL 2025,

work page 2025
[9]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Contextual Linear Activation Steering of Language Models

Brandon Hsu, Daniel Beaglehole, Adityanarayanan Radhakrishnan, and Mikhail Belkin. Contextual linear activation steering of language models.arXiv preprint arXiv:2604.24693,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering, February 2024

Sheng Liu, Haotian Ye, Lei Xing, and James Zou. In-context vectors: Making in context learning more effective and controllable through latent space steering.arXiv preprint arXiv:2311.06668, 2023a. Zheyuan Liu, Zhangchen Xu, Guangyao Dou, Xiangchi Yuan, Zhaoxuan Tan, Radha Poovendran, and Meng Jiang. Steering multimodal large language models decoding for c...

work page arXiv
[12]

Steering large language models using conceptors: Improving addition-based activation engineering.arXiv preprint arXiv:2410.16314,

Joris Postmus and Steven Abreu. Steering large language models using conceptors: Improving addition-based activation engineering.arXiv preprint arXiv:2410.16314,

work page arXiv
[13]

Improving Instruction-Following in Language Models through Activation Steering, April 2025

Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877,

work page arXiv
[14]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization.arXiv preprint arXiv:2308.10248,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Extending activation steering to broad skills and multiple behaviours.arXiv preprint arXiv:2403.05767,

11 Teun van der Weij, Massimo Poesio, and Nandi Schoots. Extending activation steering to broad skills and multiple behaviours.arXiv preprint arXiv:2403.05767,

work page arXiv
[16]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions.arXiv preprint arXiv:2404.13208,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Taxonomy, opportunities, and challenges of representation engineering for large language models.arXiv preprint arXiv:2502.19649,

Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, and Mario Fritz. Taxonomy, opportunities, and challenges of representation engineering for large language models.arXiv preprint arXiv:2502.19649,

work page arXiv
[19]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to A...

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Standard residual-stream baseline.For both models, the residual-stream baseline of Chen et al

Both models use rotary position embeddings, and both are loaded in half precision, with eager attention enabled when computing attention weights for vector extraction. Standard residual-stream baseline.For both models, the residual-stream baseline of Chen et al. (2025) is applied at a single layer using the response-phase mean-difference vector v(ℓ⋆) pers...

work page 2025
[23]

Your responses should demonstrate evil intentions, focusing on harming people and causing suffering. Show malice in your answers

The vector is added to the residual stream at every decoding step (Eq. 3), and the same 13 steering hook remains active across all 10 dialogue turns. The injection layer and steering coefficient differ slightly between the two models, following the configurations released by the original authors. On Qwen2.5-7B-Instruct we use injection layer ℓ⋆ = 20 and s...

work page 2025
[24]

makes absolutely no sense, the model generated text that is not even valid English

The contrastive key direction ¯K(ℓ,h) sys used by the gate is extracted in a parallel pass that computes the mean ofWk outputs over system-prompt tokens under the positive condition for each layer and KV head. The extraction set excludes any conversation whose response is judged below 50 on the trait scale, ensuring that the extracted directions reflect o...

work page 2025
[25]

Don’t say anything else, just the number. D Per-Trait Multi-Turn Results: Qwen2.5-7B-Instruct Tables 3 and 4 report the full per-trait, per-turn comparison on Qwen2.5-7B-Instruct for trait expres- sion and coherence respectively, across all 10 turns and with cumulative drift∆ =T10−T1 . Each cell averages 60 conversations (ngroups = 20,n samples = 3). E Pe...

work page 2025
[26]

Scores are produced by the gpt-4.1-mini-2025-04-14 judge with the prompts reproduced in Appendix C

on the same evaluation pool used in Tables 3 and 4.∆ =T5−T1 . Scores are produced by the gpt-4.1-mini-2025-04-14 judge with the prompts reproduced in Appendix C. Trait T1 T2 T3 T4 T5 Avg∆ RLHF-opposing evil 0.0 0.0 0.0 0.0 0.0 0.0+0.0 hallucinating 22.8 52.1 41.8 64.3 67.7 49.7+45.0 impolite 0.0 1.6 0.0 0.0 0.0 0.3+0.0 apathetic 1.9 9.1 5.4 5.7 1.1 4.7−0....

work page 2025