pith. machine review for the scientific record.

arxiv: 2605.06342 · v1 · submitted 2026-05-07 · 💻 cs.CL

Recognition: unknown

Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:19 UTC · model grok-4.3

classification 💻 cs.CL
keywords: activation steering · large language models · attention rerouting · key-orthogonal projections · steering vectors · model utility · reasoning preservation · long-context retrieval

The pith

Projecting steering vectors orthogonal to key vectors of focus tokens preserves reasoning while steering LLM behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that activation steering often harms reasoning and retrieval because it reroutes attention away from the few tokens the model needs most. Steering via Key-Orthogonal Projections (SKOP) counters this by projecting the steering vector to be orthogonal to the keys of a small set of focus tokens, so attention weights on those tokens stay nearly unchanged while steering still influences the rest of the distribution. If this holds, steering becomes usable in practical settings where current methods destroy performance on long contexts or multi-step tasks. The central test is whether the method delivers strong behavior change with far less utility loss than standard steering vectors.

Core claim

Steering via Key-Orthogonal Projections (SKOP) constrains the steering vector to lie in the subspace orthogonal to the key projections of focus tokens. This keeps the model's attention pattern on those tokens intact, blocks the harmful rerouting that vanilla steering produces, and still allows redistribution of attention among less critical tokens. Across steering benchmarks the approach cuts utility degradation by a factor of five to seven while retaining more than 95 percent of vanilla steering efficacy, and it continues to work in long-context retrieval tasks where standard methods fail.

What carries the argument

Steering via Key-Orthogonal Projections (SKOP) projects the steering vector orthogonal to the key vectors of a small set of focus tokens, so that attention scores on those tokens are preserved while steering efficacy on the less critical tail tokens is retained.
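
The page does not reproduce the paper's exact construction, but the stated property pins down its shape: with query-side steering, removing from the steering vector any component lying in the span of the focus-token keys leaves the attention logits on those tokens untouched. A minimal sketch under that assumption; all names, shapes, and the QR-based construction are ours for illustration, not the paper's:

```python
import numpy as np

def key_orthogonal_projection(v, focus_keys):
    """Project steering vector v onto the subspace orthogonal to the
    span of the focus-token keys.

    v          : (d,) raw steering vector for one attention head
    focus_keys : (m, d) key vectors of the m focus tokens, with m << d
    """
    # Orthonormal basis Q (d, m) for span(focus_keys) via thin QR.
    Q, _ = np.linalg.qr(focus_keys.T)
    # v' = (I - Q Q^T) v removes the component inside that span,
    # so v' . k_f = 0 for every focus-token key k_f.
    return v - Q @ (Q.T @ v)

# Illustration: query-side steering with the projected vector leaves
# attention logits on focus tokens exactly unchanged.
rng = np.random.default_rng(0)
d, m = 64, 5
keys = rng.normal(size=(m, d))   # hypothetical focus-token keys
v = rng.normal(size=d)           # raw steering vector
q = rng.normal(size=d)           # a query vector
lam = 4.0                        # steering strength
v_skop = key_orthogonal_projection(v, keys)
assert np.allclose(keys @ (q + lam * v_skop), keys @ q)
```

Because only the component inside the small focus-key span is removed (per Figure 2, typically a dozen or so tokens per head), most of the steering vector's norm survives the projection, which is consistent with the reported retention of over 95 percent of vanilla steering efficacy.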

If this is right

  • SKOP delivers the strongest joint steering-utility trade-off of the methods tested, with 5-7 times less utility degradation.
  • More than 95 percent of vanilla steering efficacy is retained.
  • In long-context retrieval tasks where standard steering collapses, SKOP maintains robust performance by avoiding attention shifts on key tokens.
  • The method works by allowing attention redistribution only among tail tokens while holding focus-token attention fixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the focus-token set can be identified automatically from the task, the same projection idea could be applied to other activation edits such as representation surgery or circuit editing.
  • The approach suggests that many current steering failures are local attention artifacts rather than global capacity limits, so similar orthogonality constraints might improve safety interventions without broad capability damage.
  • Testing whether the same key-orthogonal idea transfers to non-transformer architectures or to multimodal models would show how general the attention-preservation principle is.

Load-bearing premise

Attention rerouting away from a small set of contextually important tokens is the main reason steering degrades utility, and freezing attention only on those tokens is enough to keep reasoning and retrieval intact.

What would settle it

If applying SKOP still produces large drops in attention weight on the chosen focus tokens or fails to reduce utility loss below vanilla steering levels in controlled experiments, the claim that key-orthogonal projection prevents harmful rerouting would be refuted.
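
A concrete form of that check, sketched under the assumption that per-head, post-softmax attention rows can be extracted from a base and a steered forward pass; the helper names are hypothetical:

```python
import numpy as np

def focus_attention_mass(attn_row, focus_idx):
    """Attention mass one query position places on the focus tokens.

    attn_row  : (num_tokens,) post-softmax attention weights of a single
                query position for one head
    focus_idx : indices of the focus tokens in the context
    """
    return float(np.asarray(attn_row)[focus_idx].sum())

def rerouting_drop(attn_base, attn_steered, focus_idx):
    """Drop in focus-token attention mass caused by steering.

    Near zero  -> focus attention left intact (SKOP's claimed behavior).
    Large drop -> the harmful rerouting the paper blames for utility loss.
    """
    return (focus_attention_mass(attn_base, focus_idx)
            - focus_attention_mass(attn_steered, focus_idx))
```

Tracking this quantity per head under vanilla steering and under SKOP, alongside utility benchmarks, appears to be the comparison Figure 7 reports; large SKOP-side drops, or drops near zero without any utility gain, would each count against the mechanistic claim.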

Figures

Figures reproduced from arXiv:2605.06342 by Haoyan Luo, Mateja Jamnik, and Mateo Espinosa Zarlenga.

Figure 1. Attention rerouting due to activation steering, a key contributor to the trade-off between steering efficacy and utility preservation.
Figure 2. (A): Focus sets are small and stable across context lengths. We group evaluation samples by total context length and, for each group, report the per-head focus-set size |H(ℓ,h)| across layers, where |H(ℓ,h)| is the minimum number of tokens needed to cover τ_high = 0.8 of the attention mass. Focus sets remain small (typically ≲ 15 tokens) even as context length grows from ∼100 to ∼360 tokens. (B): Focus-set…
Figure 3. Steering via Key-Orthogonal Projection (SKOP) preserves attention on focus tokens while…
Figure 4. Steering-utility trade-off for LLaMA3.1-…
Figure 5. Effect of varying steering strength λ on steering efficacy and utility preservation for SKOP on LLaMA-3.1-8B-Instruct. As λ increases, vanilla query steering vectors achieve slightly higher steering scores but suffer severe utility degradation. In contrast, SKOP maintains strong utility preservation across all λ while preserving most steering effectiveness.
Figure 6. Case study illustrating model failure under power-seeking query-space steering on GSM8K.
Figure 7. Focus-set attention mass preservation under vanilla query-space steering and under SKOP.
Figure 8. Distribution of head risk scores R(ℓ,h) (Eq. 16) across the four steering tasks. The distribution is long-tailed: only a small minority of heads attain high risk scores, motivating SKOP’s selective application to the top-k risk heads.
Figure 9. Effect of selective projection and head selection criteria on the steering–utility trade-off.
Figure 10. Eigenvalue distributions of the centred key covariance matrices.
Figure 11. Eigenvalue distributions of the key-difference second-moment matrices.
Figure 12. Layer-wise norms of steering vectors before and after SKOP projection on the Power task.
Figure 13. Layer-wise norms of steering vectors before and after SKOP projection on the Wealth task.
Figure 14. Layer-wise norms of steering vectors before and after SKOP projection on the Corr task.
Figure 15. Layer-wise norms of steering vectors before and after SKOP projection on the TQA task.
Figure 16. Steering-utility trade-off for Gemma-9B-IT. We report the average of Power, Wealth,…
Figure 17. Calibration size ablation. Effect of calibration set size on steering efficacy and utility…
Figure 18. Sample-efficiency comparison with LoRA on LLaMA-3.1-8B-Instruct.
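
Figure 2 defines the focus set H(ℓ,h) as the smallest set of tokens covering τ_high = 0.8 of a head's attention mass. A minimal sketch of that coverage computation, with names ours rather than the paper's:

```python
import numpy as np

def focus_set(attn_row, tau_high=0.8):
    """Smallest set of token indices whose attention weights sum to at
    least tau_high (Figure 2's focus set for one query position).

    attn_row : (num_tokens,) post-softmax attention weights, summing to 1
    """
    order = np.argsort(attn_row)[::-1]        # tokens by descending weight
    covered = np.cumsum(attn_row[order])      # running attention mass
    k = int(np.searchsorted(covered, tau_high)) + 1
    return order[:k]

# Example: a sharply peaked head covers 80% of its mass with two tokens.
attn = np.array([0.55, 0.30, 0.05, 0.04, 0.03, 0.03])
print(focus_set(attn))   # -> [0 1]
```

Per Figure 2, this count stays small (typically ≲ 15 tokens) even as contexts grow to roughly 360 tokens, which is what keeps the orthogonality constraint low-rank and cheap to enforce.
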
Original abstract

Activation steering controls LLM behaviour towards target behaviour by intervening in internal representations, yet it often degrades reasoning and retrieval performance. We argue that a primary cause of this trade-off is attention rerouting: steering vectors alter query-key matching, shifting attention away from contextually important tokens toward less informative ones. To address this, we propose Steering via Key-Orthogonal Projections (SKOP), a steering method that constrains harmful attention rerouting without eliminating steering efficacy. SKOP achieves this by preserving attention patterns on a small set of focus tokens the model relies on for reasoning and retrieval, while allowing redistribution among less critical tail tokens. Across multiple steering benchmarks, we show that SKOP achieves the best joint steering-utility trade-off, reducing utility degradation by 5-7x while retaining over 95% of vanilla steering efficacy. Our results further suggest that, in long-context retrieval settings where vanilla steering approaches are ineffective, SKOP can maintain robust performance by avoiding attention rerouting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Steering via Key-Orthogonal Projections (SKOP) to address the steering-utility trade-off in activation steering of LLMs. It identifies attention rerouting away from contextually important 'focus tokens' as the primary cause of utility degradation in reasoning and retrieval. SKOP constrains projections to preserve attention on these focus tokens while permitting redistribution among tail tokens. The abstract claims SKOP yields the best joint trade-off, reducing utility degradation by 5-7x relative to vanilla steering while retaining >95% of steering efficacy, and maintains performance in long-context retrieval where vanilla methods fail.

Significance. If the mechanistic account and quantitative results hold after proper validation, the work would be significant for practical deployment of activation steering, as it targets a plausible source of capability degradation without fully sacrificing control. The focus on attention preservation in long-context settings is particularly relevant given current limitations of steering methods.

major comments (3)
  1. [Abstract] The manuscript states specific quantitative claims (5-7x utility degradation reduction, >95% retention of steering efficacy, robustness in long-context retrieval) but provides no experimental details, baselines, datasets, statistical tests, or implementation specifics. This prevents assessment of whether the data support the claims.
  2. [Introduction, §4 Experiments] The central claim that attention rerouting is the dominant cause of utility loss, and that SKOP fixes it by preserving attention on focus tokens, lacks direct supporting evidence. Neither quantitative tracking of attention mass on identified focus tokens under vanilla vs. SKOP steering nor ablations showing utility recovery when rerouting is blocked by alternative means are described. Alternative explanations (e.g., reduced effective steering strength or representation changes) therefore cannot be excluded.
  3. [Method, SKOP definition] The procedure for identifying the small set of focus tokens and the exact construction of the key-orthogonal projection (including any implicit assumptions about query-key matching) must be specified with sufficient formality to confirm it avoids collateral damage to other attention or representation properties.
minor comments (2)
  1. [Method] Notation for the projection operator and focus-token selection criterion should be introduced with an explicit equation early in the method section for clarity.
  2. [Experiments] Figure captions and table headers should explicitly state the steering strength, number of runs, and error bars used for the reported trade-off curves.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, providing clarifications from the manuscript and indicating where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] The manuscript states specific quantitative claims (5-7x utility degradation reduction, >95% retention of steering efficacy, robustness in long-context retrieval) but provides no experimental details, baselines, datasets, statistical tests, or implementation specifics. This prevents assessment of whether the data support the claims.

    Authors: We agree that the abstract, constrained by length, does not enumerate all experimental details. The reported figures derive from the comprehensive evaluation in §4, which uses standard activation steering benchmarks for reasoning and retrieval, compares against vanilla steering and other baselines, and reports results across multiple datasets and model scales. We will revise the abstract to include a concise reference to the evaluation protocol, key datasets, and baselines. We will also ensure any variance across runs is noted to address statistical considerations. revision: yes

  2. Referee: [Introduction, §4 Experiments] The central claim that attention rerouting is the dominant cause of utility loss, and that SKOP fixes it by preserving attention on focus tokens, lacks direct supporting evidence. Neither quantitative tracking of attention mass on identified focus tokens under vanilla vs. SKOP steering nor ablations showing utility recovery when rerouting is blocked by alternative means are described. Alternative explanations (e.g., reduced effective steering strength or representation changes) therefore cannot be excluded.

    Authors: The manuscript supports the mechanistic account through the design of SKOP and the observed joint improvements in steering efficacy and utility preservation, which are inconsistent with simple reductions in steering strength. Nevertheless, we recognize that explicit quantitative tracking of attention mass on focus tokens would provide stronger direct evidence. In the revision we will add analyses comparing attention distributions on focus tokens under vanilla steering versus SKOP, together with ablations that constrain rerouting through alternative mechanisms to isolate its contribution relative to other factors. revision: yes

  3. Referee: [Method, SKOP definition] The procedure for identifying the small set of focus tokens and the exact construction of the key-orthogonal projection (including any implicit assumptions about query-key matching) must be specified with sufficient formality to confirm it avoids collateral damage to other attention or representation properties.

    Authors: The method section defines SKOP via projections orthogonal to the keys of a selected set of focus tokens, thereby preserving their attention weights while permitting redistribution among tail tokens. We will expand this section with a fully formal specification: the mathematical definition of the key-orthogonal projection operator, the precise criteria and algorithm for selecting focus tokens (based on task-relevant attention patterns), and an explicit discussion of the underlying assumptions about query-key dot-product matching. This formalization will also include checks confirming that the intervention does not produce unintended side effects on other attention heads or representation subspaces. revision: yes
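
For concreteness, a formalization of the promised kind would plausibly take the following shape; the notation is our reconstruction from the stated orthogonality property, not quoted from the manuscript:

```latex
% Hypothetical reconstruction; notation ours, not the paper's.
% K_F stacks the focus-token key vectors as columns; v is the raw
% steering vector and \lambda the steering strength.
\[
  P_F^{\perp} \;=\; I - K_F \bigl(K_F^{\top} K_F\bigr)^{-1} K_F^{\top},
  \qquad
  \tilde{v} \;=\; P_F^{\perp}\, v .
\]
% Query-side steering with \tilde{v} then fixes every focus-token logit,
\[
  \bigl(q + \lambda \tilde{v}\bigr)^{\top} k_f \;=\; q^{\top} k_f
  \qquad \text{for all } f \in F,
\]
% while logits on tail tokens remain free to shift.
```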

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper's core argument begins with an empirical observation that activation steering degrades utility, posits attention rerouting as the primary mechanism, and introduces SKOP as a projection-based intervention that preserves focus-token attention patterns by construction of its orthogonal constraint. This is a forward proposal rather than a reduction: the method is defined mathematically to enforce the desired property, then evaluated on benchmarks. No step equates a claimed prediction or first-principles result back to its own fitted inputs, self-citations, or renamed patterns. The 5-7x utility claim is presented as an empirical outcome, not derived tautologically from the steering vector itself. The long-context robustness statement is likewise an observed result, not a definitional consequence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that attention rerouting is the dominant cause of utility loss and that focus-token preservation is sufficient; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Attention rerouting is a primary cause of the observed trade-off between steering efficacy and utility in activation steering.
    Explicitly stated in the abstract as the argument motivating the method.

pith-pipeline@v0.9.0 · 5467 in / 1206 out tokens · 41291 ms · 2026-05-08T10:19:29.825902+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 13 canonical work pages · 7 internal anchors

  1. [1]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

  2. [2]

    Refusal in language models is mediated by a single direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In NeurIPS, 2024

  3. [3]

    Steering large language model activations in sparse spaces

    Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, and Pascal Vincent. Steering large language model activations in sparse spaces. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=VGw1viYliK

  4. [4]

    PIQA: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

  5. [5]

    Personalized steering of large language models: Versatile steering vectors through bi-directional preference optimization

    Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, and Jinghui Chen. Personalized steering of large language models: Versatile steering vectors through bi-directional preference optimization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=7qJFkuZdYo

  6. [6]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1, 2018

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021

  8. [8]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...

  9. [9]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. CoRR, abs/2404.04475, 2024

  10. [10]

    New, improved multiple-choice TruthfulQA

    Owain Evans, James Chua, and Steph Lin. New, improved multiple-choice TruthfulQA, 2025. URL https://www.alignmentforum.org/posts/Bunfwz6JsNd44kgLT/new-improved-multiple-choice-truthfulqa

  11. [13]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016

  12. [14]

    RULER: What’s the real context size of your long-context language models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=kIoBbc76Sy

  13. [15]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  14. [16]

    Needle in a haystack - pressure testing LLMs

    Gregory Kamradt. Needle in a haystack - pressure testing LLMs. https://github.com/gkamradt/LLMTest_NeedleInAHaystack/tree/main, 2023

  15. [17]

    The NarrativeQA reading comprehension challenge

    Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018. doi:10.1162/tacl_a_00023. URL https://aclanthology.org/Q18-1023/

  16. [18]

    Programming refusal with conditional activation steering

    Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Oi47wc10sm

  17. [19]

    Inference-time intervention: Eliciting truthful answers from a language model

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024

  18. [20]

    TruthfulQA: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022

  19. [21]

    From understanding to utilization: A survey on explainability for large language models, 2024

    Haoyan Luo and Lucia Specia. From understanding to utilization: A survey on explainability for large language models, 2024. URL https://arxiv.org/abs/2401.12874

  20. [22]

    Linguistic regularities in continuous space word representations

    Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia,...

  21. [23]

    Multi-attribute steering of language models via targeted intervention

    Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Multi-attribute steering of language models via targeted intervention. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20619–20634...

  22. [24]

    Steering language model refusal with sparse autoencoders

    Kyle O’Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, and Forough Poursabzi-Sangdeh. Steering language model refusal with sparse autoencoders. arXiv:2411.11296, 2024. URL https://arxiv.org/abs/2411.11296

  23. [25]

    The linear representation hypothesis and the geometry of large language models

    Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In ICML, 2024

  24. [26]

    Discovering language model behaviors with model-written evaluations

    Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13387–13434, 2023

  25. [27]

    Generalizing verifiable instruction following

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=yfYgwjj5F8

  26. [28]

    Spectral editing of activations for large language model alignment

    Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo Maria Ponti, and Shay Cohen. Spectral editing of activations for large language model alignment. Advances in Neural Information Processing Systems, 37:56958–56987, 2024

  27. [29]

    Improving sparse decomposition of language model activations with gated sparse autoencoders

    Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, and Neel Nanda. Improving sparse decomposition of language model activations with gated sparse autoencoders. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=Ppj5KvzU8Q

  28. [30]

    Steering Llama 2 via Contrastive Activation Addition

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering Llama 2 via contrastive activation addition. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand, Augu...

  29. [31]

    Gemma 2: Improving open language models at a practical size

    Morgane Rivière, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momch...

  30. [32]

    Controlling language and diffusion models by transporting activations

    Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, and Xavier Suau. Controlling language and diffusion models by transporting activations. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=l2zFn6TIQi

  31. [33]

    AlphaSteer: Learning refusal steering with principled null-space constraint

    Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, and Tat-Seng Chua. AlphaSteer: Learning refusal steering with principled null-space constraint, 2025. URL https://arxiv.org/abs/2506.07022

  32. [34]

    Representation surgery: Theory and practice of affine steering

    Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, and Ponnurangam Kumaraguru. Representation surgery: Theory and practice of affine steering. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference ...

  33. [35]

    Improving instruction-following in language models through activation steering

    Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. Improving instruction-following in language models through activation steering. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=wozhdnRCtw

  34. [36]

    Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...

  35. [37]

    DISCO: Disentangled communication steering for large language models

    Max Torop, Aria Masoomi, Masih Eskandar, and Jennifer Dy. DISCO: Disentangled communication steering for large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=c8AjdgdHnD

  36. [38]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering. arXiv:2308.10248, 2024. URL https://arxiv.org/abs/2308.10248

  37. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  38. [40]

    Angular steering: Behavior control via rotation in activation space

    Hieu M. Vu and Tan Minh Nguyen. Angular steering: Behavior control via rotation in activation space. In 2nd Workshop on Models of Human Feedback for AI Alignment, 2025. URL https://openreview.net/forum?id=GU2UeVZrSw

  39. [41]

    Semantics-adaptive activation intervention for LLMs via dynamic steering vectors

    Weixuan Wang, Jingyuan Yang, and Wei Peng. Semantics-adaptive activation intervention for LLMs via dynamic steering vectors. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=8WQ7VTfPTl

  40. [42]

    ReFT: Representation finetuning for language models

    Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D Manning, and Christopher Potts. ReFT: Representation finetuning for language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=fykjplMc0V

  41. [43]

    AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders

    Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=K2CckZjNy0

  42. [44]

    HellaSwag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

  43. [45]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to A...