Contextual Linear Activation Steering of Language Models
Pith reviewed 2026-05-08 03:26 UTC · model grok-4.3
The pith
Adapting activation steering strength to each prompt improves language model control with limited data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contextual Linear Activation Steering computes per-prompt steering strengths rather than using a constant value, producing more consistent and higher-quality control over language model outputs than fixed-strength linear activation steering while remaining competitive with parameter-efficient fine-tuning techniques under data constraints.
What carries the argument
Contextual Linear Activation Steering (CLAS), which determines input-specific steering strengths to adjust activations dynamically instead of applying uniform strength.
Load-bearing premise
Suitable context-dependent steering strengths can be computed or learned scalably across diverse prompts, without introducing new inconsistencies or significant extra computation.
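The contrast between fixed-strength and context-dependent steering can be made concrete with a toy sketch. The rule below for choosing the per-input strength (move every hidden state to a fixed target projection along the steering direction) is a hypothetical illustration, not the paper's actual CLAS rule; `v`, `target`, and the function names are assumptions for this sketch only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                          # toy hidden dimension
v = rng.normal(size=d)
v /= np.linalg.norm(v)          # unit steering direction

def fixed_steer(h, alpha=4.0):
    """Standard linear activation steering: add a constant multiple of v."""
    return h + alpha * v

def contextual_steer(h, target=4.0):
    """Hypothetical contextual rule: pick a per-input strength so every
    hidden state ends at the same projection `target` along v."""
    alpha = target - h @ v      # strength depends on the input
    return h + alpha * v

h_base = rng.normal(size=d)
h_shifted = h_base + 10.0 * v   # a prompt already far along the steering direction

for h in (h_base, h_shifted):
    print(f"fixed: {fixed_steer(h) @ v:+.2f}   contextual: {contextual_steer(h) @ v:+.2f}")
```

Fixed steering shifts both inputs by the same amount, so their final projections along `v` still differ by 10; the contextual rule lands both at exactly the target projection, which is the kind of per-prompt consistency the premise requires.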
What would settle it
A new suite of steering benchmarks where the context-dependent version shows no improvement or clear degradation relative to fixed-strength steering would falsify the central performance claim.
Figures
Original abstract
Linear activation steering is a powerful approach for eliciting the capabilities of large language models and specializing their behavior using limited labeled data. While effective, existing methods often apply a fixed steering strength to all tokens, resulting in inconsistent steering quality across diverse input prompts. In this work, we introduce Contextual Linear Activation Steering (CLAS), a method that dynamically adapts linear activation steering to context-dependent steering strengths. Across eleven steering benchmarks and four model families, it consistently outperforms standard linear activation steering and matches or exceeds the performance of ReFT and LoRA in settings with limited labeled data. We therefore propose CLAS as a scalable, interpretable, and accurate method for specializing and steering large language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Contextual Linear Activation Steering (CLAS), which extends linear activation steering by dynamically computing context-dependent steering strengths instead of using a fixed value for all tokens. The central empirical claim is that CLAS outperforms standard linear activation steering across eleven steering benchmarks and four model families while matching or exceeding ReFT and LoRA performance in limited-labeled-data regimes, positioning CLAS as a scalable and interpretable alternative for LLM specialization.
Significance. If the reported gains hold under rigorous controls, CLAS would represent a targeted, low-overhead improvement to activation steering that mitigates prompt-dependent inconsistency without sacrificing the method's interpretability or data efficiency. The work's strength lies in its direct empirical comparison to both fixed-strength baselines and parameter-efficient fine-tuning methods on a broad benchmark suite.
Minor comments (3)
- [§3] Method: The precise functional form used to derive per-token or per-prompt steering strengths from context should be stated explicitly, including any learned parameters or heuristics, both to allow replication and to clarify why the approach remains more scalable than ReFT/LoRA.
- [§4] Experiments: Table 1 and Figure 2 would benefit from standard deviations across multiple random seeds or prompt shuffles, since the headline claim of "consistent" outperformance rests on these aggregate numbers.
- [§4.2] A brief ablation isolating the contribution of the context-adaptation mechanism versus simply using a stronger fixed steering vector would strengthen the causal story.
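The first comment asks for the functional form of the strength. One simple candidate, sketched below, is an affine map from the hidden state to a scalar, fit by least squares on labeled examples. This is a hypothetical illustration of why such a mechanism can stay lightweight, not the paper's actual parameterization; `H`, `alpha_true`, and `strength` are names invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 64
H = rng.normal(size=(n, d))                  # hidden states from labeled prompts
w_true = rng.normal(size=d)
alpha_true = H @ w_true + 0.5                # oracle per-prompt strengths (toy, noise-free)

# Fit alpha(h) = w . h + b by least squares: one vector plus a bias,
# i.e. d + 1 parameters, far fewer than a LoRA/ReFT adapter for the same layer.
X = np.hstack([H, np.ones((n, 1))])
wb, *_ = np.linalg.lstsq(X, alpha_true, rcond=None)

def strength(h):
    """Predicted context-dependent steering strength for one hidden state."""
    return h @ wb[:-1] + wb[-1]

pred = np.array([strength(h) for h in H])
print(float(np.max(np.abs(pred - alpha_true))))
```

Under this (assumed) form, the scalability argument reduces to a parameter count: the adaptation mechanism adds only `d + 1` learned values per steered layer.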
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision for our manuscript on Contextual Linear Activation Steering (CLAS). We are pleased that the work is viewed as a targeted improvement to activation steering with strong empirical support across benchmarks and model families. As the report lists no specific major comments, we have no point-by-point rebuttals to provide at this stage.
Circularity Check
No significant circularity
Full rationale
The paper introduces CLAS as an empirical method for context-dependent linear activation steering and supports its claims solely through performance comparisons on eleven benchmarks across four model families. No equations, derivations, or mathematical chains are described in the provided abstract or structure; the central results are external benchmark outcomes that remain independently falsifiable and do not reduce to self-definitions, fitted parameters renamed as predictions, or self-citation chains. Background citations to prior steering work function as standard context rather than load-bearing premises that collapse into the present contribution.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
- Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions. GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.
- Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions. GCAD steering extracts prompt-based attention deltas and gates them at token level, cutting coherence drift from -18.6 to -1.9 while raising trait expression at turn 10 from 78 to 93 on multi-turn persona benchmarks.
Reference graph
Works this paper leans on
- [1] Abirate. English quotes dataset. https://huggingface.co/datasets/Abirate/english_quotes, 2023.
- [2] Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, and Y. Goldberg. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=BJh6Ztuxl.
- [3] G. Alain and Y. Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
- [4]
- [5] L. Bartoszcze, S. Munshi, B. Sukidi, J. Yen, Z. Yang, D. Williams-King, L. Le, K. Asuzu, and C. Maple. Representation engineering for large-language models: Survey and research challenges. arXiv preprint arXiv:2502.17601, 2025.
- [6] D. Beaglehole, A. Radhakrishnan, E. Boix-Adserà, and M. Belkin. Toward universal steering and monitoring of AI models. Science, 391(6787):787–792, 2026. doi: 10.1126/science.aea6792. URL https://www.science.org/doi/abs/10.1126/science.aea6792.
- [7] Y. Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022.
- [8] Y. Belinkov, N. Durrani, F. Dalvi, H. Sajjad, and J. Glass. What do neural machine translation models learn about morphology? In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada, July 2017. Association for Computational Linguistics.
- [9] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. In Conference on Neural Information Processing Systems, 2020.
- [10] P. Chao, A. Robey, E. Dobler, C. Butoi, L. He, E. Myers, Z. Doan, A. Chen, P. Chaudhari, and A. Zou. JailbreakBench: An open robustness benchmark for jailbreaking LLMs. https://huggingface.co/datasets/jailbreakhub/jailbreakbench, 2024.
- [11] R. Chen, A. Arditi, H. Sleight, O. Evans, and J. Lindsey. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509, 2025.
- [12] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [13] P. Davarmanesh, A. Wilson, and A. Radhakrishnan. Efficient and accurate steering of large language models through attention-guided feature learning, 2026. URL https://arxiv.org/abs/2602.00333.
- [14] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [15] Greengerong. Leetcode benchmark dataset. https://huggingface.co/datasets/greengerong/leetcode, 2024.
- [16] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. International Conference on Learning Representations, 2021.
- [18]
- [19] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
- [20] Y. Jiang, G. Rajendran, P. K. Ravikumar, B. Aragam, and V. Veitch. On the origins of linear representations in large language models. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, 2024.
- [21] K. Konen, S. Jentzsch, D. Diallo, P. Schütt, O. Bensch, R. El Baff, D. Opitz, and T. Hecking. Style vectors for steering generative large language models. In Findings of the Association for Computational Linguistics: EACL 2024, pages 782–802, Mar. 2024.
- [22] K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=aLLuYpn83y.
- [23] Z. Lin, Z. Wang, Y. Tong, Y. Wang, Y. Guo, Y. Wang, and J. Shang. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4694–4702, Singapore, Dec. 2023. Association for Computational Linguistics.
- [24] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 2011.
- [25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Conference on Neural Information Processing Systems, volume 26, 2013.
- [26] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.
- [27] N. Nanda, A. Lee, and M. Wattenberg. Emergent linear representations in world models of self-supervised sequence models. In Y. Belinkov, S. Hao, J. Jumelet, N. Kim, A. McCarthy, and H. Mohebbi, editors, Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 16–30, Singapore, Dec. 2023. Association for Computational Linguistics.
- [28] K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models. In Workshop on Causal Representation Learning at Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=T0PoOJg8cK.
- [29] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
- [30] A. Radhakrishnan, D. Beaglehole, P. Pandit, and M. Belkin. Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Science, 383(6690):1461–1467, 2024. doi: 10.1126/science.adi5639.
- [31] N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. Turner. Steering Llama 2 via contrastive activation addition. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 15504–15522, Aug. 2024.
- [32] S. Syed, M. Voelske, M. Potthast, and B. Stein. Dataset for generating TL;DR, 2018.
- [33] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford Alpaca: An instruction-following Llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- [34] I. Tenney, D. Das, and E. Pavlick. BERT rediscovers the classical NLP pipeline. In A. Korhonen, D. Traum, and L. Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1452. URL https://aclantholo...
- [35] A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.
- [36] T. van der Weij, M. Poesio, and N. Schoots. Extending activation steering to broad skills and multiple behaviours. arXiv preprint arXiv:2403.05767, 2024.
- [37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
- [38] Z. Wu, A. Arora, Z. Wang, A. Geiger, D. Jurafsky, C. D. Manning, and C. Potts. ReFT: Representation finetuning for language models. Advances in Neural Information Processing Systems, 37:63908–63962, 2024.
- [39] Q. A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y.-C. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, Z. Qiu, ...
- [40] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2):1–124, 2023.
- [41] A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023.