Adversarial Robustness of Activation Steering in Large Language Models

Kien Le; Thai Le

arxiv: 2606.07696 · v1 · pith:XSFQBUMMnew · submitted 2026-06-05 · 💻 cs.LG · cs.AI

Adversarial Robustness of Activation Steering in Large Language Models

Kien Le , Thai Le This is my paper

Pith reviewed 2026-06-27 22:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords activation steeringadversarial robustnesslarge language modelsdirectional robustnesslayer selectionresidual streammodel behavior controltext perturbations

0 comments

The pith

Adversarial text perturbations reduce activation steering directional robustness by up to 64% in LLMs and shift optimal layers by up to 17 positions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether activation steering, which controls LLM outputs by adding precomputed direction vectors to the residual stream at inference time, holds up when inputs are altered by adversarial text perturbations. It runs a systematic evaluation across four vector extraction methods, three attack strategies, six personas drawn from the Anthropic Model-Written Evaluation Dataset, and five models sized from 1.5B to 30B parameters. The results show that attacks broadly degrade performance: directional robustness falls by as much as 64 percent, post-attack confidence drops near or below 0.25, and steering strength weakens on nearly every steerable input. Optimal layer selection, identified on clean data, also moves by up to 17 positions once inputs are perturbed. The authors conclude that this brittleness is structural rather than tied to any single extraction or steering technique.

Core claim

Activation steering lacks robustness to adversarial perturbations on inputs. Directional robustness drops by up to 64 percent, post-attack confidence collapses near or below 0.25 across all methods and models, steering strength degrades on nearly every steerable input, and the optimal layer identified on clean data shifts by up to 17 positions under perturbation. Extracting vectors from perturbed inputs recovers some steerability for PCA and MD on mid-to-large models, yet still fails to locate the improved optimal layer.

What carries the argument

The evaluation of directional robustness, post-attack confidence, steering strength, and automated layer selection under three attack strategies applied to inputs from the Anthropic Model-Written Evaluation Dataset.

If this is right

Activation steering cannot be deployed reliably on real inputs without additional robustness measures.
Layer-selection procedures calibrated on clean data fail when inputs are perturbed.
Extracting vectors from adversarially perturbed inputs recovers partial steerability for PCA and MD on larger models but does not fix layer selection.
The observed brittleness applies across extraction methods, attack strategies, personas, and model sizes, indicating a general limitation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers may need to combine activation steering with input sanitization or adversarial training to achieve usable reliability.
The same input-sensitivity pattern could affect other inference-time control techniques that rely on fixed internal directions.
Future extraction procedures might benefit from explicitly optimizing for stability across small input changes rather than clean-data performance alone.

Load-bearing premise

The chosen adversarial perturbations, extraction methods, and six personas from the Anthropic Model-Written Evaluation Dataset are representative of realistic input variation encountered in deployment.

What would settle it

A controlled experiment that applies the same three attack strategies yet finds directional robustness remains above 90 percent and optimal-layer shifts stay within two positions for all tested models and methods.

Figures

Figures reproduced from arXiv: 2606.07696 by Kien Le, Thai Le.

**Figure 1.** Figure 1: LayerNavigator S-score across two tasks: Religion Following (top) and Conscientiousness (bottom) on Qwen3- 14B, averaged over three perturbations. Blue and Red line denote S-score calculated using steering vectors computed from Dtrain and D ′ train, respectively. Vertical lines show the largest S-score. The remaining score profiles are shown at Figure A1, Figure A2 and Figure A3 also quantify the L1 distan… view at source ↗

**Figure 2.** Figure 2: Per-layer averaged performance of Qwen3-14B across three perturbations on Religion Following and Conscientiousness. Blue and Red denote clean and adversarially trained steering vectors, respectively. The upper cluster (solid lines) shows performance on the clean set Dtest. The lower cluster (dashed lines) shows performance on the perturbed set D ′ test. Diamonds mark the layer selected by LayerNavigator. … view at source ↗

read the original abstract

Activation steering has become a popular training-free method to control LLM behavior by injecting precomputed direction vectors into the model's residual stream at inference time. Yet its robustness to realistic input variation remains unstudied. We present the first systematic evaluation of activation steering robustness under adversarial text perturbations on the inputs, covering four extraction methods, three attack strategies, six personas from Anthropic Model-Written Evaluation Dataset, and five models ranging from 1.5B to 30B parameters. Attacks succeed broadly across all settings: directional robustness drops by up to 64%, post-attack confidence collapses near or below 0.25 across all methods and models, and steering strength degrades on nearly every steerable input. Layer selection is equally fragile, with the optimal layer identified by an automated method on clean inputs shifting by up to 17 positions under perturbation, a failure that compounds the vector-level breakdown. Extracting vectors from adversarially perturbed inputs partially recovers steerability for PCA and MD on mid-to-large models, but they consistently fail to locate the improved optimal layer, limiting the practical benefit of this mitigation. Together, these findings reveal that the brittleness of activation steering is structural rather than method-specific, and that current layer selection strategies are not robust enough for real-world deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Activation steering shows clear brittleness to the tested adversarial perturbations, but the structural and deployment claims rest on unvalidated assumptions about input realism.

read the letter

Activation steering is brittle under the adversarial text changes tested here. Directional robustness falls by up to 64 percent, post-attack confidence drops near or below 0.25, steering strength degrades on most inputs, and the best layer shifts by as much as 17 positions.

The paper runs the first systematic check of this kind. It covers four extraction methods, three attack strategies, six personas from the Anthropic Model-Written Evaluation Dataset, and five models from 1.5B to 30B parameters. It measures the effects on robustness, confidence, strength, and layer choice, then tests one mitigation of extracting vectors from perturbed inputs. That mitigation recovers some steerability for PCA and MD on mid-to-large models but does not fix the layer problem. The results are consistent across settings, which is the main empirical contribution.

The work is a direct measurement study with no fitted parameters or circular derivations. The citation pattern is appropriate for an early paper on this narrow question.

The soft spot is the move from these specific attacks and personas to the claim that brittleness is structural and that current layer selection is not robust enough for real-world deployment. The paper does not compare the chosen perturbations or the six personas against other distributions or actual user inputs, so the practical scope of the finding is narrower than stated. If the attacks are stronger or less representative than typical variation, the deployment conclusion does not fully follow.

This paper is for researchers working on representation engineering and LLM control who need numbers on steering limits. A reader focused on reliability will get concrete data from it.

I would send it for peer review so the experimental details and the strength of the generalization can be checked.

Referee Report

1 major / 0 minor

Summary. The paper claims that activation steering in LLMs is brittle under adversarial text perturbations on inputs. Through a systematic evaluation covering four extraction methods, three attack strategies, six personas from the Anthropic Model-Written Evaluation Dataset, and five models (1.5B to 30B parameters), it reports that attacks cause directional robustness drops of up to 64%, post-attack confidence to collapse near or below 0.25, steering strength degradation on nearly all inputs, and optimal layer shifts of up to 17 positions. Extracting vectors from perturbed inputs partially recovers steerability for some methods on larger models but fails to identify improved layers. The paper concludes that the brittleness is structural rather than method-specific and that current layer selection strategies are not robust enough for real-world deployment.

Significance. If the experimental regime is representative of deployment conditions, the work would be significant for identifying a key practical limitation of activation steering, a widely used training-free control technique. The breadth of the evaluation across methods, attacks, personas, and model scales provides a useful empirical baseline that could inform more robust steering approaches. The finding that layer selection is also fragile adds a practical dimension beyond vector-level issues.

major comments (1)

[Abstract] Abstract: The central claim that 'the brittleness of activation steering is structural rather than method-specific' and that 'current layer selection strategies are not robust enough for real-world deployment' rests on the unverified assumption that the three attack strategies, four extraction methods, and six personas produce input variations representative of realistic deployment. No comparison to held-out user logs, alternative perturbation distributions, or other datasets is described to support generalizing from this specific experimental regime to the broader conclusion.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the scope of our generalization claims. We respond to the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'the brittleness of activation steering is structural rather than method-specific' and that 'current layer selection strategies are not robust enough for real-world deployment' rests on the unverified assumption that the three attack strategies, four extraction methods, and six personas produce input variations representative of realistic deployment. No comparison to held-out user logs, alternative perturbation distributions, or other datasets is described to support generalizing from this specific experimental regime to the broader conclusion.

Authors: We acknowledge that the manuscript does not include comparisons against held-out user logs, alternative perturbation distributions, or additional datasets beyond the Anthropic Model-Written Evaluation Dataset. Such external validation would require access to proprietary deployment data that is not available in this study. The chosen experimental regime draws on a standard public persona dataset and established adversarial attack strategies from the NLP literature precisely to probe robustness under controlled but diverse input variations. The structural claim is supported by the observed consistency of directional robustness drops, confidence collapse, and layer shifts across all four extraction methods, three attacks, six personas, and five model scales. Nevertheless, the referee is correct that the phrasing in the abstract regarding real-world deployment extrapolates beyond the tested regime. We will therefore revise the abstract to qualify the scope of the conclusions and add an explicit limitations paragraph discussing the chosen datasets and attacks. This is a partial revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation

full rationale

The paper conducts a direct empirical measurement campaign evaluating activation steering robustness under adversarial text perturbations, using external datasets (Anthropic Model-Written Evaluation Dataset), attack strategies, extraction methods, and models. No derivation chain, equations, fitted parameters presented as predictions, uniqueness theorems, or self-citations are described or invoked to support the central claims. All reported results (directional robustness drops, confidence collapse, layer shifts) are measurements against chosen experimental conditions rather than reductions to inputs by construction. This matches the default expectation of no circularity for empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the selected attack strategies and evaluation dataset capture realistic adversarial variation; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)

domain assumption Adversarial text perturbations generated by the three attack strategies constitute realistic input variation for steering evaluation.
Invoked implicitly by treating attack success as evidence of structural brittleness (abstract).

pith-pipeline@v0.9.1-grok · 5748 in / 1320 out tokens · 18457 ms · 2026-06-27T22:28:46.443686+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 13 canonical work pages

[1]

Advances in Neural Information Processing Systems , volume=

Analysing the generalisation and reliability of steering vectors , author=. Advances in Neural Information Processing Systems , volume=. 2024 , url=

2024
[2]

arXiv preprint arXiv:2601.03093 , year=

ATLAS: Adaptive Test-Time Latent Steering with External Verifiers for Enhancing LLMs Reasoning , author=. arXiv preprint arXiv:2601.03093 , year=

Pith/arXiv arXiv
[3]

Advances in Neural Information Processing Systems , volume=

Layernavigator: Finding promising intervention layers for efficient activation steering in large language models , author=. Advances in Neural Information Processing Systems , volume=. 2025 , url=

2025
[4]

Proceedings of the AAAI conference on artificial intelligence , volume=

Is bert really robust? a strong baseline for natural language attack on text classification and entailment , author=. Proceedings of the AAAI conference on artificial intelligence , volume=. 2020 , url=

2020
[5]

arXiv preprint arXiv:2308.10248 , year=

Steering language models with activation engineering , author=. arXiv preprint arXiv:2308.10248 , year=

Pith/arXiv arXiv
[6]

Steering

Rimsky, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828

work page doi:10.18653/v1/2024.acl-long.828 2024
[7]

Advances in Neural Information Processing Systems , volume=

Leace: Perfect linear concept erasure in closed form , author=. Advances in Neural Information Processing Systems , volume=. 2023 , url=

2023
[8]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Liu, Sheng and Ye, Haotian and Xing, Lei and Zou, James , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

2024
[9]

International conference on learning representations , volume=

Function vectors in large language models , author=. International conference on learning representations , volume=. 2024 , url=

2024
[10]

Advances in Neural Information Processing Systems , volume=

Refusal in language models is mediated by a single direction , author=. Advances in Neural Information Processing Systems , volume=. 2024 , url=

2024
[11]

arXiv preprint arXiv:2310.01405 , year=

Representation engineering: A top-down approach to ai transparency , author=. arXiv preprint arXiv:2310.01405 , year=

Pith/arXiv arXiv
[12]

2023 , journal=

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning , author=. 2023 , journal=

2023
[13]

2024 , journal=

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet , author=. 2024 , journal=

2024
[14]

Advances in Neural Information Processing Systems , volume=

Inference-time intervention: Eliciting truthful answers from a language model , author=. Advances in Neural Information Processing Systems , volume=. 2023 , url=

2023
[15]

arXiv preprint arXiv:2212.03827 , year=

Discovering latent knowledge in language models without supervision , author=. arXiv preprint arXiv:2212.03827 , year=

Pith/arXiv arXiv
[16]

E xtracting L atent S teering V ectors from P retrained L anguage M odels

Subramani, Nishant and Suresh, Nivedita and Peters, Matthew. Extracting Latent Steering Vectors from Pretrained Language Models. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.48

work page doi:10.18653/v1/2022.findings-acl.48 2022
[17]

Advances in Neural Information Processing Systems , volume=

Personalized steering of large language models: Versatile steering vectors through bi-directional preference optimization , author=. Advances in Neural Information Processing Systems , volume=. 2024 , url=

2024
[18]

Causal Learning and Reasoning , pages=

Finding alignments between interpretable causal variables and distributed neural representations , author=. Causal Learning and Reasoning , pages=. 2024 , organization=

2024
[19]

Advances in Neural Information Processing Systems , volume=

Reft: Representation finetuning for language models , author=. Advances in Neural Information Processing Systems , volume=. 2024 , url=

2024
[20]

arXiv preprint arXiv:2507.21509 , year=

Persona vectors: Monitoring and controlling character traits in language models , author=. arXiv preprint arXiv:2507.21509 , year=

Pith/arXiv arXiv
[21]

arXiv preprint arXiv:2412.10427 , year=

Identifying and manipulating personality traits in LLMs through activation engineering , author=. arXiv preprint arXiv:2412.10427 , year=

arXiv
[22]

Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLM s

Siddique, Zara and Khalid, Irtaza and Turner, Liam and Espinosa-Anke, Luis. Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLM s. Findings of the A ssociation for C omputational L inguistics: EACL 2026. 2026. doi:10.18653/v1/2026.findings-eacl.41

work page doi:10.18653/v1/2026.findings-eacl.41 2026
[23]

S afe S witch: Steering Unsafe LLM Behavior via Internal Activation Signals

Han, Peixuan and Qian, Cheng and Chen, Xiusi and Zhang, Yuji and Ji, Heng and Zhang, Denghui. S afe S witch: Steering Unsafe LLM Behavior via Internal Activation Signals. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.366

work page doi:10.18653/v1/2025.findings-emnlp.366 2025
[24]

arXiv preprint arXiv:2403.18680 , year=

Non-linear inference time intervention: Improving llm truthfulness , author=. arXiv preprint arXiv:2403.18680 , year=

arXiv
[25]

The Fourth Blogpost Track at ICLR 2025 , year=

Steering LLMs' Behavior with Concept Activation Vectors , author=. The Fourth Blogpost Track at ICLR 2025 , year=

2025
[26]

arXiv preprint arXiv:2502.18862 , year=

One-shot optimized steering vectors mediate safety-relevant behaviors in LLMs , author=. arXiv preprint arXiv:2502.18862 , year=

arXiv
[27]

Advances in Neural Information Processing Systems , volume=

Angular steering: Behavior control via rotation in activation space , author=. Advances in Neural Information Processing Systems , volume=. 2026 , url=

2026
[28]

arXiv preprint arXiv:2510.04309 , year=

Activation Steering with a Feedback Controller , author=. arXiv preprint arXiv:2510.04309 , year=

Pith/arXiv arXiv
[29]

arXiv preprint arXiv:2505.22637 , year=

Understanding (un) reliability of steering vectors in language models , author=. arXiv preprint arXiv:2505.22637 , year=

arXiv
[30]

C og S teer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models

Wang, Xintong and Pan, Jingheng and Ding, Liang and Wang, Longyue and Jiang, Longqin and Li, Xingshan and Biemann, Chris. C og S teer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1308

work page doi:10.18653/v1/2025.findings-acl.1308 2025
[31]

arXiv preprint arXiv:2502.17420 , year=

The geometry of refusal in large language models: Concept cones and representational independence , author=. arXiv preprint arXiv:2502.17420 , year=

arXiv
[32]

arXiv preprint arXiv:2410.17245 , year=

Towards reliable evaluation of behavior steering interventions in LLMs , author=. arXiv preprint arXiv:2410.17245 , year=

arXiv
[33]

International Conference on Learning Representations , volume=

Semantics-adaptive activation intervention for llms via dynamic steering vectors , author=. International Conference on Learning Representations , volume=. 2025 , url=

2025
[34]

International conference on learning representations , volume=

Programming refusal with conditional activation steering , author=. International conference on learning representations , volume=. 2025 , url=

2025
[35]

Multi-property Steering of Large Language Models with Dynamic Activation Composition

Scalena, Daniel and Sarti, Gabriele and Nissim, Malvina. Multi-property Steering of Large Language Models with Dynamic Activation Composition. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.34

work page doi:10.18653/v1/2024.blackboxnlp-1.34 2024
[36]

Proceedings of the ACM on Web Conference 2025 , pages=

Adaptive activation steering: A tuning-free llm truthfulness improvement method for diverse hallucinations categories , author=. Proceedings of the ACM on Web Conference 2025 , pages=. 2025 , url=

2025
[37]

arXiv preprint arXiv:2604.03867 , year=

Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment , author=. arXiv preprint arXiv:2604.03867 , year=

Pith/arXiv arXiv
[38]

Advances in Neural Information Processing Systems , volume=

Learning to steer: Input-dependent steering for multimodal llms , author=. Advances in Neural Information Processing Systems , volume=. 2026 , url=

2026
[39]

arXiv preprint arXiv:2603.12298 , year=

Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency , author=. arXiv preprint arXiv:2603.12298 , year=

arXiv
[40]

arXiv preprint arXiv:2601.19375 , year=

Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection , author=. arXiv preprint arXiv:2601.19375 , year=

arXiv
[41]

Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLM s

Bhandari, Pranav and Fay, Nicolas and Selvaganapathy, Sanjeevan and Datta, Amitava and Naseem, Usman and Nasim, Mehwish. Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLM s. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Papers). 20...

work page doi:10.18653/v1/2026.eacl-long.300 2026
[42]

BERT - ATTACK : Adversarial Attack Against BERT Using BERT

Li, Linyang and Ma, Ruotian and Guo, Qipeng and Xue, Xiangyang and Qiu, Xipeng. BERT - ATTACK : Adversarial Attack Against BERT Using BERT. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.500

work page doi:10.18653/v1/2020.emnlp-main.500 2020
[43]

arXiv preprint arXiv:1812.05271 , year=

Textbugger: Generating adversarial text against real-world applications , author=. arXiv preprint arXiv:1812.05271 , year=

Pith/arXiv arXiv
[44]

T ext A ttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP

Morris, John and Lifland, Eli and Yoo, Jin Yong and Grigsby, Jake and Jin, Di and Qi, Yanjun. T ext A ttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020. doi:10.18653/v1/2020.emnlp-demos.16

work page doi:10.18653/v1/2020.emnlp-demos.16 2020
[45]

arXiv preprint arXiv:2502.02716 , year=

A unified understanding and evaluation of steering methods , author=. arXiv preprint arXiv:2502.02716 , year=

arXiv
[46]

Findings of the Association for Computational Linguistics: ACL 2023 , year =

Perez, Ethan and Ringer, Sam and Lukosiute, Kamile and Nguyen, Karina and Chen, Edwin and Heiner, Scott and Pettit, Craig and Olsson, Catherine and Kundu, Sandipan and Kadavath, Saurav and Jones, Andy and Chen, Anna and Mann, Benjamin and Israel, Brian and Seethor, Bryan and McKinnon, Cameron and Olah, Christopher and Yan, Da and Amodei, Daniela and Amode...

work page doi:10.18653/v1/2023.findings-acl.847 2023
[47]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=. 2022 , url =

2022
[48]

arXiv preprint arXiv:1707.06347 , year=

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

Pith/arXiv arXiv
[49]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=. 2023 , url=

2023
[50]

Nature , volume=

Large language models encode clinical knowledge , author=. Nature , volume=. 2023 , publisher=

2023
[51]

Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences , volume=

Gpt-4 passes the bar exam , author=. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences , volume=. 2024 , publisher=

2024
[52]

Frontiers of Computer Science , volume=

A survey on large language model based autonomous agents , author=. Frontiers of Computer Science , volume=. 2024 , publisher=

2024
[53]

Journal of Machine Learning Research , volume=

Scaling instruction-finetuned language models , author=. Journal of Machine Learning Research , volume=. 2024 , url=

2024
[54]

2025 , url =

Anthropic , title =. 2025 , url =

2025
[55]

ACM Trans

Webber, William and Moffat, Alistair and Zobel, Justin , title =. ACM Trans. Inf. Syst. , month = nov, articleno =. 2010 , issue_date =. doi:10.1145/1852102.1852106 , abstract =

work page doi:10.1145/1852102.1852106 2010
[56]

arXiv preprint arXiv:2605.03907 , year=

Steer Like the LLM: Activation Steering that Mimics Prompting , author=. arXiv preprint arXiv:2605.03907 , year=

Pith/arXiv arXiv
[57]

arXiv preprint arXiv:2602.17560 , year=

Odesteer: A unified ode-based steering framework for llm alignment , author=. arXiv preprint arXiv:2602.17560 , year=

arXiv
[58]

John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray

Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray. Universal Sentence Encoder for E nglish. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System De...

work page doi:10.18653/v1/d18-2029 2018
[59]

and Su, Pei-Hao and Vandyke, David and Wen, Tsung-Hsien and Young, Steve

Mrk s i \'c , Nikola and \'O S \'e aghdha, Diarmuid and Thomson, Blaise and Ga s i \'c , Milica and Rojas-Barahona, Lina M. and Su, Pei-Hao and Vandyke, David and Wen, Tsung-Hsien and Young, Steve. Counter-fitting Word Vectors to Linguistic Constraints. Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational ...

work page doi:10.18653/v1/n16-1018 2016
[60]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025
[61]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv
[62]

arXiv preprint arXiv:2505.09388 , year=

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

Pith/arXiv arXiv
[63]

arXiv preprint arXiv:1412.6572 , year=

Explaining and harnessing adversarial examples , author=. arXiv preprint arXiv:1412.6572 , year=

Pith/arXiv arXiv
[64]

arXiv preprint arXiv:1706.06083 , year=

Towards deep learning models resistant to adversarial attacks , author=. arXiv preprint arXiv:1706.06083 , year=

Pith/arXiv arXiv
[65]

arXiv preprint arXiv:2511.01689 , year=

Open character training: Shaping the persona of AI assistants through constitutional AI , author=. arXiv preprint arXiv:2511.01689 , year=

arXiv
[66]

Transformer Circuits Thread , year=

Lindsey, Jack , title=. Transformer Circuits Thread , year=
[67]

arXiv preprint arXiv:2510.14318 , year=

Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL , author=. arXiv preprint arXiv:2510.14318 , year=

arXiv

[1] [1]

Advances in Neural Information Processing Systems , volume=

Analysing the generalisation and reliability of steering vectors , author=. Advances in Neural Information Processing Systems , volume=. 2024 , url=

2024

[2] [2]

arXiv preprint arXiv:2601.03093 , year=

ATLAS: Adaptive Test-Time Latent Steering with External Verifiers for Enhancing LLMs Reasoning , author=. arXiv preprint arXiv:2601.03093 , year=

Pith/arXiv arXiv

[3] [3]

Advances in Neural Information Processing Systems , volume=

Layernavigator: Finding promising intervention layers for efficient activation steering in large language models , author=. Advances in Neural Information Processing Systems , volume=. 2025 , url=

2025

[4] [4]

Proceedings of the AAAI conference on artificial intelligence , volume=

Is bert really robust? a strong baseline for natural language attack on text classification and entailment , author=. Proceedings of the AAAI conference on artificial intelligence , volume=. 2020 , url=

2020

[5] [5]

arXiv preprint arXiv:2308.10248 , year=

Steering language models with activation engineering , author=. arXiv preprint arXiv:2308.10248 , year=

Pith/arXiv arXiv

[6] [6]

Steering

Rimsky, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828

work page doi:10.18653/v1/2024.acl-long.828 2024

[7] [7]

Advances in Neural Information Processing Systems , volume=

Leace: Perfect linear concept erasure in closed form , author=. Advances in Neural Information Processing Systems , volume=. 2023 , url=

2023

[8] [8]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Liu, Sheng and Ye, Haotian and Xing, Lei and Zou, James , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

2024

[9] [9]

International conference on learning representations , volume=

Function vectors in large language models , author=. International conference on learning representations , volume=. 2024 , url=

2024

[10] [10]

Advances in Neural Information Processing Systems , volume=

Refusal in language models is mediated by a single direction , author=. Advances in Neural Information Processing Systems , volume=. 2024 , url=

2024

[11] [11]

arXiv preprint arXiv:2310.01405 , year=

Representation engineering: A top-down approach to ai transparency , author=. arXiv preprint arXiv:2310.01405 , year=

Pith/arXiv arXiv

[12] [12]

2023 , journal=

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning , author=. 2023 , journal=

2023

[13] [13]

2024 , journal=

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet , author=. 2024 , journal=

2024

[14] [14]

Advances in Neural Information Processing Systems , volume=

Inference-time intervention: Eliciting truthful answers from a language model , author=. Advances in Neural Information Processing Systems , volume=. 2023 , url=

2023

[15] [15]

arXiv preprint arXiv:2212.03827 , year=

Discovering latent knowledge in language models without supervision , author=. arXiv preprint arXiv:2212.03827 , year=

Pith/arXiv arXiv

[16] [16]

E xtracting L atent S teering V ectors from P retrained L anguage M odels

Subramani, Nishant and Suresh, Nivedita and Peters, Matthew. Extracting Latent Steering Vectors from Pretrained Language Models. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.48

work page doi:10.18653/v1/2022.findings-acl.48 2022

[17] [17]

Advances in Neural Information Processing Systems , volume=

Personalized steering of large language models: Versatile steering vectors through bi-directional preference optimization , author=. Advances in Neural Information Processing Systems , volume=. 2024 , url=

2024

[18] [18]

Causal Learning and Reasoning , pages=

Finding alignments between interpretable causal variables and distributed neural representations , author=. Causal Learning and Reasoning , pages=. 2024 , organization=

2024

[19] [19]

Advances in Neural Information Processing Systems , volume=

Reft: Representation finetuning for language models , author=. Advances in Neural Information Processing Systems , volume=. 2024 , url=

2024

[20] [20]

arXiv preprint arXiv:2507.21509 , year=

Persona vectors: Monitoring and controlling character traits in language models , author=. arXiv preprint arXiv:2507.21509 , year=

Pith/arXiv arXiv

[21] [21]

arXiv preprint arXiv:2412.10427 , year=

Identifying and manipulating personality traits in LLMs through activation engineering , author=. arXiv preprint arXiv:2412.10427 , year=

arXiv

[22] [22]

Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLM s

Siddique, Zara and Khalid, Irtaza and Turner, Liam and Espinosa-Anke, Luis. Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLM s. Findings of the A ssociation for C omputational L inguistics: EACL 2026. 2026. doi:10.18653/v1/2026.findings-eacl.41

work page doi:10.18653/v1/2026.findings-eacl.41 2026

[23] [23]

S afe S witch: Steering Unsafe LLM Behavior via Internal Activation Signals

Han, Peixuan and Qian, Cheng and Chen, Xiusi and Zhang, Yuji and Ji, Heng and Zhang, Denghui. S afe S witch: Steering Unsafe LLM Behavior via Internal Activation Signals. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.366

work page doi:10.18653/v1/2025.findings-emnlp.366 2025

[24] [24]

arXiv preprint arXiv:2403.18680 , year=

Non-linear inference time intervention: Improving llm truthfulness , author=. arXiv preprint arXiv:2403.18680 , year=

arXiv

[25] [25]

The Fourth Blogpost Track at ICLR 2025 , year=

Steering LLMs' Behavior with Concept Activation Vectors , author=. The Fourth Blogpost Track at ICLR 2025 , year=

2025

[26] [26]

arXiv preprint arXiv:2502.18862 , year=

One-shot optimized steering vectors mediate safety-relevant behaviors in LLMs , author=. arXiv preprint arXiv:2502.18862 , year=

arXiv

[27] [27]

Advances in Neural Information Processing Systems , volume=

Angular steering: Behavior control via rotation in activation space , author=. Advances in Neural Information Processing Systems , volume=. 2026 , url=

2026

[28] [28]

arXiv preprint arXiv:2510.04309 , year=

Activation Steering with a Feedback Controller , author=. arXiv preprint arXiv:2510.04309 , year=

Pith/arXiv arXiv

[29] [29]

arXiv preprint arXiv:2505.22637 , year=

Understanding (un) reliability of steering vectors in language models , author=. arXiv preprint arXiv:2505.22637 , year=

arXiv

[30] [30]

C og S teer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models

Wang, Xintong and Pan, Jingheng and Ding, Liang and Wang, Longyue and Jiang, Longqin and Li, Xingshan and Biemann, Chris. C og S teer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1308

work page doi:10.18653/v1/2025.findings-acl.1308 2025

[31] [31]

arXiv preprint arXiv:2502.17420 , year=

The geometry of refusal in large language models: Concept cones and representational independence , author=. arXiv preprint arXiv:2502.17420 , year=

arXiv

[32] [32]

arXiv preprint arXiv:2410.17245 , year=

Towards reliable evaluation of behavior steering interventions in LLMs , author=. arXiv preprint arXiv:2410.17245 , year=

arXiv

[33] [33]

International Conference on Learning Representations , volume=

Semantics-adaptive activation intervention for llms via dynamic steering vectors , author=. International Conference on Learning Representations , volume=. 2025 , url=

2025

[34] [34]

International conference on learning representations , volume=

Programming refusal with conditional activation steering , author=. International conference on learning representations , volume=. 2025 , url=

2025

[35] [35]

Multi-property Steering of Large Language Models with Dynamic Activation Composition

Scalena, Daniel and Sarti, Gabriele and Nissim, Malvina. Multi-property Steering of Large Language Models with Dynamic Activation Composition. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024. doi:10.18653/v1/2024.blackboxnlp-1.34

work page doi:10.18653/v1/2024.blackboxnlp-1.34 2024

[36] [36]

Proceedings of the ACM on Web Conference 2025 , pages=

Adaptive activation steering: A tuning-free llm truthfulness improvement method for diverse hallucinations categories , author=. Proceedings of the ACM on Web Conference 2025 , pages=. 2025 , url=

2025

[37] [37]

arXiv preprint arXiv:2604.03867 , year=

Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment , author=. arXiv preprint arXiv:2604.03867 , year=

Pith/arXiv arXiv

[38] [38]

Advances in Neural Information Processing Systems , volume=

Learning to steer: Input-dependent steering for multimodal llms , author=. Advances in Neural Information Processing Systems , volume=. 2026 , url=

2026

[39] [39]

arXiv preprint arXiv:2603.12298 , year=

Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency , author=. arXiv preprint arXiv:2603.12298 , year=

arXiv

[40] [40]

arXiv preprint arXiv:2601.19375 , year=

Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection , author=. arXiv preprint arXiv:2601.19375 , year=

arXiv

[41] [41]

Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLM s

Bhandari, Pranav and Fay, Nicolas and Selvaganapathy, Sanjeevan and Datta, Amitava and Naseem, Usman and Nasim, Mehwish. Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLM s. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Papers). 20...

work page doi:10.18653/v1/2026.eacl-long.300 2026

[42] [42]

BERT - ATTACK : Adversarial Attack Against BERT Using BERT

Li, Linyang and Ma, Ruotian and Guo, Qipeng and Xue, Xiangyang and Qiu, Xipeng. BERT - ATTACK : Adversarial Attack Against BERT Using BERT. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.500

work page doi:10.18653/v1/2020.emnlp-main.500 2020

[43] [43]

arXiv preprint arXiv:1812.05271 , year=

Textbugger: Generating adversarial text against real-world applications , author=. arXiv preprint arXiv:1812.05271 , year=

Pith/arXiv arXiv

[44] [44]

T ext A ttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP

Morris, John and Lifland, Eli and Yoo, Jin Yong and Grigsby, Jake and Jin, Di and Qi, Yanjun. T ext A ttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020. doi:10.18653/v1/2020.emnlp-demos.16

work page doi:10.18653/v1/2020.emnlp-demos.16 2020

[45] [45]

arXiv preprint arXiv:2502.02716 , year=

A unified understanding and evaluation of steering methods , author=. arXiv preprint arXiv:2502.02716 , year=

arXiv

[46] [46]

Findings of the Association for Computational Linguistics: ACL 2023 , year =

Perez, Ethan and Ringer, Sam and Lukosiute, Kamile and Nguyen, Karina and Chen, Edwin and Heiner, Scott and Pettit, Craig and Olsson, Catherine and Kundu, Sandipan and Kadavath, Saurav and Jones, Andy and Chen, Anna and Mann, Benjamin and Israel, Brian and Seethor, Bryan and McKinnon, Cameron and Olah, Christopher and Yan, Da and Amodei, Daniela and Amode...

work page doi:10.18653/v1/2023.findings-acl.847 2023

[47] [47]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=. 2022 , url =

2022

[48] [48]

arXiv preprint arXiv:1707.06347 , year=

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

Pith/arXiv arXiv

[49] [49]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=. 2023 , url=

2023

[50] [50]

Nature , volume=

Large language models encode clinical knowledge , author=. Nature , volume=. 2023 , publisher=

2023

[51] [51]

Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences , volume=

Gpt-4 passes the bar exam , author=. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences , volume=. 2024 , publisher=

2024

[52] [52]

Frontiers of Computer Science , volume=

A survey on large language model based autonomous agents , author=. Frontiers of Computer Science , volume=. 2024 , publisher=

2024

[53] [53]

Journal of Machine Learning Research , volume=

Scaling instruction-finetuned language models , author=. Journal of Machine Learning Research , volume=. 2024 , url=

2024

[54] [54]

2025 , url =

Anthropic , title =. 2025 , url =

2025

[55] [55]

ACM Trans

Webber, William and Moffat, Alistair and Zobel, Justin , title =. ACM Trans. Inf. Syst. , month = nov, articleno =. 2010 , issue_date =. doi:10.1145/1852102.1852106 , abstract =

work page doi:10.1145/1852102.1852106 2010

[56] [56]

arXiv preprint arXiv:2605.03907 , year=

Steer Like the LLM: Activation Steering that Mimics Prompting , author=. arXiv preprint arXiv:2605.03907 , year=

Pith/arXiv arXiv

[57] [57]

arXiv preprint arXiv:2602.17560 , year=

Odesteer: A unified ode-based steering framework for llm alignment , author=. arXiv preprint arXiv:2602.17560 , year=

arXiv

[58] [58]

John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray

Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray. Universal Sentence Encoder for E nglish. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System De...

work page doi:10.18653/v1/d18-2029 2018

[59] [59]

and Su, Pei-Hao and Vandyke, David and Wen, Tsung-Hsien and Young, Steve

Mrk s i \'c , Nikola and \'O S \'e aghdha, Diarmuid and Thomson, Blaise and Ga s i \'c , Milica and Rojas-Barahona, Lina M. and Su, Pei-Hao and Vandyke, David and Wen, Tsung-Hsien and Young, Steve. Counter-fitting Word Vectors to Linguistic Constraints. Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational ...

work page doi:10.18653/v1/n16-1018 2016

[60] [60]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025

[61] [61]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv

[62] [62]

arXiv preprint arXiv:2505.09388 , year=

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

Pith/arXiv arXiv

[63] [63]

arXiv preprint arXiv:1412.6572 , year=

Explaining and harnessing adversarial examples , author=. arXiv preprint arXiv:1412.6572 , year=

Pith/arXiv arXiv

[64] [64]

arXiv preprint arXiv:1706.06083 , year=

Towards deep learning models resistant to adversarial attacks , author=. arXiv preprint arXiv:1706.06083 , year=

Pith/arXiv arXiv

[65] [65]

arXiv preprint arXiv:2511.01689 , year=

Open character training: Shaping the persona of AI assistants through constitutional AI , author=. arXiv preprint arXiv:2511.01689 , year=

arXiv

[66] [66]

Transformer Circuits Thread , year=

Lindsey, Jack , title=. Transformer Circuits Thread , year=

[67] [67]

arXiv preprint arXiv:2510.14318 , year=

Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL , author=. arXiv preprint arXiv:2510.14318 , year=

arXiv