arxiv: 2605.03907 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.AI· cs.LG

Recognition: unknown

Steer Like the LLM: Activation Steering that Mimics Prompting

Geert Heyman , Frederik Vandeputte

Authors on Pith no claims yet

Pith reviewed 2026-05-07 16:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords activation steeringprompt steeringtoken-specific interventionslanguage model controlinference-time editingPSR modelssteering benchmarks

0 comments

The pith

Models that learn token-specific steering strengths from prompts outperform standard activation methods on control tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that prompting steers language models by applying large changes to some tokens and almost none to others, whereas common activation techniques use the same strength everywhere. It introduces prompt steering replacement models trained to predict those varying strengths directly from activations so they can copy the selective pattern. This matters because activation steering runs faster and requires no input rewriting, yet it has lagged prompting; matching the selectivity could make lightweight control reliable. Experiments across three benchmarks and several models show the new models beat prior activation baselines, especially on coherent outputs, and reach prompting levels on persona and AxBench tasks. The central insight is that the performance gap stems from mismatched intervention patterns rather than steering vectors themselves.

Core claim

Prompt steering applies strong interventions on certain tokens while leaving others nearly untouched. Existing activation methods apply a fixed coefficient across all tokens. Prompt Steering Replacement models are small networks trained on prompt-steered versus baseline activations to output per-token scaling factors, thereby reproducing the selective intervention pattern and achieving stronger steering results than fixed-coefficient baselines.

What carries the argument

Prompt Steering Replacement (PSR) models: lightweight predictors that read layer activations and output token-specific coefficients to scale a steering vector.

If this is right

PSR models achieve higher success rates than prior activation steering on three standard benchmarks.
They maintain higher output coherence when results are filtered for quality.
On AxBench and persona steering tasks they reach performance levels comparable to direct prompting.
The token-specific approach works across multiple language models without architecture changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gap between prompting and activation steering may shrink further if PSR training data includes more diverse prompt strategies.
Similar per-token prediction could be applied to other control objectives such as safety or style transfer.
One could test whether PSR models remain effective when the underlying steering vector is itself learned rather than hand-crafted.
This framing suggests activation interventions can be made more expressive without increasing inference cost.

Load-bearing premise

The performance difference between prompting and activation steering is caused mainly by the lack of token-specific strength variation, and training on observed prompt behavior will transfer to new tasks and models.

What would settle it

Evaluate PSR models on a fresh steering benchmark or an unseen language model and measure whether they still outperform fixed-coefficient activation steering and match prompting.

Figures

Figures reproduced from arXiv: 2605.03907 by Frederik Vandeputte, Geert Heyman.

**Figure 1.** Figure 1: Illustration of how prompt steering interventions ∆P S can be computed by subtracting prompt-steered activations from the corresponding unsteered activations (left and center). Prompt Steering Replacement (PSR) models approximate these interventions, but only on cases where prompt steering successfully elicits the target attribute (right) view at source ↗

**Figure 2.** Figure 2: Strength of prompt steering interventions on Llama-3.2- 3B, layer 16, across token positions (x-axis) for randomly sampled completions that are prompt-steered towards sycophancy (y-axis). intervening on the LLM’s internal activations. Formally, if Ayi denotes the activations for the i th response token at a given layer l, 3 then we can write single-attribute activation steering as follows:4 Ayi|AS = Ayi + … view at source ↗

**Figure 3.** Figure 3: Relative RMSE between the accumulated interventions of prompt steering (∆P Sacc ) versus those of other steering methods (∆Xacc ), averaged on prompt steering predictions on the Persona Vectors sycophantic evaluation data for Llama-3.2-3B. performs constant steering in the single-layer setting and for 3 out of 4 language models in the all-layer setting. The bottom part of view at source ↗

**Figure 4.** Figure 4: Prompt intervention magnitude across layers, measured locally (per-layer) and accumulatively, in both absolute and relative terms. The relative magnitudes are computed by dividing by the norm of the prompt-steered activations ∥AP S∥. 15 view at source ↗

**Figure 5.** Figure 5: Heatmaps of prompt intervention magnitude per token across layers for different traits on randomly selected examples. (a)–(c) show local interventions ∥∆P Sloc ∥; (d)–(f) show accumulative interventions ∥∆P Sacc ∥, normalized per layer by dividing by their total sum to better reveal the effect in early layers. 16 view at source ↗

**Figure 6.** Figure 6: Example 1, Layer 16: Intervention strength ∥∆X(·)∥2 per token for different steering methods on a sycophantic evaluation example (Llama-3.2-3B-Instruct). 17 view at source ↗

**Figure 7.** Figure 7: Example 1, Layer 26: Intervention strength ∥∆X(·)∥2 per token for different steering methods on a sycophantic evaluation example (Llama-3.2-3B-Instruct). 18 view at source ↗

**Figure 8.** Figure 8: Example 2, Layer 16: Intervention strength ∥∆X(·)∥2 per token for different steering methods on a sycophantic evaluation example (Llama-3.2-3B-Instruct). 19 view at source ↗

**Figure 9.** Figure 9: Example 2, Layer 26: Intervention strength ∥∆X(·)∥2 per token for different steering methods on a sycophantic evaluation example (Llama-3.2-3B-Instruct). 20 view at source ↗

**Figure 10.** Figure 10: Prompt template for coherence judge Jcoher (Chen et al., 2025) that is used for the Persona Vectors benchmark. I am evaluating a different language model. I want to assess the coherence of its answer - please focus solely on coherence, treating this as independent of other traits. I asked the model the following evaluation question: [QUESTION START] {question} [QUESTION END] The model answered with: [ANSW… view at source ↗

**Figure 11.** Figure 11: Slight variation on the coherence judge prompt template from Chen et al. (2025), which is used on the IFEval benchmark. 21 view at source ↗

**Figure 12.** Figure 12: Trait alignment on the Persona Vectors dataset after steering Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Qwen2.5-7bInstruct. Trait alignments for all steering methods are computed at prompt steering coherence. E.2. IFEval Additional Results view at source ↗

**Figure 13.** Figure 13: Trait alignment-coherence curves for different steering methods on the Persona Vectors dataset. Some curves have a region where trait alignment and coherence both go down, this points to oversteering (i.e., when α values are set too high for a method). 25 view at source ↗

**Figure 14.** Figure 14: shows the relative RMSE between activations produced by prompt steering versus other steering methods, averaged on prompt steering predictions on the Persona Vectors evaluation data for Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Qwen2.5-7b-Instruct, respectively. (a) Llama-3.2-3B-Instruct. (b) Llama-3.1-8B-Instruct. (c) Qwen2.5-7b-Instruct view at source ↗

**Figure 15.** Figure 15: Cosine similarity between steering vectors learned by different methods for sycophancy on Llama-3.2-3B-Instruct. 30 view at source ↗

read the original abstract

Large language models can be steered at inference time through prompting or activation interventions, but activation steering methods often underperform compared to prompt-based approaches. We propose a framework that formulates prompt steering as a form of activation steering and investigates whether distilling successful prompt steering behavior into simpler, interpretable models can close this gap. Our analysis reveals that popular activation steering methods are not faithful to the mechanics of prompt steering, which applies strong interventions on some tokens while barely affecting others. Based on these insights, we introduce Prompt Steering Replacement (PSR) models that estimate token-specific steering coefficients from the activations themselves and are trained to imitate prompt-based interventions. Experiments on three steering benchmarks across multiple language models show that PSR models outperform existing activation steering methods, especially when controlling for high-coherence completions, and also compare favorably to prompting on AxBench and persona steering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PSR models learn token-specific activation steering to better match prompting's selective effects, with benchmark gains, but generalization beyond training data remains unverified.

read the letter

The main point is that this paper frames prompting as token-varying activation steering and trains PSR models to predict those varying coefficients directly from activations, aiming to close the gap with full prompting while keeping the efficiency of steering vectors. They correctly note that standard activation methods apply uniform strength, unlike prompts which hit some tokens hard and others lightly. The experiments report that PSR beats prior activation techniques on three benchmarks, holds up better when controlling for coherent outputs, and comes close to prompting on AxBench and persona tasks across several models. That is a concrete step forward in making activation steering more faithful to how prompts actually work. The soft spot is generalization. The training process imitates prompt behavior on held-out data, but if those prompts or models overlap with the test distributions, the per-token predictions could be capturing benchmark-specific patterns rather than a robust mapping. The abstract does not spell out the exact data splits, cross-model training, or ablations that would rule this out, so the claim that PSR captures the general mechanics stays provisional. This is aimed at people building inference-time control tools who want something lighter than prompting but stronger than fixed vectors. Readers focused on interpretability or practical steering will get value from the token-specific formulation and the direct comparisons. It deserves a serious referee because the idea is new, the mismatch with existing methods is well-motivated, and the results point to a workable direction even if more checks on robustness are required. Send it to review and ask for the training details and extra generalization tests.

Referee Report

3 major / 2 minor

Summary. The paper introduces Prompt Steering Replacement (PSR) models that estimate token-specific steering coefficients from activations to mimic the behavior of prompt-based steering in LLMs. It argues that existing activation steering methods fail to replicate the uneven, token-dependent interventions of prompting, and demonstrates through experiments on three steering benchmarks across multiple models that PSR models outperform prior activation methods, particularly under coherence controls, and perform comparably to prompting on tasks like AxBench and persona steering.

Significance. If the results hold and the PSR approach generalizes, this could provide a practical way to achieve prompt-like steering effects through activation interventions, improving both performance and interpretability in LLM control. The work highlights a key mismatch in intervention mechanics and offers a data-driven way to bridge prompting and activation steering.

major comments (3)

[Methods (PSR training)] The training data for the PSR models must be explicitly described with respect to overlap with the three steering benchmarks used in evaluation. If the prompts or activations used for training PSR are drawn from the same distributions as the test sets, the reported outperformance may reflect memorization of token-specific patterns rather than learning a general mapping from activations to prompt-like steering coefficients, undermining the claim of capturing the mechanics of prompt steering.
[Experiments and Results] The abstract and results claim superior performance 'especially when controlling for high-coherence completions,' but without details on how coherence is measured, the exact quantitative improvements (e.g., effect sizes, standard errors), and statistical significance tests in the relevant tables or figures, it is difficult to evaluate whether the gap is robust or driven by the token-specific estimation.
[Analysis of intervention strength] The paper identifies that prompt steering applies strong interventions on some tokens while barely affecting others as the key difference from existing methods. However, there is no ablation study isolating whether training a per-token estimator (vs. uniform) is sufficient to close the gap, or if other aspects of the PSR architecture contribute; this is load-bearing for attributing the success to mimicking the token-specific mechanics.

minor comments (2)

[Notation] The definition of the PSR model architecture and how coefficients are applied during inference could be clarified with a pseudocode or equation in the methods section to improve reproducibility.
[References] Ensure all prior activation steering methods mentioned (e.g., in the comparison) are cited with specific papers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses

Referee: [Methods (PSR training)] The training data for the PSR models must be explicitly described with respect to overlap with the three steering benchmarks used in evaluation. If the prompts or activations used for training PSR are drawn from the same distributions as the test sets, the reported outperformance may reflect memorization of token-specific patterns rather than learning a general mapping from activations to prompt-like steering coefficients, undermining the claim of capturing the mechanics of prompt steering.

Authors: We agree that explicit description of the training data is essential to rule out memorization concerns. The PSR models were trained on activations and prompt-steering pairs generated from a held-out corpus of general instruction-following and persona-related prompts that do not overlap with the AxBench, persona steering, or coherence-controlled evaluation sets. In the revised manuscript we will add a dedicated subsection under Methods that lists the exact data sources, sizes, and generation procedures, along with a statement confirming zero prompt/activation overlap with the test distributions. This will substantiate that PSR learns a general activation-to-coefficient mapping rather than benchmark-specific patterns. revision: yes
Referee: [Experiments and Results] The abstract and results claim superior performance 'especially when controlling for high-coherence completions,' but without details on how coherence is measured, the exact quantitative improvements (e.g., effect sizes, standard errors), and statistical significance tests in the relevant tables or figures, it is difficult to evaluate whether the gap is robust or driven by the token-specific estimation.

Authors: We acknowledge that the current presentation lacks sufficient quantitative detail. In the revision we will expand the Experiments section to: (1) specify the coherence metric (perplexity under a reference model with a fixed threshold), (2) report effect sizes, standard errors, and p-values from paired statistical tests for all PSR vs. baseline comparisons under the high-coherence filter, and (3) update the relevant tables and figures with these statistics. These additions will allow readers to assess the robustness of the reported gains independently of the token-specific modeling choice. revision: yes
Referee: [Analysis of intervention strength] The paper identifies that prompt steering applies strong interventions on some tokens while barely affecting others as the key difference from existing methods. However, there is no ablation study isolating whether training a per-token estimator (vs. uniform) is sufficient to close the gap, or if other aspects of the PSR architecture contribute; this is load-bearing for attributing the success to mimicking the token-specific mechanics.

Authors: We appreciate the request for a targeted ablation. While the performance advantage of PSR over existing uniform-coefficient methods already suggests that token-specific estimation is the primary driver, an explicit controlled comparison would strengthen the causal claim. In the revised manuscript we will add an ablation that trains a uniform-coefficient variant of the PSR architecture (i.e., a single scalar per layer instead of per-token coefficients) and directly compares it to the full per-token PSR on the same benchmarks. This will isolate the contribution of the token-specific estimator while holding other architectural elements fixed. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical training and evaluation on held-out benchmarks

full rationale

The paper trains PSR models on data to imitate observed prompt-based token-specific interventions and evaluates the resulting models on separate steering benchmarks (AxBench, persona steering, and coherence-controlled tasks) across multiple LLMs. This is a standard supervised learning setup with held-out test distributions rather than any derivation that reduces the claimed performance gains to fitted parameters of the target claim itself or to self-referential definitions. The analysis identifying mismatches in existing activation methods is presented as an empirical observation of intervention strength differences, not a definitional equivalence. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked that would collapse the results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that prompt steering can be faithfully approximated by per-token activation interventions learned from data, plus the empirical claim that existing methods fail to match this pattern.

axioms (1)

domain assumption Prompt steering applies strong interventions on some tokens while barely affecting others.
Stated in the abstract as the key insight motivating PSR.

invented entities (1)

Prompt Steering Replacement (PSR) models no independent evidence
purpose: Estimate token-specific steering coefficients from activations to imitate prompt interventions.
New model class introduced to close the gap between prompting and activation steering.

pith-pipeline@v0.9.0 · 5442 in / 1251 out tokens · 45705 ms · 2026-05-07T16:14:50.564417+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 24 canonical work pages · 5 internal anchors

[1]

and Potts, Christopher , month = mar, year =

Wu, Zhengxuan and Arora, Aryaman and Geiger, Atticus and Wang, Zheng and Huang, Jing and Jurafsky, Dan and Manning, Christopher D. and Potts, Christopher , month = mar, year =. Proceedings of the 42nd
[2]

Chen, Runjin and Arditi, Andy and Sleight, Henry and Evans, Owain and Lindsey, Jack , month = sep, year =. Persona. doi:10.48550/arXiv.2507.21509 , abstract =

work page internal anchor Pith review doi:10.48550/arxiv.2507.21509
[3]

doi:10.48550/arXiv.2502.04556 , abstract =

Wang, Hanyu and Cao, Bochuan and Cao, Yuanpu and Chen, Jinghui , month = feb, year =. doi:10.48550/arXiv.2502.04556 , abstract =

work page doi:10.48550/arxiv.2502.04556
[4]

Nguyen, Duy and Prasad, Archiki and Stengel-Eskin, Elias and Bansal, Mohit , month = jul, year =. Multi-. Proceedings of the 63rd. doi:10.18653/v1/2025.acl-long.1007 , abstract =

work page doi:10.18653/v1/2025.acl-long.1007 2025
[5]

doi:10.48550/arXiv.2506.03292 , abstract =

Sun, Jiuding and Baskaran, Sidharth and Wu, Zhengxuan and Sklar, Michael and Potts, Christopher and Geiger, Atticus , month = jun, year =. doi:10.48550/arXiv.2506.03292 , abstract =

work page doi:10.48550/arxiv.2506.03292
[6]

Improved

Wu, Zhengxuan and Yu, Qinan and Arora, Aryaman and Manning, Christopher D and Potts, Christopher , year =. Improved. Proceedings in
[7]

Oozeer, Narmeen and Marks, Luke and Barez, Fazl and Abdullah, Amirali , file =. Beyond. Findings of the
[8]

Learning without training: The implicit dynamics of in-context learning.arXiv preprint arXiv:2507.16003, 2025

Dherin, Benoit and Munn, Michael and Mazzawi, Hanna and Wunder, Michael and Gonzalvo, Javier , month = dec, year =. Learning without training:. doi:10.48550/arXiv.2507.16003 , abstract =

work page doi:10.48550/arxiv.2507.16003
[9]

Explainability for

Zhao, Haiyan and Chen, Hanjie and Yang, Fan and Liu, Ninghao and Deng, Huiqi and Cai, Hengyi and Wang, Shuaiqiang and Yin, Dawei and Du, Mengnan , month = nov, year =. Explainability for. doi:10.48550/arXiv.2309.01029 , abstract =

work page doi:10.48550/arxiv.2309.01029
[10]

Representation Engineering: A Top-Down Approach to AI Transparency

Zou, Andy and Phan, Long and Chen, Sarah and Campbell, James and Guo, Phillip and Ren, Richard and Pan, Alexander and Yin, Xuwang and Mazeika, Mantas and Dombrowski, Ann-Kathrin and Goel, Shashwat and Li, Nathaniel and Byun, Michael J. and Wang, Zifan and Mallen, Alex and Basart, Steven and Koyejo, Sanmi and Song, Dawn and Fredrikson, Matt and Kolter, J. ...

work page internal anchor Pith review doi:10.48550/arxiv.2310.01405
[11]

Transformer Circuits Thread , author =

Circuit. Transformer Circuits Thread , author =
[12]

Improving

Stolfo, Alessandro and Balachandran, Vidhisha and Yousefi, Safoora and Horvitz, Eric and Nushi, Besmira , year =. Improving. The
[13]

Evaluating

Zeng, Zhiyuan and Yu, Jiatong and Gao, Tianyu and Meng, Yu and Goyal, Tanya and Chen, Danqi , year =. Evaluating. The
[14]

Instruction-Following Evaluation for Large Language Models

Zhou, Jeffrey and Lu, Tianjian and Mishra, Swaroop and Brahma, Siddhartha and Basu, Sujoy and Luan, Yi and Zhou, Denny and Hou, Le , year =. Instruction-. doi:10.48550/ARXIV.2311.07911 , journal =

work page Pith review doi:10.48550/arxiv.2311.07911
[15]

Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Kexin and Y...

work page Pith review doi:10.48550/arxiv.2412.15115
[16]

The Llama 3 Herd of Models

The. arXiv preprint arXiv:2407.21783 , author =

work page internal anchor Pith review arXiv
[17]

Extracting Latent Steering Vectors from Pretrained Language Models

Subramani, Nishant and Suresh, Nivedita and Peters, Matthew E. , editor =. Extracting. 2022 , pages =. doi:10.18653/V1/2022.FINDINGS-ACL.48 , booktitle =

work page doi:10.18653/v1/2022.findings-acl.48 2022
[18]

Steering Language Models With Activation Engineering

Turner, Alexander Matt and Thiergart, Lisa and Udell, David and Leech, Gavin and Mini, Ulisse and MacDiarmid, Monte , year =. Activation. doi:10.48550/ARXIV.2308.10248 , journal =

work page internal anchor Pith review doi:10.48550/arxiv.2308.10248
[19]

and Pfister, Hanspeter and Wattenberg, Martin , editor =

Li, Kenneth and Patel, Oam and Viégas, Fernanda B. and Pfister, Hanspeter and Wattenberg, Martin , editor =. Inference-. Advances in. 2023 , file =

2023
[20]

Advancing

Wu, Muling and Liu, Wenhao and Wang, Xiaohua and Li, Tianlong and Lv, Changze and Ling, Zixuan and Zhu, Jianhao and Zhang, Cenyuan and Zheng, Xiaoqing and Huang, Xuanjing , editor =. Advancing. 2024 , pages =. doi:10.18653/V1/2024.ACL-LONG.726 , booktitle =

work page doi:10.18653/v1/2024.acl-long.726 2024
[21]

Plug and

Dathathri, Sumanth and Madotto, Andrea and Lan, Janice and Hung, Jane and Frank, Eric and Molino, Piero and Yosinski, Jason and Liu, Rosanne , year =. Plug and. 8th
[22]

, year =

Liu, Sheng and Ye, Haotian and Xing, Lei and Zou, James Y. , year =. In-context. Forty-first
[23]

Marks, Samuel and Tegmark, Max , year =. The. doi:10.48550/ARXIV.2310.06824 , journal =

work page internal anchor Pith review doi:10.48550/arxiv.2310.06824
[24]

Extending activation steering to broad skills and multiple behaviours.arXiv preprint arXiv:2403.05767,

Weij, Teun van der and Poesio, Massimo and Schoots, Nandi , year =. Extending. doi:10.48550/ARXIV.2403.05767 , journal =

work page doi:10.48550/arxiv.2403.05767
[25]

Steering Llama 2 via Contrastive Activation Addition

Rimsky, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander Matt , editor =. Steering. 2024 , pages =. doi:10.18653/V1/2024.ACL-LONG.828 , booktitle =

work page doi:10.18653/v1/2024.acl-long.828 2024
[26]

Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information

Giulianelli, Mario and Harding, Jack and Mohnert, Florian and Hupkes, Dieuwke and Zuidema, Willem H. , editor =. Under the. 2018 , pages =. doi:10.18653/V1/W18-5426 , booktitle =

work page doi:10.18653/v1/w18-5426 2018
[27]

and Freeman, William T

Bau, David and Zhu, Jun-Yan and Strobelt, Hendrik and Zhou, Bolei and Tenenbaum, Joshua B. and Freeman, William T. and Torralba, Antonio , year =. Visualizing and. Deep
[28]

Thomas and Linzen, Tal and Smolensky, Paul , editor =

Soulos, Paul and McCoy, R. Thomas and Linzen, Tal and Smolensky, Paul , editor =. Discovering the. 2020 , pages =. doi:10.18653/V1/2020.BLACKBOXNLP-1.23 , booktitle =

work page doi:10.18653/v1/2020.blackboxnlp-1.23 2020
[29]

Counterfactuals uncover the modular structure of deep generative models , url =

Besserve, Michel and Mehrjou, Arash and Sun, Rémy and Schölkopf, Bernhard , year =. Counterfactuals uncover the modular structure of deep generative models , url =. 8th
[30]

Semantics-

Wang, Weixuan and Yang, Jingyuan and Peng, Wei , year =. Semantics-. The
[31]

Li, Xiang Lisa and Liang, Percy , editor =. Prefix-. 2021 , pages =. doi:10.18653/V1/2021.ACL-LONG.353 , booktitle =

work page doi:10.18653/v1/2021.acl-long.353 2021
[32]

Lester, Brian and Al-Rfou, Rami and Constant, Noah , editor =. The. 2021 , pages =. doi:10.18653/V1/2021.EMNLP-MAIN.243 , booktitle =

work page doi:10.18653/v1/2021.emnlp-main.243 2021
[33]

Language

Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya , year =. Language
[34]

Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...
[35]

Logan IV, Eric Wallace, and Sameer Singh

Shin, Taylor and Razeghi, Yasaman and IV, Robert L. Logan and Wallace, Eric and Singh, Sameer , editor =. 2020 , pages =. doi:10.18653/V1/2020.EMNLP-MAIN.346 , booktitle =

work page doi:10.18653/v1/2020.emnlp-main.346 2020
[36]

Zhou, Yongchao and Muresanu, Andrei Ioan and Han, Ziwen and Paster, Keiran and Pitis, Silviu and Chan, Harris and Ba, Jimmy , year =. Large. The
[37]

Journal of Machine Learning Research , author =

Causal abstraction:. Journal of Machine Learning Research , author =. 2025 , pages =

2025
[38]

Foundational. Trans. Mach. Learn. Res. , author =
[39]

and Potts, Christopher , editor =

Wu, Zhengxuan and Arora, Aryaman and Wang, Zheng and Geiger, Atticus and Jurafsky, Dan and Manning, Christopher D. and Potts, Christopher , editor =. Advances in. 2024 , file =

2024
[40]

Fine-tuning aligned language models compromises safety, even when users do not intend to! , journal =

Qi, Xiangyu and Zeng, Yi and Xie, Tinghao and Chen, Pin-Yu and Jia, Ruoxi and Mittal, Prateek and Henderson, Peter , year =. Fine-tuning aligned language models compromises safety, even when users do not intend to! , journal =
[41]

Park, Kiho and Choe, Yo Joong and Veitch, Victor , year =. The. Forty-first
[42]

Videodpo: Omni- preference alignment for video diffusion generation

Wang, Han and Wang, Gang and Zhang, Huan , year =. Steering. doi:10.1109/CVPR52734.2025.02787 , booktitle =

work page doi:10.1109/cvpr52734.2025.02787 2025
[43]

2021 , note =

CoRR , author =. 2021 , note =

2021
[44]

and Bewley, Tom and Mishra, Saumitra and Veloso, Manuela , editor =

Hedström, Anna and Amoukou, Salim I. and Bewley, Tom and Mishra, Saumitra and Veloso, Manuela , editor =. To. Forty-second
[45]

Vogels, Arthur and Wong, Benjamin and Choho, Yann and Blangero, Annabelle and Bhan, Milan , year =. In-. doi:10.48550/ARXIV.2510.13285 , journal =

work page doi:10.48550/arxiv.2510.13285
[46]

arXiv preprint arXiv:2511.00617 , year=

Bigelow, Eric J. and Wurgaft, Daniel and Wang, YingQiao and Goodman, Noah D. and Ullman, Tomer D. and Tanaka, Hidenori and Lubana, Ekdeep Singh , year =. Belief. doi:10.48550/ARXIV.2511.00617 , journal =

work page doi:10.48550/arxiv.2511.00617