pith. machine review for the scientific record.

arxiv: 2605.03907 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

Steer Like the LLM: Activation Steering that Mimics Prompting

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:14 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords activation steering · prompt steering · token-specific interventions · language model control · inference-time editing · PSR models · steering benchmarks

The pith

Models that learn token-specific steering strengths from prompts outperform standard activation methods on control tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that prompting steers language models by applying large changes to some tokens and almost none to others, whereas common activation techniques apply the same strength everywhere. It introduces Prompt Steering Replacement models trained to predict those varying strengths directly from activations so they can copy the selective pattern. This matters because activation steering runs faster and requires no input rewriting, yet it has lagged prompting; matching the selectivity could make lightweight control reliable. Experiments across three benchmarks and several models show the new models beat prior activation baselines, especially when filtering for high-coherence outputs, and match prompting on persona and AxBench tasks. The central insight is that the performance gap stems from mismatched intervention patterns rather than from the steering vectors themselves.

Core claim

Prompt steering applies strong interventions on certain tokens while leaving others nearly untouched. Existing activation methods apply a fixed coefficient across all tokens. Prompt Steering Replacement models are small networks trained on prompt-steered versus baseline activations to output per-token scaling factors, thereby reproducing the selective intervention pattern and achieving stronger steering results than fixed-coefficient baselines.
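In symbols, the contrast is a single coefficient versus a per-token function of the activation. A hedged reconstruction (the steering vector v, the coefficient α, and the predictor α_i(·) are our notation, following the A_{y_i} convention used in the paper's figures):

```latex
% Fixed-coefficient activation steering: one strength for every token.
A_{y_i}\big|_{\mathrm{AS}} = A_{y_i} + \alpha\, v

% PSR-style steering: a coefficient predicted per token from the activation itself.
A_{y_i}\big|_{\mathrm{PSR}} = A_{y_i} + \alpha_i\!\left(A_{y_i}\right) v
```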

What carries the argument

Prompt Steering Replacement (PSR) models: lightweight predictors that read layer activations and output token-specific coefficients to scale a steering vector.
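A minimal sketch of such a predictor, assuming a PyTorch-style decoder and hook API (the two-layer MLP shape, module sizes, and attachment point are our assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class PSRHead(nn.Module):
    """Toy PSR-style predictor: maps each token's activation to a scalar
    coefficient that scales a fixed steering vector."""
    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: (batch, seq, d_model) -> per-token coefficients: (batch, seq, 1)
        return self.net(acts)

d_model = 3072                            # e.g. Llama-3.2-3B hidden size
steer_vec = torch.randn(d_model)          # stand-in for an extracted steering vector
steer_vec = steer_vec / steer_vec.norm()
psr = PSRHead(d_model)

def steering_hook(module, inputs, output):
    """Forward hook for a decoder layer: add a per-token-scaled steering vector."""
    hidden = output[0] if isinstance(output, tuple) else output
    alpha = psr(hidden)                   # varies per token, unlike a fixed coefficient
    steered = hidden + alpha * steer_vec  # broadcasts over the hidden dimension
    return (steered, *output[1:]) if isinstance(output, tuple) else steered

# Hypothetical attachment point; real layer objects depend on the model:
# some_decoder_layer.register_forward_hook(steering_hook)
```

Training would then regress these coefficients against the prompt-steering interventions Δ_PS observed on successful prompt-steered completions (Figure 1); a fixed-coefficient baseline is the special case where alpha is a constant.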

If this is right

  • PSR models achieve higher success rates than prior activation steering on three standard benchmarks.
  • They maintain higher output coherence when results are filtered for quality.
  • On AxBench and persona steering tasks they reach performance levels comparable to direct prompting.
  • The token-specific approach works across multiple language models without architecture changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The gap between prompting and activation steering may shrink further if PSR training data includes more diverse prompt strategies.
  • Similar per-token prediction could be applied to other control objectives such as safety or style transfer.
  • One could test whether PSR models remain effective when the underlying steering vector is itself learned rather than hand-crafted.
  • This framing suggests activation interventions can be made more expressive without increasing inference cost.

Load-bearing premise

The performance difference between prompting and activation steering is caused mainly by the lack of token-specific strength variation, and training on observed prompt behavior will transfer to new tasks and models.

What would settle it

Evaluate PSR models on a fresh steering benchmark or an unseen language model and measure whether they still outperform fixed-coefficient activation steering and match prompting.
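A minimal harness for that test, entirely as a sketch (generate_with_method and trait_judge are hypothetical stubs standing in for the unseen model and an attribute judge such as the LLM judge of Figures 10 and 11):

```python
from statistics import mean

def generate_with_method(prompt: str, method: str) -> str:
    # Stub: would call the unseen model steered by `method`.
    return f"<completion of {prompt!r} via {method}>"

def trait_judge(prompt: str, completion: str) -> float:
    # Stub: would return an indicator/score of target-trait expression.
    return 0.0

def evaluate(method: str, prompts: list[str]) -> float:
    return mean(trait_judge(p, generate_with_method(p, method)) for p in prompts)

held_out_prompts = ["example prompt 1", "example prompt 2"]    # fresh benchmark
methods = ["prompting", "fixed_coefficient_steering", "psr_steering"]
results = {m: evaluate(m, held_out_prompts) for m in methods}
# The premise survives if psr_steering tracks prompting and beats the
# fixed-coefficient baseline on this held-out distribution.
```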

Figures

Figures reproduced from arXiv:2605.03907 by Frederik Vandeputte and Geert Heyman.

Figure 1. Illustration of how prompt steering interventions Δ_PS can be computed by subtracting prompt-steered activations from the corresponding unsteered activations (left and center). Prompt Steering Replacement (PSR) models approximate these interventions, but only on cases where prompt steering successfully elicits the target attribute (right).
Figure 2. Strength of prompt steering interventions on Llama-3.2-3B, layer 16, across token positions (x-axis) for randomly sampled completions that are prompt-steered towards sycophancy (y-axis).
Figure 3. Relative RMSE between the accumulated interventions of prompt steering (Δ_PS^acc) versus those of other steering methods (Δ_X^acc), averaged on prompt steering predictions on the Persona Vectors sycophantic evaluation data for Llama-3.2-3B.
Figure 4. Prompt intervention magnitude across layers, measured locally (per-layer) and accumulatively, in both absolute and relative terms. The relative magnitudes are computed by dividing by the norm of the prompt-steered activations ‖A_PS‖.
Figure 5. Heatmaps of prompt intervention magnitude per token across layers for different traits on randomly selected examples. (a)–(c) show local interventions ‖Δ_PS^loc‖; (d)–(f) show accumulative interventions ‖Δ_PS^acc‖, normalized per layer by dividing by their total sum to better reveal the effect in early layers.
Figure 6. Example 1, layer 16: intervention strength ‖Δ_X(·)‖₂ per token for different steering methods on a sycophantic evaluation example (Llama-3.2-3B-Instruct).
Figure 7. Example 1, layer 26: intervention strength ‖Δ_X(·)‖₂ per token for different steering methods on a sycophantic evaluation example (Llama-3.2-3B-Instruct).
Figure 8. Example 2, layer 16: intervention strength ‖Δ_X(·)‖₂ per token for different steering methods on a sycophantic evaluation example (Llama-3.2-3B-Instruct).
Figure 9. Example 2, layer 26: intervention strength ‖Δ_X(·)‖₂ per token for different steering methods on a sycophantic evaluation example (Llama-3.2-3B-Instruct).
Figure 10. Prompt template for the coherence judge J_coher (Chen et al., 2025) used for the Persona Vectors benchmark.
Figure 11. Slight variation on the coherence judge prompt template from Chen et al. (2025), used on the IFEval benchmark.
Figure 12. Trait alignment on the Persona Vectors dataset after steering Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Qwen2.5-7B-Instruct. Trait alignments for all steering methods are computed at prompt steering coherence.
Figure 13. Trait alignment-coherence curves for different steering methods on the Persona Vectors dataset. Some curves have a region where trait alignment and coherence both go down; this points to oversteering (i.e., when α values are set too high for a method).
Figure 14. Relative RMSE between activations produced by prompt steering versus other steering methods, averaged on prompt steering predictions on the Persona Vectors evaluation data for (a) Llama-3.2-3B-Instruct, (b) Llama-3.1-8B-Instruct, and (c) Qwen2.5-7B-Instruct.
Figure 15. Cosine similarity between steering vectors learned by different methods for sycophancy on Llama-3.2-3B-Instruct.
read the original abstract

Large language models can be steered at inference time through prompting or activation interventions, but activation steering methods often underperform compared to prompt-based approaches. We propose a framework that formulates prompt steering as a form of activation steering and investigates whether distilling successful prompt steering behavior into simpler, interpretable models can close this gap. Our analysis reveals that popular activation steering methods are not faithful to the mechanics of prompt steering, which applies strong interventions on some tokens while barely affecting others. Based on these insights, we introduce Prompt Steering Replacement (PSR) models that estimate token-specific steering coefficients from the activations themselves and are trained to imitate prompt-based interventions. Experiments on three steering benchmarks across multiple language models show that PSR models outperform existing activation steering methods, especially when controlling for high-coherence completions, and also compare favorably to prompting on AxBench and persona steering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Prompt Steering Replacement (PSR) models that estimate token-specific steering coefficients from activations to mimic the behavior of prompt-based steering in LLMs. It argues that existing activation steering methods fail to replicate the uneven, token-dependent interventions of prompting, and demonstrates through experiments on three steering benchmarks across multiple models that PSR models outperform prior activation methods, particularly under coherence controls, and perform comparably to prompting on tasks like AxBench and persona steering.

Significance. If the results hold and the PSR approach generalizes, this could provide a practical way to achieve prompt-like steering effects through activation interventions, improving both performance and interpretability in LLM control. The work highlights a key mismatch in intervention mechanics and offers a data-driven way to bridge prompting and activation steering.

major comments (3)
  1. [Methods (PSR training)] The training data for the PSR models must be explicitly described with respect to overlap with the three steering benchmarks used in evaluation. If the prompts or activations used for training PSR are drawn from the same distributions as the test sets, the reported outperformance may reflect memorization of token-specific patterns rather than learning a general mapping from activations to prompt-like steering coefficients, undermining the claim of capturing the mechanics of prompt steering.
  2. [Experiments and Results] The abstract and results claim superior performance 'especially when controlling for high-coherence completions,' but without details on how coherence is measured, the exact quantitative improvements (e.g., effect sizes, standard errors), and statistical significance tests in the relevant tables or figures, it is difficult to evaluate whether the gap is robust or driven by the token-specific estimation.
  3. [Analysis of intervention strength] The paper identifies that prompt steering applies strong interventions on some tokens while barely affecting others as the key difference from existing methods. However, there is no ablation study isolating whether training a per-token estimator (vs. uniform) is sufficient to close the gap, or if other aspects of the PSR architecture contribute; this is load-bearing for attributing the success to mimicking the token-specific mechanics.
minor comments (2)
  1. [Notation] The definition of the PSR model architecture and how coefficients are applied during inference could be clarified with pseudocode or an equation in the methods section to improve reproducibility.
  2. [References] Ensure all prior activation steering methods mentioned (e.g., in the comparison) are cited with specific papers.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Methods (PSR training)] The training data for the PSR models must be explicitly described with respect to overlap with the three steering benchmarks used in evaluation. If the prompts or activations used for training PSR are drawn from the same distributions as the test sets, the reported outperformance may reflect memorization of token-specific patterns rather than learning a general mapping from activations to prompt-like steering coefficients, undermining the claim of capturing the mechanics of prompt steering.

    Authors: We agree that explicit description of the training data is essential to rule out memorization concerns. The PSR models were trained on activations and prompt-steering pairs generated from a held-out corpus of general instruction-following and persona-related prompts that do not overlap with the AxBench, persona steering, or coherence-controlled evaluation sets. In the revised manuscript we will add a dedicated subsection under Methods that lists the exact data sources, sizes, and generation procedures, along with a statement confirming zero prompt/activation overlap with the test distributions. This will substantiate that PSR learns a general activation-to-coefficient mapping rather than benchmark-specific patterns. revision: yes

  2. Referee: [Experiments and Results] The abstract and results claim superior performance 'especially when controlling for high-coherence completions,' but without details on how coherence is measured, the exact quantitative improvements (e.g., effect sizes, standard errors), and statistical significance tests in the relevant tables or figures, it is difficult to evaluate whether the gap is robust or driven by the token-specific estimation.

    Authors: We acknowledge that the current presentation lacks sufficient quantitative detail. In the revision we will expand the Experiments section to: (1) specify the coherence metric (the LLM coherence judge J_coher of Chen et al. (2025), whose prompt templates are shown in Figures 10 and 11), (2) report effect sizes, standard errors, and p-values from paired statistical tests for all PSR vs. baseline comparisons under the high-coherence filter, and (3) update the relevant tables and figures with these statistics. These additions will allow readers to assess the robustness of the reported gains independently of the token-specific modeling choice. revision: yes

  3. Referee: [Analysis of intervention strength] The paper identifies that prompt steering applies strong interventions on some tokens while barely affecting others as the key difference from existing methods. However, there is no ablation study isolating whether training a per-token estimator (vs. uniform) is sufficient to close the gap, or if other aspects of the PSR architecture contribute; this is load-bearing for attributing the success to mimicking the token-specific mechanics.

    Authors: We appreciate the request for a targeted ablation. While the performance advantage of PSR over existing uniform-coefficient methods already suggests that token-specific estimation is the primary driver, an explicit controlled comparison would strengthen the causal claim. In the revised manuscript we will add an ablation that trains a uniform-coefficient variant of the PSR architecture (i.e., a single scalar per layer instead of per-token coefficients) and directly compares it to the full per-token PSR on the same benchmarks. This will isolate the contribution of the token-specific estimator while holding other architectural elements fixed. revision: yes
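The mechanics of that comparison reduce to one line in a per-token implementation. A sketch in our own notation (an illustration, not the authors' protocol, which trains a uniform variant rather than averaging predictions): collapsing the predicted coefficients to their per-sequence mean removes token-specificity while holding everything else fixed.

```python
import torch

def apply_steering(hidden: torch.Tensor, coeffs: torch.Tensor,
                   steer_vec: torch.Tensor, per_token: bool = True) -> torch.Tensor:
    """hidden: (batch, seq, d); coeffs: (batch, seq, 1); steer_vec: (d,)."""
    if not per_token:
        # Uniform ablation: one scalar per sequence, same predictor and vector.
        coeffs = coeffs.mean(dim=1, keepdim=True)
    return hidden + coeffs * steer_vec

batch, seq, d = 1, 8, 3072
hidden = torch.randn(batch, seq, d)
coeffs = torch.rand(batch, seq, 1)        # stand-in for PSR predictions
v = torch.randn(d)
v = v / v.norm()
steered_per_token = apply_steering(hidden, coeffs, v, per_token=True)
steered_uniform = apply_steering(hidden, coeffs, v, per_token=False)
```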

Circularity Check

0 steps flagged

No significant circularity; empirical training and evaluation on held-out benchmarks

full rationale

The paper trains PSR models on data to imitate observed prompt-based token-specific interventions and evaluates the resulting models on separate steering benchmarks (AxBench, persona steering, and coherence-controlled tasks) across multiple LLMs. This is a standard supervised learning setup with held-out test distributions rather than any derivation that reduces the claimed performance gains to fitted parameters of the target claim itself or to self-referential definitions. The analysis identifying mismatches in existing activation methods is presented as an empirical observation of intervention strength differences, not a definitional equivalence. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked that would collapse the results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that prompt steering can be faithfully approximated by per-token activation interventions learned from data, plus the empirical claim that existing methods fail to match this pattern.

axioms (1)
  • domain assumption Prompt steering applies strong interventions on some tokens while barely affecting others.
    Stated in the abstract as the key insight motivating PSR.
invented entities (1)
  • Prompt Steering Replacement (PSR) models · no independent evidence
    purpose: Estimate token-specific steering coefficients from activations to imitate prompt interventions.
    New model class introduced to close the gap between prompting and activation steering.

pith-pipeline@v0.9.0 · 5442 in / 1251 out tokens · 45705 ms · 2026-05-07T16:14:50.564417+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 24 canonical work pages · 5 internal anchors

  1. Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C. D., Potts, C. AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. ICML 2025.
  2. Chen, R., Arditi, A., Sleight, H., Evans, O., Lindsey, J. Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv:2507.21509, 2025.
  3. Wang, H., Cao, B., Cao, Y., Chen, J. … arXiv:2502.04556, 2025.
  4. Nguyen, D., Prasad, A., Stengel-Eskin, E., Bansal, M. Multi-… ACL 2025. doi:10.18653/v1/2025.acl-long.1007.
  5. Sun, J., Baskaran, S., Wu, Z., Sklar, M., Potts, C., Geiger, A. … arXiv:2506.03292, 2025.
  6. Wu, Z., Yu, Q., Arora, A., Manning, C. D., Potts, C. Improved…
  7. Oozeer, N., Marks, L., Barez, F., Abdullah, A. Beyond… Findings of the…
  8. Dherin, B., Munn, M., Mazzawi, H., Wunder, M., Gonzalvo, J. Learning without training: The implicit dynamics of in-context learning. arXiv:2507.16003, 2025.
  9. Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., Wang, S., Yin, D., Du, M. Explainability for Large Language Models: A Survey. arXiv:2309.01029, 2023.
  10. Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J., et al. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405, 2023.
  11. Circuit… Transformer Circuits Thread.
  12. Stolfo, A., Balachandran, V., Yousefi, S., Horvitz, E., Nushi, B. Improving Instruction-Following in Language Models through Activation Steering.
  13. Zeng, Z., Yu, J., Gao, T., Meng, Y., Goyal, T., Chen, D. Evaluating Large Language Models at Evaluating Instruction Following. ICLR 2024.
  14. Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., Hou, L. Instruction-Following Evaluation for Large Language Models. arXiv:2311.07911, 2023.
  15. Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 Technical Report. arXiv preprint, 2024.
  16. The Llama 3 Herd of Models. arXiv:2407.21783, 2024.
  17. Subramani, N., Suresh, N., Peters, M. E. Extracting Latent Steering Vectors from Pretrained Language Models. Findings of ACL 2022. doi:10.18653/v1/2022.findings-acl.48.
  18. Turner, A. M., Thiergart, L., Udell, D., Leech, G., Mini, U., MacDiarmid, M. Steering Language Models With Activation Engineering. arXiv:2308.10248, 2023.
  19. Li, K., Patel, O., Viégas, F. B., Pfister, H., Wattenberg, M. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. NeurIPS 2023.
  20. Wu, M., Liu, W., Wang, X., Li, T., Lv, C., Ling, Z., Zhu, J., Zhang, C., Zheng, X., Huang, X. Advancing… ACL 2024. doi:10.18653/v1/2024.acl-long.726.
  21. Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., Liu, R. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. ICLR 2020.
  22. Liu, S., Ye, H., Xing, L., Zou, J. Y. In-Context Vectors: Making In-Context Learning More Effective and Controllable Through Latent Space Steering. ICML 2024.
  23. Marks, S., Tegmark, M. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv:2310.06824, 2023.
  24. van der Weij, T., Poesio, M., Schoots, N. Extending activation steering to broad skills and multiple behaviours. arXiv:2403.05767, 2024.
  25. Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., Turner, A. M. Steering Llama 2 via Contrastive Activation Addition. ACL 2024. doi:10.18653/v1/2024.acl-long.828.
  26. Giulianelli, M., Harding, J., Mohnert, F., Hupkes, D., Zuidema, W. H. Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information. EMNLP Workshop BlackboxNLP, 2018. doi:10.18653/v1/W18-5426.
  27. Bau, D., Zhu, J.-Y., Strobelt, H., Zhou, B., Tenenbaum, J. B., Freeman, W. T., Torralba, A. Visualizing and…
  28. Soulos, P., McCoy, R. T., Linzen, T., Smolensky, P. Discovering the Compositional Structure of Vector Representations with Role Learning Networks. EMNLP Workshop BlackboxNLP, 2020. doi:10.18653/v1/2020.blackboxnlp-1.23.
  29. Besserve, M., Mehrjou, A., Sun, R., Schölkopf, B. Counterfactuals uncover the modular structure of deep generative models. ICLR 2020.
  30. Wang, W., Yang, J., Peng, W. Semantics-…
  31. Li, X. L., Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. ACL 2021. doi:10.18653/v1/2021.acl-long.353.
  32. Lester, B., Al-Rfou, R., Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021. doi:10.18653/v1/2021.emnlp-main.243.
  33. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I. Language Models are Unsupervised Multitask Learners. 2019.
  34. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language Models are Few-Shot Learners. NeurIPS 2020.
  35. Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. EMNLP 2020. doi:10.18653/v1/2020.emnlp-main.346.
  36. Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., Ba, J. Large Language Models are Human-Level Prompt Engineers. ICLR 2023.
  37. Causal Abstraction: … Journal of Machine Learning Research, 2025.
  38. Foundational… Transactions on Machine Learning Research.
  39. Wu, Z., Arora, A., Wang, Z., Geiger, A., Jurafsky, D., Manning, C. D., Potts, C. ReFT: Representation Finetuning for Language Models. NeurIPS 2024.
  40. Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., Henderson, P. Fine-tuning aligned language models compromises safety, even when users do not intend to! ICLR 2024.
  41. Park, K., Choe, Y. J., Veitch, V. The Linear Representation Hypothesis and the Geometry of Large Language Models. ICML 2024.
  42. Wang, H., Wang, G., Zhang, H. Steering… CVPR 2025. doi:10.1109/CVPR52734.2025.02787.
  43. CoRR, 2021.
  44. Hedström, A., Amoukou, S. I., Bewley, T., Mishra, S., Veloso, M. To… ICML 2025.
  45. Vogels, A., Wong, B., Choho, Y., Blangero, A., Bhan, M. In-… arXiv:2510.13285, 2025.
  46. Bigelow, E. J., Wurgaft, D., Wang, Y., Goodman, N. D., Ullman, T. D., Tanaka, H., Lubana, E. S. Belief… arXiv:2511.00617, 2025.