Fine-tuning aligned language models compromises safety, even when users do not intend to!

Qi, Xiangyu, Zeng, Yi, Xie, Tinghao, Chen, Pin-Yu, Jia, Ruoxi, Mittal, Prateek

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Steer Like the LLM: Activation Steering that Mimics Prompting

cs.CL · 2026-05-05 · unverdicted · novelty 7.0

PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.

citing papers explorer

Showing 1 of 1 citing paper.

Steer Like the LLM: Activation Steering that Mimics Prompting · cs.CL · 2026-05-05 · unverdicted · none · ref 40
PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.
