Toward Preference-aligned Large Language Models via Residual-based Model Steering

Andrea Tagarelli; Lucio La Cava

arxiv: 2509.23982 · v2 · pith:4DNH3RPZnew · submitted 2025-09-28 · 💻 cs.CL · cs.AI · cs.CY · cs.LG · cs.NE

classification 💻 cs.CL · cs.AI · cs.CY · cs.LG · cs.NE

keywords models · preference · palrs · alignment · language · large · llms · optimization

Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approaches such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) typically require curated data and expensive optimization over billions of parameters, and ultimately produce persistent task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, compared to models aligned with DPO and SimPO, PaLRS-aligned models perform better while requiring substantially less time. Our findings highlight that PaLRS is an effective, much more efficient, and flexible alternative to standard preference optimization pipelines, providing a training-free, plug-and-play mechanism for alignment with minimal data.
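
The abstract does not say how the steering vectors are computed or at which layers they are applied. As a rough illustration only, the sketch below uses a difference-of-means construction, a common technique in activation steering, which is an assumption here and not necessarily the PaLRS procedure. The function names (`mean_vector`, `steering_vector`, `steer`), the scaling factor `alpha`, and the toy activations are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Average a set of equally sized activation vectors, dimension by dimension."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(preferred: Sequence[Sequence[float]],
                    dispreferred: Sequence[Sequence[float]]) -> Vector:
    """Difference of means: mean(preferred) - mean(dispreferred).

    Assumption: each row is a residual-stream activation collected for one
    response of a preference pair. The source does not confirm this recipe.
    """
    mu_pos = mean_vector(preferred)
    mu_neg = mean_vector(dispreferred)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden: Sequence[float], vector: Sequence[float], alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations standing in for real residual-stream states (hypothetical data).
preferred = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
dispreferred = [[0.0, 1.0, 2.0], [2.0, 1.0, 2.0]]

v = steering_vector(preferred, dispreferred)
steered = steer([0.5, 0.5, 0.5], v, alpha=2.0)
print(v, steered)
```

In a real model the activations would come from the residual stream of a chosen layer and the shift would be added during the forward pass; no weights are updated, which is what makes this style of method training-free.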

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Convex Optimization for Alignment and Preference Learning on a Single GPU
cs.LG 2026-05 unverdicted novelty 6.0

COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models...