Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Adithya Bhaskar; Boris Hanin; Danqi Chen; Noam Razin; Sadhika Malladi; Sanjeev Arora

arXiv: 2410.08847 · v4 · pith:IXR4SN42new · submitted 2024-10-11 · cs.LG · cs.AI · cs.CL · stat.ML

Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin

keywords: responses, displacement, likelihood, preferred, CHES, model, preferences, probability

Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
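
To make the counter-intuitive phenomenon concrete, the following is a minimal sketch of the standard per-pair DPO loss, which depends only on the margin between the preferred and dispreferred log-probability ratios. The function name `dpo_loss`, the value `beta=0.1`, and all log-probabilities below are hypothetical values chosen for illustration; they are not taken from the paper's experiments. The point is only that the loss can go down while the log-probability of the preferred response also goes down, as long as the dispreferred response falls faster.

```python
import math

def dpo_loss(logp_pref, logp_disp, ref_logp_pref, ref_logp_disp, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    # Margin between the policy-vs-reference log-ratios of the two responses.
    margin = (logp_pref - ref_logp_pref) - (logp_disp - ref_logp_disp)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Hypothetical reference-model log-probabilities (illustration only).
ref_pref, ref_disp = -2.0, -2.5

# Start of training: the policy equals the reference model.
before = dpo_loss(-2.0, -2.5, ref_pref, ref_disp)

# Later: BOTH responses became less likely, the dispreferred one more so.
after = dpo_loss(-3.0, -6.0, ref_pref, ref_disp)

print(f"loss before: {before:.4f}, loss after: {after:.4f}")
```

In this toy setting the loss decreases even though the preferred log-probability dropped from -2.0 to -3.0; the probability mass that left both responses must go to some other response, and which one is the question the abstract addresses.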

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
cs.LG · 2026-05 · unverdicted · novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment
cs.CL · 2026-06 · unverdicted · novelty 6.0

The Piggyback Hypothesis attributes emergent misalignment to chat-template tokens piggybacking finetuned behavior; Token-Regularized Finetuning (TReFT) mitigates it by regularizing prefix token representations.

Multimodal Alignment and Preference Optimization for Zero-Shot Conditional RNA Generation
q-bio.BM · 2026-05 · unverdicted · novelty 4.0

Moirain models use multimodal SFT and DPO to generate novel RNA sequences with superior protein binding affinities in a zero-shot conditional setting.