Unintentional unalignment: Likelihood displacement in direct preference optimization. arXiv preprint arXiv:2410.08847

Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin · arXiv 2410.08847

1 Pith paper cites this work. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interventions that enhance performance.

citing papers explorer

Showing 1 of 1 citing paper.

The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits · cs.LG · 2026-05-09 · unverdicted · none · ref 15
