GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.
Disentangling length from quality in direct preference optimization
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment.
citing papers explorer
-
General Preference Reinforcement Learning
GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.
-
RVPO: Risk-Sensitive Alignment via Variance Regularization
RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment.