hub

Is dpo superior to ppo for llm alignment? a comprehensive study

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu · 2024 · arXiv 2404.10719

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 method 1

citation-polarity summary

background 2 use method 1

representative citing papers

Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

DPOFusion uses direct preference optimization on property-aligned and preference-controllable latent diffusion models to produce adaptive infrared-visible image fusions aligned with heterogeneous human and machine vision demands.

Retrieval Augmented Conversational Recommendation with Reinforcement Learning

cs.IR · 2026-04-06 · unverdicted · novelty 7.0

RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.

Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.

Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

cs.CL · 2025-07-08 · unverdicted · novelty 6.0

REFORM uses reward-guided controlled decoding to generate adversarial failures and augments training data to improve reward model robustness on preference datasets.

The Differences Between Direct Alignment Algorithms are a Blur

cs.LG · 2025-02-03 · unverdicted · novelty 6.0

A controlled unification of direct alignment algorithms shows the ranking objective (pairwise vs pointwise) drives alignment quality more than the scalar score optimized.

HybridFlow: A Flexible and Efficient RLHF Framework

cs.LG · 2024-09-28 · unverdicted · novelty 6.0

HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

cs.AI · 2026-05-18 · unverdicted · novelty 5.0

Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

cs.LG · 2026-02-05 · unverdicted · novelty 5.0 · 2 refs

PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.

Reinforcement Learning for LLM Post-Training: A Survey

cs.CL · 2024-07-23 · unverdicted · novelty 3.0

A survey deriving a unified policy gradient framework for LLM post-training methods and providing technical comparisons of PPO, GRPO, DPO variants.

citing papers explorer

Showing 11 of 11 citing papers.

Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization cs.CV · 2026-05-07 · unverdicted · none · ref 50
DPOFusion uses direct preference optimization on property-aligned and preference-controllable latent diffusion models to produce adaptive infrared-visible image fusions aligned with heterogeneous human and machine vision demands.
Retrieval Augmented Conversational Recommendation with Reinforcement Learning cs.IR · 2026-04-06 · unverdicted · none · ref 62
RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation cs.LG · 2026-05-12 · unverdicted · none · ref 24
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion cs.AI · 2026-05-12 · unverdicted · none · ref 11 · 2 links
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient cs.LG · 2026-04-28 · unverdicted · none · ref 94
Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.
Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling cs.CL · 2025-07-08 · unverdicted · none · ref 33
REFORM uses reward-guided controlled decoding to generate adversarial failures and augments training data to improve reward model robustness on preference datasets.
The Differences Between Direct Alignment Algorithms are a Blur cs.LG · 2025-02-03 · unverdicted · none · ref 43
A controlled unification of direct alignment algorithms shows the ranking objective (pairwise vs pointwise) drives alignment quality more than the scalar score optimized.
HybridFlow: A Flexible and Efficient RLHF Framework cs.LG · 2024-09-28 · unverdicted · none · ref 93
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction cs.AI · 2026-05-18 · unverdicted · none · ref 46
Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.
Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution cs.LG · 2026-02-05 · unverdicted · none · ref 34 · 2 links
PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.
Reinforcement Learning for LLM Post-Training: A Survey cs.CL · 2024-07-23 · unverdicted · none · ref 53
A survey deriving a unified policy gradient framework for LLM post-training methods and providing technical comparisons of PPO, GRPO, DPO variants.

Is dpo superior to ppo for llm alignment? a comprehensive study

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer