Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White · 2024 · cs.CL · arXiv 2402.13228

Canonical reference. 75% of citing Pith papers cite this work as background.

21 Pith papers citing it

Background: 75% of classified citations

abstract

Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. Furthermore, we find that the DPOP-tuned model outperforms the DPO-tuned model (all else equal) on benchmarks independent of the fine-tuning data, such as MT-Bench. Finally, using DPOP, we create and open-source Smaug-34B and Smaug-72B, with the latter becoming the first open-source LLM to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.
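The failure mode described in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes the standard DPO loss on sequence log-probabilities and a DPOP-style variant that subtracts a penalty whenever the policy's log-probability of the preferred completion falls below the reference model's. The function names, the values `beta=0.1` and `lam=5.0`, and the example log-probabilities are all assumptions chosen for illustration; the exact form of the penalty should be checked against the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    pol_* are policy log-probs, ref_* are reference log-probs;
    w = preferred completion, l = dispreferred completion.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def dpop_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1, lam=5.0):
    """DPOP-style loss (assumed form): DPO margin minus a penalty that is
    positive only when the preferred log-prob drops below the reference."""
    penalty = max(0.0, ref_w - pol_w)
    margin = (pol_w - ref_w) - (pol_l - ref_l) - lam * penalty
    return -log_sigmoid(beta * margin)


# Hypothetical log-probabilities under the frozen reference model.
REF_W, REF_L = -10.0, -12.0

# At initialisation the policy equals the reference.
start = (REF_W, REF_L)
# After some updates BOTH completions became less likely, but the
# dispreferred one dropped further, so the relative margin grew.
drifted = (-14.0, -20.0)

dpo_start = dpo_loss(*start, REF_W, REF_L)
dpo_drifted = dpo_loss(*drifted, REF_W, REF_L)
dpop_start = dpop_loss(*start, REF_W, REF_L)
dpop_drifted = dpop_loss(*drifted, REF_W, REF_L)

print(f"DPO : start={dpo_start:.3f}  drifted={dpo_drifted:.3f}")
print(f"DPOP: start={dpop_start:.3f}  drifted={dpop_drifted:.3f}")
```

In this sketch the DPO loss decreases for the drifted policy even though the preferred completion has become less likely, whereas the penalised variant's loss increases. When the policy's preferred log-probability stays at or above the reference, the penalty is zero and the two losses coincide.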

citation-role summary

background: 6 · baseline: 1 · dataset: 1

citation-polarity summary

background: 6 · baseline: 1 · use dataset: 1

representative citing papers

The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interventions that enhance performance.

IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.

Visual Preference Optimization with Rubric Rewards

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs

cs.CL · 2026-01-07 · unverdicted · novelty 7.0

DiffCoT applies diffusion-style iterative denoising to chain-of-thought steps with a causal noise schedule, outperforming standard CoT optimization methods on multi-step reasoning benchmarks.

Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

cs.MM · 2026-05-11 · unverdicted · novelty 6.0

LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

TPAW uses teams of current and historical model checkpoints that collaborate and compete, plus adaptive weightings for responses and players, to improve self-supervised LLM alignment and outperform baselines.

Segment-Aligned Policy Optimization for Multi-Modal Reasoning

cs.AI · 2026-05-02 · unverdicted · novelty 6.0

SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.

GroupDPO: Memory efficient Group-wise Direct Preference Optimization

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

Future Policy Approximation for Offline Reinforcement Learning Improves Mathematical Reasoning

cs.CL · 2025-09-24 · unverdicted · novelty 6.0

Future Policy Approximation (FPA) improves offline RL for LLM mathematical reasoning by extrapolating future policies in logit space to proactively reweight gradients, yielding consistent gains over DPO, RPO, KTO and vanilla offline RL while matching online RL accuracy at far lower compute cost.

Mitigating Object Hallucinations via Sentence-Level Early Intervention

cs.CV · 2025-07-16 · conditional · novelty 6.0

SENTINEL reduces MLLM object hallucinations by over 90% via sentence-level early intervention with detector-bootstrapped preference data and C-DPO loss, outperforming prior SOTA on hallucination and capability benchmarks.

The Differences Between Direct Alignment Algorithms are a Blur

cs.LG · 2025-02-03 · unverdicted · novelty 6.0

A controlled unification of direct alignment algorithms shows the ranking objective (pairwise vs pointwise) drives alignment quality more than the scalar score optimized.

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

cs.CL · 2024-11-15 · conditional · novelty 6.0

Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

cs.CL · 2024-06-03 · conditional · novelty 6.0

MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.

Generating Place-Based Compromises Between Two Points of View

cs.CL · 2026-04-27 · unverdicted · novelty 5.0

Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.

From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents

cs.AI · 2026-04-25 · unverdicted · novelty 5.0

AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

cs.CV · 2024-07-03 · conditional · novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

Learning to Reason at the Frontier of Learnability

cs.LG · 2025-02-17 · unverdicted · novelty 4.0

A curriculum sampling questions with high variance in success rate improves reinforcement learning performance for LLM reasoning tasks.

Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

cs.LG · 2026-05-08 · unverdicted · novelty 3.0

Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.

