What Matters in Data for DPO?

What Matters in Data for DPO? · 2025 · arXiv 2508.18312

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Which Pairs to Compare for LLM Post-Training?

cs.AI · 2026-06-17 · unverdicted · novelty 7.0

Matching upper and lower bounds on the DPO policy optimality gap are derived; both depend on a single design-dependent information matrix that links pair selection to estimation error and suboptimality.

Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs

cs.CV · 2026-06-24 · unverdicted · novelty 6.0

ViPSy constructs policy-aligned, visually grounded preference pairs for VLMs using visual cues from image variants, yielding state-of-the-art hallucination reductions of 35.7% on AMBER and 24.5% on Object HalBench.

ξ-DPO: Direct Preference Optimization via Ratio Reward Margin

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

ξ-DPO rewrites the preference objective as minimizing distance to optimal margins and defines reward as a chosen-to-rejected ratio, yielding a bounded, interpretable margin ξ set directly from the initial reward-gap distribution.

citing papers explorer

Which Pairs to Compare for LLM Post-Training? · cs.AI · 2026-06-17 · unverdicted · none · ref 15
Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs · cs.CV · 2026-06-24 · unverdicted · none · ref 45
ξ-DPO: Direct Preference Optimization via Ratio Reward Margin · cs.LG · 2026-05-09 · unverdicted · none · ref 23
