Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Stefano Ermon, Chelsea Finn, Christopher D. Manning, Eric Mitchell, Rafael Rafailov, Archit Sharma · 2023 · Advances in Neural Information Processing Systems 36 · DOI 10.52202/075280-2338

3 Pith papers cite this work, alongside 110 external citations. Polarity classification is still in progress.

110 external citations · Crossref

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Using frontier models to synthesize plausible-but-wrong FIM completions as hard negatives for SFT improves Delulu exact match by +18.8 and edit similarity by +0.22 on Qwen2.5-Coder-7B while also lifting HumanEval-Infilling and SAFIM.

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

SuperMemory-VQA provides 4,853 human-verified QA pairs from 52.9 hours of egocentric AI glasses recordings to benchmark AI systems on realistic long-horizon memory tasks including an unanswerable option.

Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.

citing papers explorer

Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation
cs.LG · 2026-06-02 · unverdicted · none · ref 11
SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory
cs.CV · 2026-05-30 · unverdicted · none · ref 42
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
cs.LG · 2026-05-14 · unverdicted · none · ref 68
