DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
baseline 1
method 1
citation-polarity summary
years
2023 2verdicts
ACCEPT 2representative citing papers
EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.
citing papers explorer
-
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.