DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
baseline 1
method 1
citation-polarity summary
years
2023 2verdicts
ACCEPT 2representative citing papers
EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.
citing papers explorer
-
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
-
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.