Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
Learning to summarize with human feedback
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.
Tutorial on a GP-based framework for preference and choice learning that unifies random utility models, limits of discernment, and multi-utility scenarios via customized likelihoods for object and label preferences.
citing papers explorer
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.
-
A tutorial on learning from preferences and choices with Gaussian Processes
Tutorial on a GP-based framework for preference and choice learning that unifies random utility models, limits of discernment, and multi-utility scenarios via customized likelihoods for object and label preferences.