Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul F Christiano · 2020

3 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-a-Judge prompting in iterative DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 0613 on the AlpacaEval 2.0 leaderboard.

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

cs.LG · 2023-04-13 · unverdicted · novelty 5.0

RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.

A tutorial on learning from preferences and choices with Gaussian Processes

cs.LG · 2024-03-18 · unverdicted · novelty 3.0

Tutorial on a GP-based framework for preference and choice learning that unifies random utility models, limits of discernment, and multi-utility scenarios via customized likelihoods for object and label preferences.

citing papers explorer

Showing 3 of 3 citing papers.

Self-Rewarding Language Models · cs.CL · 2024-01-18 · conditional · none · ref 115
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment · cs.LG · 2023-04-13 · unverdicted · none · ref 53
A tutorial on learning from preferences and choices with Gaussian Processes · cs.LG · 2024-03-18 · unverdicted · none · ref 120
