Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators
3 Pith papers cite this work.
fields
cs.CL 3
representative citing papers
citing papers explorer
- Learning to summarize from human feedback
  Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
- Fine-Tuning Language Models from Human Preferences
  Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
- A Modular Task-oriented Dialogue System Using a Neural Mixture-of-Experts
  MTDS with TokenMoE improves inform rate by 8.1% and success rate by 0.8% over single-module baselines on a benchmark dataset.