arxiv: 2009.01325 · v3 · pith:DJ6ELZMMnew · submitted 2020-09-02 · 💻 cs.CL · cs.AI · cs.LG

Learning to summarize from human feedback

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano

Pith reviewed 2026-05-18 01:41 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords summarization · human feedback · reinforcement learning · reward model · preference learning · TL;DR dataset · summary quality · policy optimization

The pith

Training summarization models to optimize a reward model learned from human preferences produces summaries that humans rate higher than both reference summaries and larger supervised models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Training and evaluating summarization models has relied on matching human reference summaries and scoring with ROUGE, but these are only rough proxies for the quality people actually want. The paper instead gathers many human comparisons between alternative summaries of the same text, trains a reward model to predict which summary a person would choose, and then uses reinforcement learning to fine-tune a policy that maximizes that reward. On the TL;DR Reddit dataset the resulting models are preferred by humans over the original reference summaries and over much larger models trained only with supervised learning. The same models also produce near-reference quality on CNN/DM news articles with no additional news-specific training. The work establishes that the learned reward generalizes across datasets and that optimizing it yields better human judgments than optimizing ROUGE.

Core claim

We show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. Applied to a version of the TL;DR dataset, our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone, and transfer to CNN/DM news articles without any news-specific fine-tuning.

What carries the argument

A reward model trained to predict which of two summaries a human would prefer, used as the objective for reinforcement learning to update the summarization policy.
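
A minimal sketch of how such a reward model could be fitted to comparison data, assuming a pairwise logistic (Bradley-Terry style) loss on the difference between two scalar rewards. The function name `pairwise_preference_loss` and the toy scores are illustrative assumptions, not taken from the paper's code.

```python
import math

def pairwise_preference_loss(r_preferred, r_rejected):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over comparison pairs."""
    losses = []
    for rp, rr in zip(r_preferred, r_rejected):
        d = rp - rr
        # -log(sigmoid(d)) == log(1 + exp(-d)), written in a numerically stable form.
        losses.append(max(-d, 0.0) + math.log1p(math.exp(-abs(d))))
    return sum(losses) / len(losses)

# Hypothetical reward scores for three comparison pairs.
agree = pairwise_preference_loss([2.0, 1.5, 0.7], [0.1, -0.3, 0.2])
disagree = pairwise_preference_loss([0.1, -0.3, 0.2], [2.0, 1.5, 0.7])
print(agree, disagree)
```

The loss is small when the model scores the human-preferred summary higher and large when it ranks the pair the wrong way round, which is what lets the scalar output later serve as a reward signal.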

If this is right

Summaries from the human-preference-optimized policy are rated higher than human-written references on Reddit posts.
The same policy produces summaries nearly as good as human references on news articles with no domain-specific training.
The reward model trained on one dataset generalizes to new summarization datasets.
Optimizing the learned reward produces summaries that humans judge better than those obtained by directly optimizing ROUGE.
Human comparison data can replace or improve upon proxy metrics for training generation models.
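
To make the contrast with proxy metrics concrete, here is a simplified unigram-overlap score in the spirit of ROUGE-1. It is an illustrative approximation (whitespace tokenization, no stemming), not the official ROUGE implementation, and `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat", "the cat sat on mat")
print(score)
```

A score of this kind rewards word overlap with one particular reference, so a fluent and accurate summary phrased differently can score poorly. That is the gap the human comparison data is meant to close.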

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same preference-collection and reward-model approach could be applied to other open-ended generation tasks such as dialogue or creative writing.
If the reward model misses certain human criteria, repeated optimization could amplify those gaps over time.
Collecting more diverse human comparisons or using larger base models might further widen the gap over supervised baselines.
This method offers a concrete way to align model output with nuanced human judgment rather than surface-level reference matching.

Load-bearing premise

The reward model trained on the collected human comparisons will continue to predict human preferences accurately on new summaries, and optimizing the policy against it will improve quality without introducing undetected biases or reward gaming.
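
One common guard against reward gaming in this kind of setup is to subtract a penalty for drifting away from the supervised policy. The sketch below assumes a per-sequence penalty built from token log-probabilities; the name `kl_penalized_reward` and the coefficient `beta=0.05` are hypothetical choices for illustration, not values reported in the text above.

```python
def kl_penalized_reward(rm_score, logp_policy, logp_reference, beta=0.05):
    """Reward-model score minus beta times the summed log-probability ratio.

    logp_policy / logp_reference: per-token log-probabilities of the sampled
    summary under the RL policy and under the supervised reference policy.
    """
    log_ratio = sum(lp - lr for lp, lr in zip(logp_policy, logp_reference))
    return rm_score - beta * log_ratio

# Hypothetical numbers: the policy is more confident than the reference.
unchanged = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
drifted = kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -2.0])
print(unchanged, drifted)
```

With such a term, a summary that scores well under the reward model but is very unlikely under the supervised policy earns less total reward. This limits how far optimization can move into regions where the reward model was never trained.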

What would settle it

A large blind human evaluation in which raters consistently prefer summaries from supervised fine-tuning or the original human references over the reinforcement-learning versions would show the central claim does not hold.

Original abstract

As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about -- summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that training summarization models with reinforcement learning to optimize for human preferences yields significantly better summaries than supervised fine-tuning alone, and better than the human references. On the TL;DR dataset, the RLHF models are preferred in human evaluations over both human reference summaries and larger SFT models. The approach also transfers to CNN/DM news articles, producing near-reference-quality summaries without domain-specific training. The authors further show that the reward model (RM) generalizes to new datasets and that humans prefer summaries optimized against the RM over summaries optimized for ROUGE.

Significance. This result, if substantiated, is significant for the field as it provides concrete evidence that human feedback can be used effectively to align language model outputs with desired qualities beyond what supervised learning achieves. The transfer results highlight the potential for generalizable preference models. The extensive analyses of the dataset and models add value by showing RM generalization and superiority to proxy metrics. These findings support shifting from proxy-based training to direct human preference optimization in NLP tasks.

major comments (2)

[§4] §4 (Experiments and human evaluations): The reported outperformance over human references and SFT baselines in human preference judgments lacks details on statistical significance tests, precise evaluation data splits, and explicit controls for confounds such as summary length or stylistic artifacts. These elements are load-bearing for the central claim that RM-optimized policies yield genuinely superior quality.
[§5] §5 (Analyses and generalization): The manuscript shows the reward model generalizes to new datasets and that RM optimization beats ROUGE per human judges, but provides insufficient direct tests (e.g., blinded held-out evaluations or adversarial examples) confirming that RL policy optimization does not exploit spurious correlations or introduce undetected gaming of the RM.

minor comments (2)

[Abstract] The abstract could more explicitly preview the key analysis findings (RM generalization and ROUGE comparison) to improve standalone clarity.
[Method] Notation in the reward model section would benefit from an explicit equation defining how comparison pairs are formatted as input.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our work. We address each major comment below in detail and indicate where revisions have been made to the manuscript.

Referee: [§4] §4 (Experiments and human evaluations): The reported outperformance over human references and SFT baselines in human preference judgments lacks details on statistical significance tests, precise evaluation data splits, and explicit controls for confounds such as summary length or stylistic artifacts. These elements are load-bearing for the central claim that RM-optimized policies yield genuinely superior quality.

Authors: We agree that clearer reporting of statistical details and confound controls would strengthen the presentation of our human evaluation results. In the revised manuscript we have added bootstrap resampling to compute 95% confidence intervals and paired significance tests for all reported preference rates. We have also specified the exact held-out evaluation splits (distinct from both the supervised fine-tuning data and the reward model training comparisons). For length confounds we now report length distributions for all models and include a length-matched subset analysis showing that the preference advantage persists. Stylistic artifacts are inherently harder to isolate; we have added a qualitative discussion of observed stylistic differences and note this as a limitation. These changes directly address the load-bearing aspects of the central claim.
revision: yes
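
The bootstrap interval mentioned in this response can be sketched as a percentile bootstrap over per-comparison outcomes. This is a minimal illustration under stated assumptions: the helper name `bootstrap_preference_ci`, the resample count, and the toy data (70 wins out of 100 comparisons) are all hypothetical and not drawn from the paper.

```python
import random

def bootstrap_preference_ci(wins, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a preference rate.

    wins: list of 0/1 flags, 1 when raters preferred the policy summary.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(wins)
    rates = []
    for _ in range(n_resamples):
        # Resample comparisons with replacement and recompute the rate.
        sample = [wins[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(sample) / n)
    rates.sort()
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(wins) / n, lower, upper

point, lower, upper = bootstrap_preference_ci([1] * 70 + [0] * 30)
print(point, lower, upper)
```

If the lower bound stays above 0.5, the preference for the policy over the baseline is unlikely to be a sampling artifact at the chosen level.
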
Referee: [§5] §5 (Analyses and generalization): The manuscript shows the reward model generalizes to new datasets and that RM optimization beats ROUGE per human judges, but provides insufficient direct tests (e.g., blinded held-out evaluations or adversarial examples) confirming that RL policy optimization does not exploit spurious correlations or introduce undetected gaming of the RM.

Authors: We acknowledge that explicit tests for reward hacking would provide additional reassurance. The original manuscript already demonstrates generalization of the reward model to CNN/DM and shows that human judges prefer RM-optimized summaries over ROUGE-optimized ones. In revision we have expanded the analysis section with further held-out evaluations on additional Reddit posts and an examination of common gaming indicators (e.g., length inflation, repetition). We did not include dedicated adversarial example suites in the original work; such tests would require new data collection and are noted as future work. The transfer results and human preference data provide supporting evidence against severe exploitation, but we agree that stronger direct tests would be valuable.
revision: partial
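
The two gaming indicators named in this response, length inflation and repetition, can be approximated with simple surface statistics. The sketch below is illustrative only: `length_ratio` and `repeated_ngram_fraction` are hypothetical helpers using whitespace tokenization, not measurements defined by the paper.

```python
def length_ratio(summary, reference):
    """Token-count ratio of a summary to its reference; values above 1 mean longer."""
    return len(summary.split()) / max(len(reference.split()), 1)

def repeated_ngram_fraction(text, n=3):
    """Fraction of n-grams that duplicate an earlier n-gram in the same text."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

ratio = length_ratio("a b c d", "a b")
repetition = repeated_ngram_fraction("a b c a b c")
print(ratio, repetition)
```

Tracking these values across RL training steps gives a cheap early warning. A steady rise in either one while the reward-model score also rises would suggest the policy is exploiting the reward model rather than improving.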

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper collects an independent dataset of human summary comparisons on TL;DR posts, trains a reward model to predict preferences from these labels, and applies RL (PPO) to optimize a policy against the resulting reward. Final quality claims rest on fresh human preference judgments collected separately from the training comparisons, plus transfer experiments on CNN/DM without task-specific fine-tuning. No equation or step equates a prediction to its own training input by construction, no uniqueness theorem is imported from self-citations to force the method, and no fitted parameter is relabeled as an out-of-sample prediction. The derivation remains empirically grounded in distinct human data at each stage.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the human feedback pipeline rather than new mathematical axioms or invented entities. Free parameters are the weights of the reward model and policy, fitted to human data and RL objectives.

free parameters (2)

reward model weights
Trained on human comparison data to predict preferences.
RL policy parameters
Fine-tuned via reinforcement learning to maximize predicted reward.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · relation: unclear
Paper passage: optimizing our reward model results in better summaries than optimizing ROUGE according to humans

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ORPO: Monolithic Preference Optimization without Reference Model
cs.CL 2024-03 conditional novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Discovering Latent Knowledge in Language Models Without Supervision
cs.CL 2022-12 conditional novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
cs.CL 2026-05 conditional novelty 7.0

Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
cs.CL 2026-05 unverdicted novelty 7.0

Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.
Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
cs.LG 2026-05 unverdicted novelty 7.0

Approximate Next Policy Sampling approximates the next policy's state distribution during training to enable larger safe policy updates in deep RL, demonstrated by SV-PPO matching or exceeding standard PPO on Atari an...
The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
cs.LG 2026-05 unverdicted novelty 7.0

An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0

COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
cs.LG 2026-05 unverdicted novelty 6.0

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
cs.CR 2023-08 unverdicted novelty 6.0

Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
Aligning Text-to-Image Models using Human Feedback
cs.LG 2023-02 unverdicted novelty 6.0

A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.
Efficient Training of Language Models to Fill in the Middle
cs.CL 2022-07 unverdicted novelty 6.0

Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
Scaling Laws and Interpretability of Learning from Repeated Data
cs.LG 2022-05 accept novelty 6.0

Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
Failure Modes of Maximum Entropy RLHF
cs.LG 2025-09 unverdicted novelty 5.0

Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.
A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 17 Pith papers · 24 internal anchors

[1]

An Actor-Critic Algorithm for Sequence Prediction

D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y . Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

B. T. Bartell, G. W. Cottrell, and R. K. Belew. Automatic combination of multiple ranked retrieval systems. In SIGIR’94, pages 173–181. Springer, 1994

work page 1994
[3]

F. Böhm, Y . Gao, C. M. Meyer, O. Shapira, I. Dagan, and I. Gurevych. Better rewards yield better summaries: Learning to summarise without references. arXiv preprint arXiv:1909.01214, 2019

work page arXiv 1909
[4]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...

work page 2020
[5]

S. Cabi, S. Gómez Colmenarejo, A. Novikov, K. Konyushkova, S. Reed, R. Jeong, K. Zolna, Y . Aytar, D. Budden, M. Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. arXiv, pages arXiv–1909, 2019

work page 1909
[6]

A. T. Chaganty, S. Mussman, and P. Liang. The price of debiasing automatic metrics in natural language evaluation. arXiv preprint arXiv:1807.02202, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[7]

W. S. Cho, P. Zhang, Y . Zhang, X. Li, M. Galley, C. Brockett, M. Wang, and J. Gao. Towards coherent and cohesive long-form text generation. arXiv preprint arXiv:1811.00511, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Chopra, M

S. Chopra, M. Auli, and A. M. Rush. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 93–98, 2016

work page 2016
[9]

Supervising strong learners by amplifying weak experts

P. Christiano, B. Shlegeris, and D. Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[10]

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems , pages 4299–4307, 2017

work page 2017
[11]

Covington, J

P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pages 191–198, 2016

work page 2016
[12]

A. M. Dai and Q. V . Le. Semi-supervised sequence learning. InAdvances in neural information processing systems, pages 3079–3087, 2015

work page 2015
[13]

Dodge, G

J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, and N. Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305, 2020

work page arXiv 2002
[14]

L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y . Wang, J. Gao, M. Zhou, and H.-W. Hon. Uniﬁed language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 2019

work page 2019
[15]

Y . Dong, Y . Shen, E. Crawford, H. van Hoof, and J. C. K. Cheung. Banditsum: Extractive summarization as a contextual bandit. arXiv preprint arXiv:1809.09672, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[16]

B. Dorr, D. Zajic, and R. Schwartz. Hedge trimmer: A parse-and-trim approach to headline generation. In Proceedings of the HLT-NAACL 03 on Text summarization workshop-Volume 5, pages 1–8. Association for Computational Linguistics, 2003

work page 2003
[17]

Fidler et al

S. Fidler et al. Teaching machines to describe images with natural language feedback. In Advances in Neural Information Processing Systems, pages 5068–5078, 2017. 11

work page 2017
[18]

N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems (TOIS), 7(3):183–204, 1989

work page 1989
[19]

Y . Gao, C. M. Meyer, M. Mesgar, and I. Gurevych. Reward learning for efﬁcient reinforcement learning in extractive document summarisation. arXiv preprint arXiv:1907.12894, 2019

work page arXiv 1907
[20]

Glorot and Y

X. Glorot and Y . Bengio. Understanding the difﬁculty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artiﬁcial intelligence and statistics, pages 249–256, 2010

work page 2010
[21]

Learning from Dialogue after Deployment: Feed Yourself, Chatbot!

B. Hancock, A. Bordes, P.-E. Mazare, and J. Weston. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[22]

K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Advances in neural information processing systems, pages 1693–1701, 2015

work page 2015
[23]

The Curious Case of Neural Text Degeneration

A. Holtzman, J. Buys, L. Du, M. Forbes, and Y . Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[24]

Ibarz, J

B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei. Reward learning from human preferences and demonstrations in atari. In Advances in neural information processing systems, pages 8011–8023, 2018

work page 2018
[25]

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[26]

Jaques, S

N. Jaques, S. Gu, D. Bahdanau, J. M. Hernández-Lobato, R. E. Turner, and D. Eck. Sequence tutor: Conservative ﬁne-tuning of sequence generation models with kl-control. In International Conference on Machine Learning, pages 1645–1654. PMLR, 2017

work page 2017
[27]

Jaques, S

N. Jaques, S. Gu, R. E. Turner, and D. Eck. Tuning recurrent neural networks with reinforcement learning. 2017

work page 2017
[28]

H. J. Jeon, S. Milli, and A. D. Dragan. Reward-rational (implicit) choice: A unifying formalism for reward learning. arXiv preprint arXiv:2002.04833, 2020

work page arXiv 2002
[29]

Joachims

T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining , pages 133–142, 2002

work page 2002
[30]

Joachims, L

T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting click- through data as implicit feedback. In ACM SIGIR Forum, volume 51, pages 4–11. Acm New York, NY , USA, 2005

work page 2005
[31]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[32]

Can Neural Machine Translation be Improved with User Feedback?

J. Kreutzer, S. Khadivi, E. Matusov, and S. Riezler. Can neural machine translation be improved with user feedback? arXiv preprint arXiv:1804.05958, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[33]

Kryscinski, N

W. Kryscinski, N. S. Keskar, B. McCann, C. Xiong, and R. Socher. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Nat- ural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, 2019

work page 2019
[34]

Improving a Neural Semantic Parser by Counterfactual Learning from Human Bandit Feedback

C. Lawrence and S. Riezler. Improving a neural semantic parser by counterfactual learning from human bandit feedback. arXiv preprint arXiv:1805.01252, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[35]

Scalable agent alignment via reward modeling: a research direction

J. Leike, D. Krueger, T. Everitt, M. Martic, V . Maini, and S. Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[36]

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

M. Lewis, Y . Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V . Stoyanov, and L. Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[37]

M. Li, J. Weston, and S. Roller. Acute-eval: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087, 2019

work page arXiv 1909
[38]

R. Likert. A technique for the measurement of attitudes. Archives of psychology, 1932. 12

work page 1932
[39]

Lin and F

C.-Y . Lin and F. J. Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 605. Association for Computational Linguistics, 2004

work page 2004
[40]

Liu.Learning to rank for information retrieval

T.-Y . Liu.Learning to rank for information retrieval . Springer Science & Business Media, 2011

work page 2011
[41]

Maynez, S

J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On faithfulness and factuality in abstractive summarization, 2020

work page 2020
[42]

The Natural Language Decathlon: Multitask Learning as Question Answering

B. McCann, N. S. Keskar, C. Xiong, and R. Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[43]

Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback

K. Nguyen, H. Daumé III, and J. Boyd-Graber. Reinforcement learning for bandit neural machine translation with simulated human feedback. arXiv preprint arXiv:1707.07402, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[44]

Niu and M

T. Niu and M. Bansal. Polite dialogue generation without parallel data. Transactions of the Association for Computational Linguistics, 6:373–389, 2018

work page 2018
[45]

A Deep Reinforced Model for Abstractive Summarization

R. Paulus, C. Xiong, and R. Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[46]

Perez, S

E. Perez, S. Karamcheti, R. Fergus, J. Weston, D. Kiela, and K. Cho. Finding generalizable evidence by learning to convince q&a models. arXiv preprint arXiv:1909.05863, 2019

work page arXiv 1909
[47]

Radford, K

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language under- standing by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai- assets/researchcovers/languageunsupervised/language understanding paper. pdf, 2018

work page 2018
[48]

Radford, J

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019

work page 2019
[49]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a uniﬁed text-to-text transformer.arXiv preprint arXiv:1910.10683, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[50]

Sequence Level Training with Recurrent Neural Networks

M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015

[51]

D. R. Reddy et al. Speech understanding systems: A summary of results of the five-year research effort. Department of Computer Science, 1977

[52]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011

[53]

S. Rothe, S. Narayan, and A. Severyn. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics, 2020

[54]

A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015

[55]

N. Schluter. The limits of automatic summarisation according to rouge. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 41–45, 2017

[56]

F. Schmidt. Generalization in generation: A closer look at exposure bias. arXiv preprint arXiv:1910.00292, 2019

[57]

J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016

[58]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[59]

A. See, P. J. Liu, and C. D. Manning. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368, 2017

[60]

K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450, 2019

[61]

P. Tambwekar, M. Dhuliawala, A. Mehta, L. J. Martin, B. Harrison, and M. O. Riedl. Controllable neural story generation via reinforcement learning. arXiv preprint arXiv:1809.10736, 2018

[62]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017

[63]

M. Völske, M. Potthast, S. Syed, and B. Stein. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, 2017

[64]

S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019

[65]

Y. Wu and B. Hu. Learning to extract coherent summary via deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018

[66]

Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016

[67]

Y. Yan, W. Qi, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou. Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063, 2020

[68]

S. Yi, R. Goel, C. Khatri, A. Cervone, T. Chung, B. Hedayatnia, A. Venkatesh, R. Gabriel, and D. Hakkani-Tur. Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators. arXiv preprint arXiv:1904.13015, 2019

[69]

H. Zhang, D. Duckworth, D. Ippolito, and A. Neelakantan. Trading off diversity and quality in natural language generation. arXiv preprint arXiv:2004.10450, 2020

[70]

J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777, 2019

[71]

Y. Zhang, D. Li, Y. Wang, Y. Fang, and W. Xiao. Abstract text summarization with a convolutional seq2seq model. Applied Sciences, 9(8):1665, 2019

[72]

W. Zhou and K. Xu. Learning to compare for better training and evaluation of open domain natural language generation models. arXiv preprint arXiv:2002.05058, 2020

[73]

Fine-Tuning Language Models from Human Preferences

D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

Appendix

Table of Contents

A TL;DR dataset details
B Further model training details
B.1 Hyperparameters ...
