pith. machine review for the scientific record.

arxiv: 2009.01325 · v3 · pith:DJ6ELZMMnew · submitted 2020-09-02 · 💻 cs.CL · cs.AI · cs.LG

Learning to summarize from human feedback

Pith reviewed 2026-05-18 01:41 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords summarization · human feedback · reinforcement learning · reward model · preference learning · TL;DR dataset · summary quality · policy optimization

The pith

Training summarization models to optimize a reward model learned from human preferences produces summaries that humans rate higher than both reference summaries and larger supervised models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Training and evaluating summarization models has relied on matching human reference summaries and scoring with ROUGE, but these are only rough proxies for the quality people actually want. The paper instead gathers many human comparisons between alternative summaries of the same text, trains a reward model to predict which summary a person would choose, and then uses reinforcement learning to fine-tune a policy that maximizes that reward. On the TL;DR Reddit dataset the resulting models are preferred by humans over the original reference summaries and over much larger models trained only with supervised learning. The same models also produce near-reference quality on CNN/DM news articles with no additional news-specific training. The work establishes that the learned reward generalizes across datasets and that optimizing it yields better human judgments than optimizing ROUGE.
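To make the "rough proxy" point concrete, here is a minimal sketch of a unigram-overlap score in the spirit of ROUGE-1. It is a simplification written for illustration (whitespace tokenization, no stemming), not the official ROUGE implementation, and the function name and example sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1 (illustrative only, not official ROUGE)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical examples: one paraphrase, one word-order scramble that flips the meaning.
reference = "the dog bit the man"
paraphrase = "a dog attacked a man"
scrambled = "the man bit the dog"

paraphrase_score = rouge1_f1(paraphrase, reference)
scrambled_score = rouge1_f1(scrambled, reference)
```

In this toy case the scrambled sentence reuses every reference word and so scores higher than the paraphrase, even though it reverses who bit whom. That is the kind of mismatch between surface overlap and quality that motivates learning a reward from human comparisons.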

Core claim

We show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. Applied to a version of the TL;DR dataset, our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone, and transfer to CNN/DM news articles without any news-specific fine-tuning.

What carries the argument

A reward model trained to predict which of two summaries a human would prefer, used as the objective for reinforcement learning to update the summarization policy.
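A minimal sketch of such a pairwise objective follows. It assumes the reward model emits one scalar score per summary and that the probability of a human preferring one summary is the logistic function of the score difference. The function name and the example scores are hypothetical and are not taken from the paper's code.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred summary wins.

    Assumes P(preferred beats rejected) = sigmoid(score_preferred - score_rejected).
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scalar scores for two summaries of the same post.
loss_correct_order = pairwise_preference_loss(2.0, 0.5)
loss_wrong_order = pairwise_preference_loss(0.5, 2.0)
loss_tie = pairwise_preference_loss(1.0, 1.0)
```

The loss is small when the preferred summary already scores higher, equals log 2 when the two scores tie, and grows roughly linearly as the model ranks the pair in the wrong order.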

If this is right

  • Summaries from the human-preference-optimized policy are rated higher than human-written references on Reddit posts.
  • The same policy produces summaries nearly as good as human references on news articles with no domain-specific training.
  • The reward model trained on one dataset generalizes to new summarization datasets.
  • Optimizing the learned reward produces summaries that humans judge better than those obtained by directly optimizing ROUGE.
  • Human comparison data can replace or improve upon proxy metrics for training generation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same preference-collection and reward-model approach could be applied to other open-ended generation tasks such as dialogue or creative writing.
  • If the reward model misses certain human criteria, repeated optimization could amplify those gaps over time.
  • Collecting more diverse human comparisons or using larger base models might further widen the gap over supervised baselines.
  • This method offers a concrete way to align model output with nuanced human judgment rather than surface-level reference matching.

Load-bearing premise

The reward model trained on the collected human comparisons will continue to predict human preferences accurately on new summaries, and optimizing the policy against it will improve quality without introducing undetected biases or reward gaming.
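One simple way to probe this premise is to measure how often a scorer agrees with human labels on comparisons it was not trained on. The sketch below assumes comparisons are stored as (post, summary A, summary B, human choice) tuples. The `toy_score` function is a deliberately crude stand-in, not a trained reward model, and all data shown is made up for illustration.

```python
from typing import Callable, Sequence, Tuple

# (post, summary_a, summary_b, human_choice); human_choice is 0 for A, 1 for B.
Comparison = Tuple[str, str, str, int]


def agreement_rate(score: Callable[[str, str], float],
                   comparisons: Sequence[Comparison]) -> float:
    """Fraction of held-out comparisons where the scorer picks the human's choice."""
    if not comparisons:
        raise ValueError("need at least one comparison")
    hits = 0
    for post, summary_a, summary_b, human_choice in comparisons:
        # Ties go to summary A in this sketch.
        model_choice = 0 if score(post, summary_a) >= score(post, summary_b) else 1
        hits += int(model_choice == human_choice)
    return hits / len(comparisons)


def toy_score(post: str, summary: str) -> float:
    """Toy stand-in scorer: number of distinct summary words that appear in the post."""
    post_words = set(post.lower().split())
    return float(sum(1 for w in set(summary.lower().split()) if w in post_words))


# Hypothetical held-out comparisons.
held_out = [
    ("my cat knocked the lamp over", "cat knocked lamp over", "something happened", 0),
    ("we moved to a new city last year", "stuff", "moved to new city", 1),
    ("the exam was hard but fair", "exam was hard but fair", "hard exam", 1),
]
rate = agreement_rate(toy_score, held_out)
```

In the third comparison the toy scorer rewards copying while the hypothetical rater preferred the shorter summary, so agreement drops below 100%. A falling agreement rate on summaries produced by the optimized policy would be one warning sign that the premise is failing.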

What would settle it

A large blind human evaluation in which raters consistently prefer summaries from supervised fine-tuning or the original human references over the reinforcement-learning versions would show the central claim does not hold.
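One way to make "consistently prefer" operational is an exact sign test on the head-to-head counts. This is a sketch under the assumption that each comparison is an independent coin flip under the null hypothesis of no preference. The choice of test and the function name are illustrative and are not something the paper prescribes.

```python
from math import comb


def two_sided_sign_test(wins: int, trials: int) -> float:
    """Exact two-sided binomial p-value against a null preference rate of 50%."""
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("require trials > 0 and 0 <= wins <= trials")
    # Under p = 0.5 the distribution is symmetric, so double the smaller tail.
    k = min(wins, trials - wins)
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    return min(1.0, 2.0 * tail)


# Hypothetical counts: the baseline wins 9 of 10 blind comparisons.
p_lopsided = two_sided_sign_test(9, 10)
# Hypothetical counts: an even 5-5 split.
p_even = two_sided_sign_test(5, 10)
```

A small p-value in favor of the supervised or reference summaries, on a large blind sample, is the kind of outcome that would count against the central claim.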

Original abstract

As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about -- summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that fine-tuning summarization models with reinforcement learning against a reward model (RM) learned from human preferences yields significantly better summaries than supervised fine-tuning (SFT) alone or the human references. Specifically, on the TL;DR dataset, the RL-trained models beat both the human reference summaries and larger SFT models in human evaluations, and the approach transfers to CNN/DM news articles, producing summaries of near-reference quality without domain-specific training. The paper also shows that the reward model generalizes to new datasets and that humans prefer summaries optimized against the RM over summaries optimized for ROUGE.

Significance. This result, if substantiated, is significant for the field as it provides concrete evidence that human feedback can be used effectively to align language model outputs with desired qualities beyond what supervised learning achieves. The transfer results highlight the potential for generalizable preference models. The extensive analyses of the dataset and models add value by showing RM generalization and superiority to proxy metrics. These findings support shifting from proxy-based training to direct human preference optimization in NLP tasks.

major comments (2)
  1. [§4] §4 (Experiments and human evaluations): The reported outperformance over human references and SFT baselines in human preference judgments lacks details on statistical significance tests, precise evaluation data splits, and explicit controls for confounds such as summary length or stylistic artifacts. These elements are load-bearing for the central claim that RM-optimized policies yield genuinely superior quality.
  2. [§5] §5 (Analyses and generalization): The manuscript shows the reward model generalizes to new datasets and that RM optimization beats ROUGE per human judges, but provides insufficient direct tests (e.g., blinded held-out evaluations or adversarial examples) confirming that RL policy optimization does not exploit spurious correlations or introduce undetected gaming of the RM.
minor comments (2)
  1. [Abstract] The abstract could more explicitly preview the key analysis findings (RM generalization and ROUGE comparison) to improve standalone clarity.
  2. [Method] Notation in the reward model section would benefit from an explicit equation defining how comparison pairs are formatted as input.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our work. We address each major comment below in detail and indicate where revisions have been made to the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments and human evaluations): The reported outperformance over human references and SFT baselines in human preference judgments lacks details on statistical significance tests, precise evaluation data splits, and explicit controls for confounds such as summary length or stylistic artifacts. These elements are load-bearing for the central claim that RM-optimized policies yield genuinely superior quality.

    Authors: We agree that clearer reporting of statistical details and confound controls would strengthen the presentation of our human evaluation results. In the revised manuscript we have added bootstrap resampling to compute 95% confidence intervals and paired significance tests for all reported preference rates. We have also specified the exact held-out evaluation splits (distinct from both the supervised fine-tuning data and the reward model training comparisons). For length confounds we now report length distributions for all models and include a length-matched subset analysis showing that the preference advantage persists. Stylistic artifacts are inherently harder to isolate; we have added a qualitative discussion of observed stylistic differences and note this as a limitation. These changes directly address the load-bearing aspects of the central claim.

    revision: yes
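    For readers unfamiliar with the technique named in this response, a percentile bootstrap for a preference rate can be sketched as below. This is an illustration only: the rebuttal is simulated, the outcome data is made up, and the function name, resample count, and seed are assumptions rather than anything reported by the authors.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(outcomes: Sequence[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for the mean of 0/1 preference outcomes.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(outcomes)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample the comparisons with replacement and record the preference rate.
        resample = rng.choices(outcomes, k=n)
        means.append(sum(resample) / n)
    means.sort()
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = int((1 - alpha / 2) * n_resamples) - 1
    point = sum(outcomes) / n
    return point, means[lower_index], means[upper_index]


# Hypothetical data: 1 means the rater preferred the RL summary, 0 the baseline.
example_outcomes = [1] * 70 + [0] * 30
rate, lower, upper = bootstrap_ci(example_outcomes)
```

    An interval that excludes 0.5 suggests the preference is unlikely to be a sampling accident, although it says nothing about confounds such as summary length.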

  2. Referee: [§5] §5 (Analyses and generalization): The manuscript shows the reward model generalizes to new datasets and that RM optimization beats ROUGE per human judges, but provides insufficient direct tests (e.g., blinded held-out evaluations or adversarial examples) confirming that RL policy optimization does not exploit spurious correlations or introduce undetected gaming of the RM.

    Authors: We acknowledge that explicit tests for reward hacking would provide additional reassurance. The original manuscript already demonstrates generalization of the reward model to CNN/DM and shows that human judges prefer RM-optimized summaries over ROUGE-optimized ones. In revision we have expanded the analysis section with further held-out evaluations on additional Reddit posts and an examination of common gaming indicators (e.g., length inflation, repetition). We did not include dedicated adversarial example suites in the original work; such tests would require new data collection and are noted as future work. The transfer results and human preference data provide supporting evidence against severe exploitation, but we agree that stronger direct tests would be valuable.

    revision: partial
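    The two gaming indicators named in this response, length inflation and repetition, can be measured with very simple statistics. The sketch below is one hypothetical way to do so. The function names, the trigram default, and the sample strings are assumptions for illustration, not the authors' analysis code.

```python
from collections import Counter
from typing import List


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Share of word n-grams that repeat an earlier n-gram in the same text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def mean_length_ratio(candidates: List[str], baselines: List[str]) -> float:
    """Mean word count of candidate summaries divided by that of baseline summaries."""
    if not candidates or not baselines:
        raise ValueError("both lists must be non-empty")
    cand_mean = sum(len(s.split()) for s in candidates) / len(candidates)
    base_mean = sum(len(s.split()) for s in baselines) / len(baselines)
    if base_mean == 0:
        raise ValueError("baseline summaries contain no words")
    return cand_mean / base_mean


# Hypothetical policy outputs versus baseline summaries.
looping = repeated_ngram_fraction("a b c a b c")
clean = repeated_ngram_fraction("the cat sat on the mat")
inflation = mean_length_ratio(["a b c d", "a b"], ["a b", "a"])
```

    A length ratio well above 1 or a rising repetition fraction during RL training would be a cue to inspect samples by hand. Neither statistic alone shows that the reward model is being exploited.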

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper collects an independent dataset of human summary comparisons on TL;DR posts, trains a reward model to predict preferences from these labels, and applies RL (PPO) to optimize a policy against the resulting reward. Final quality claims rest on fresh human preference judgments collected separately from the training comparisons, plus transfer experiments on CNN/DM without task-specific fine-tuning. No equation or step equates a prediction to its own training input by construction, no uniqueness theorem is imported from self-citations to force the method, and no fitted parameter is relabeled as an out-of-sample prediction. The derivation remains empirically grounded in distinct human data at each stage.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the human feedback pipeline rather than new mathematical axioms or invented entities. Free parameters are the weights of the reward model and policy, fitted to human data and RL objectives.

free parameters (2)
  • reward model weights
    Trained on human comparison data to predict preferences.
  • RL policy parameters
    Fine-tuned via reinforcement learning to maximize predicted reward.

pith-pipeline@v0.9.0 · 5793 in / 1091 out tokens · 31681 ms · 2026-05-18T01:41:19.753132+00:00 · methodology



Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ORPO: Monolithic Preference Optimization without Reference Model

    cs.CL 2024-03 conditional novelty 8.0

    ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

  2. Discovering Latent Knowledge in Language Models Without Supervision

    cs.CL 2022-12 conditional novelty 8.0

    An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...

  3. Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

    cs.CL 2026-05 conditional novelty 7.0

    Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.

  4. Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.

  5. Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL

    cs.LG 2026-05 unverdicted novelty 7.0

    Approximate Next Policy Sampling approximates the next policy's state distribution during training to enable larger safe policy updates in deep RL, demonstrated by SV-PPO matching or exceeding standard PPO on Atari an...

  6. The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

    cs.LG 2026-05 unverdicted novelty 7.0

    An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.

  7. Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

  8. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.

  9. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

  10. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  11. SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.

  12. "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

    cs.CR 2023-08 unverdicted novelty 6.0

    Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.

  13. Aligning Text-to-Image Models using Human Feedback

    cs.LG 2023-02 unverdicted novelty 6.0

    A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.

  14. Efficient Training of Language Models to Fill in the Middle

    cs.CL 2022-07 unverdicted novelty 6.0

    Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.

  15. Scaling Laws and Interpretability of Learning from Repeated Data

    cs.LG 2022-05 accept novelty 6.0

    Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.

  16. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  17. Scaling Laws for Transfer

    cs.LG 2021-02 unverdicted novelty 6.0

    Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

  18. Failure Modes of Maximum Entropy RLHF

    cs.LG 2025-09 unverdicted novelty 5.0

    Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.

  19. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 17 Pith papers · 24 internal anchors

  1. [1]

    An Actor-Critic Algorithm for Sequence Prediction

    D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y . Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016

  2. [2]

    B. T. Bartell, G. W. Cottrell, and R. K. Belew. Automatic combination of multiple ranked retrieval systems. In SIGIR’94, pages 173–181. Springer, 1994

  3. [3]

    F. Böhm, Y . Gao, C. M. Meyer, O. Shapira, I. Dagan, and I. Gurevych. Better rewards yield better summaries: Learning to summarise without references. arXiv preprint arXiv:1909.01214, 2019

  4. [4]

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...

  5. [5]

    S. Cabi, S. Gómez Colmenarejo, A. Novikov, K. Konyushkova, S. Reed, R. Jeong, K. Zolna, Y . Aytar, D. Budden, M. Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. arXiv, pages arXiv–1909, 2019

  6. [6]

    A. T. Chaganty, S. Mussman, and P. Liang. The price of debiasing automatic metrics in natural language evaluation. arXiv preprint arXiv:1807.02202, 2018

  7. [7]

    W. S. Cho, P. Zhang, Y . Zhang, X. Li, M. Galley, C. Brockett, M. Wang, and J. Gao. Towards coherent and cohesive long-form text generation. arXiv preprint arXiv:1811.00511, 2018

  8. [8]

    Chopra, M

    S. Chopra, M. Auli, and A. M. Rush. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 93–98, 2016

  9. [9]

    Supervising strong learners by amplifying weak experts

    P. Christiano, B. Shlegeris, and D. Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575, 2018

  10. [10]

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems , pages 4299–4307, 2017

  11. [11]

    Covington, J

    P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pages 191–198, 2016

  12. [12]

    A. M. Dai and Q. V . Le. Semi-supervised sequence learning. InAdvances in neural information processing systems, pages 3079–3087, 2015

  13. [13]

    Dodge, G

    J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, and N. Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305, 2020

  14. [14]

    L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y . Wang, J. Gao, M. Zhou, and H.-W. Hon. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 2019

  15. [15]

    Y . Dong, Y . Shen, E. Crawford, H. van Hoof, and J. C. K. Cheung. Banditsum: Extractive summarization as a contextual bandit. arXiv preprint arXiv:1809.09672, 2018

  16. [16]

    B. Dorr, D. Zajic, and R. Schwartz. Hedge trimmer: A parse-and-trim approach to headline generation. In Proceedings of the HLT-NAACL 03 on Text summarization workshop-Volume 5, pages 1–8. Association for Computational Linguistics, 2003

  17. [17]

    Fidler et al

    S. Fidler et al. Teaching machines to describe images with natural language feedback. In Advances in Neural Information Processing Systems, pages 5068–5078, 2017. 11

  18. [18]

    N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems (TOIS), 7(3):183–204, 1989

  19. [19]

    Y . Gao, C. M. Meyer, M. Mesgar, and I. Gurevych. Reward learning for efficient reinforcement learning in extractive document summarisation. arXiv preprint arXiv:1907.12894, 2019

  20. [20]

    Glorot and Y

    X. Glorot and Y . Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010

  21. [21]

    Learning from Dialogue after Deployment: Feed Yourself, Chatbot!

    B. Hancock, A. Bordes, P.-E. Mazare, and J. Weston. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415, 2019

  22. [22]

    K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Advances in neural information processing systems, pages 1693–1701, 2015

  23. [23]

    The Curious Case of Neural Text Degeneration

    A. Holtzman, J. Buys, L. Du, M. Forbes, and Y . Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019

  24. [24]

    Ibarz, J

    B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei. Reward learning from human preferences and demonstrations in atari. In Advances in neural information processing systems, pages 8011–8023, 2018

  25. [25]

    Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

    N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019

  26. [26]

    Jaques, S

    N. Jaques, S. Gu, D. Bahdanau, J. M. Hernández-Lobato, R. E. Turner, and D. Eck. Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control. In International Conference on Machine Learning, pages 1645–1654. PMLR, 2017

  27. [27]

    Jaques, S

    N. Jaques, S. Gu, R. E. Turner, and D. Eck. Tuning recurrent neural networks with reinforcement learning. 2017

  28. [28]

    H. J. Jeon, S. Milli, and A. D. Dragan. Reward-rational (implicit) choice: A unifying formalism for reward learning. arXiv preprint arXiv:2002.04833, 2020

  29. [29]

    Joachims

    T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining , pages 133–142, 2002

  30. [30]

    Joachims, L

    T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting click- through data as implicit feedback. In ACM SIGIR Forum, volume 51, pages 4–11. Acm New York, NY , USA, 2005

  31. [31]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  32. [32]

    Can Neural Machine Translation be Improved with User Feedback?

    J. Kreutzer, S. Khadivi, E. Matusov, and S. Riezler. Can neural machine translation be improved with user feedback? arXiv preprint arXiv:1804.05958, 2018

  33. [33]

    Kryscinski, N

    W. Kryscinski, N. S. Keskar, B. McCann, C. Xiong, and R. Socher. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Nat- ural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, 2019

  34. [34]

    Improving a Neural Semantic Parser by Counterfactual Learning from Human Bandit Feedback

    C. Lawrence and S. Riezler. Improving a neural semantic parser by counterfactual learning from human bandit feedback. arXiv preprint arXiv:1805.01252, 2018

  35. [35]

    Scalable agent alignment via reward modeling: a research direction

    J. Leike, D. Krueger, T. Everitt, M. Martic, V . Maini, and S. Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018

  36. [36]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    M. Lewis, Y . Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V . Stoyanov, and L. Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019

  37. [37]

    M. Li, J. Weston, and S. Roller. Acute-eval: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087, 2019

  38. [38]

    R. Likert. A technique for the measurement of attitudes. Archives of psychology, 1932. 12

  39. [39]

    Lin and F

    C.-Y . Lin and F. J. Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 605. Association for Computational Linguistics, 2004

  40. [40]

    Liu.Learning to rank for information retrieval

    T.-Y . Liu.Learning to rank for information retrieval . Springer Science & Business Media, 2011

  41. [41]

    Maynez, S

    J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On faithfulness and factuality in abstractive summarization, 2020

  42. [42]

    The Natural Language Decathlon: Multitask Learning as Question Answering

    B. McCann, N. S. Keskar, C. Xiong, and R. Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018

  43. [43]

    Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback

    K. Nguyen, H. Daumé III, and J. Boyd-Graber. Reinforcement learning for bandit neural machine translation with simulated human feedback. arXiv preprint arXiv:1707.07402, 2017

  44. [44]

    Niu and M

    T. Niu and M. Bansal. Polite dialogue generation without parallel data. Transactions of the Association for Computational Linguistics, 6:373–389, 2018

  45. [45]

    A Deep Reinforced Model for Abstractive Summarization

    R. Paulus, C. Xiong, and R. Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017

  46. [46]

    Perez, S

    E. Perez, S. Karamcheti, R. Fergus, J. Weston, D. Kiela, and K. Cho. Finding generalizable evidence by learning to convince q&a models. arXiv preprint arXiv:1909.05863, 2019

  47. [47]

    Radford, K

    A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language under- standing by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai- assets/researchcovers/languageunsupervised/language understanding paper. pdf, 2018

  48. [48]

    Radford, J

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019

  49. [49]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.arXiv preprint arXiv:1910.10683, 2019

  50. [50]

    Sequence Level Training with Recurrent Neural Networks

    M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015

  51. [51]

    D. R. Reddy et al. Speech understanding systems: A summary of results of the five-year research effort. department of computer science, 1977

  52. [52]

    S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011

  53. [53]

    Rothe, S

    S. Rothe, S. Narayan, and A. Severyn. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics, 2020

  54. [54]

    A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015

  55. [55]

    Schluter

    N. Schluter. The limits of automatic summarisation according to rouge. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 41–45, 2017

  56. [56]

    F. Schmidt. Generalization in generation: A closer look at exposure bias. arXiv preprint arXiv:1910.00292, 2019

  57. [57]

    J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016

  58. [58]

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  59. [59]

    A. See, P. J. Liu, and C. D. Manning. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368, 2017

  60. [60]

    K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450, 2019

  61. [61]

    P. Tambwekar, M. Dhuliawala, A. Mehta, L. J. Martin, B. Harrison, and M. O. Riedl. Controllable neural story generation via reinforcement learning. arXiv preprint arXiv:1809.10736, 2018

  62. [62]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017

  63. [63]

    M. Völske, M. Potthast, S. Syed, and B. Stein. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, 2017

  64. [64]

    S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019

  65. [65]

    Y. Wu and B. Hu. Learning to extract coherent summary via deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018

  66. [66]

    Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016

  67. [67]

    Y. Yan, W. Qi, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063, 2020

  68. [68]

    S. Yi, R. Goel, C. Khatri, A. Cervone, T. Chung, B. Hedayatnia, A. Venkatesh, R. Gabriel, and D. Hakkani-Tur. Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators. arXiv preprint arXiv:1904.13015, 2019

  69. [69]

    H. Zhang, D. Duckworth, D. Ippolito, and A. Neelakantan. Trading off diversity and quality in natural language generation. arXiv preprint arXiv:2004.10450, 2020

  70. [70]

    J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777, 2019

  71. [71]

    Y. Zhang, D. Li, Y. Wang, Y. Fang, and W. Xiao. Abstract text summarization with a convolutional seq2seq model. Applied Sciences, 9(8):1665, 2019

  72. [72]

    W. Zhou and K. Xu. Learning to compare for better training and evaluation of open domain natural language generation models. arXiv preprint arXiv:2002.05058, 2020

  73. [73]

    D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

Appendix Table of Contents

A TL;DR dataset details
B Further model training details
B.1 Hyperparameters ...