Learning to summarize from human feedback
Pith reviewed 2026-05-18 01:41 UTC · model grok-4.3
The pith
Training summarization models to optimize a reward model learned from human preferences produces summaries that humans rate higher than both reference summaries and larger supervised models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. Applied to a version of the TL;DR dataset, our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone, and transfer to CNN/DM news articles without any news-specific fine-tuning.
What carries the argument
A reward model trained to predict which of two summaries a human would prefer, used as the objective for reinforcement learning to update the summarization policy.
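As a rough illustration of what "trained to predict which of two summaries a human would prefer" can mean in practice, the sketch below computes a pairwise logistic loss on the difference between two scalar scores. The function name and the example scores are hypothetical, and the scoring model itself is assumed to exist elsewhere; this is a minimal sketch of the loss shape, not the authors' training code.

```python
import math

def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred summary wins,
    under a logistic model of the score difference."""
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(diff)) == log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical scalar scores from a reward model for two candidate summaries.
loss_agree = pairwise_preference_loss(2.0, 0.5)     # model ranks the pair like the human
loss_disagree = pairwise_preference_loss(0.5, 2.0)  # model ranks the pair the other way
```

The loss is small when the model scores the human-preferred summary higher and grows roughly linearly with the margin when it scores the pair the wrong way round.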
If this is right
- Summaries from the human-preference-optimized policy are rated higher than human-written references on Reddit posts.
- The same policy produces summaries nearly as good as human references on news articles with no domain-specific training.
- The reward model trained on one dataset generalizes to new summarization datasets.
- Optimizing the learned reward produces summaries that humans judge better than those obtained by directly optimizing ROUGE.
- Human comparison data can replace or improve upon proxy metrics for training generation models.
Where Pith is reading between the lines
- The same preference-collection and reward-model approach could be applied to other open-ended generation tasks such as dialogue or creative writing.
- If the reward model misses certain human criteria, repeated optimization could amplify those gaps over time.
- Collecting more diverse human comparisons or using larger base models might further widen the gap over supervised baselines.
- This method offers a concrete way to align model output with nuanced human judgment rather than surface-level reference matching.
Load-bearing premise
The reward model trained on the collected human comparisons will continue to predict human preferences accurately on new summaries, and optimizing the policy against it will improve quality without introducing undetected biases or reward gaming.
What would settle it
A large blind human evaluation in which raters consistently prefer summaries from supervised fine-tuning or the original human references over the reinforcement-learning versions would show the central claim does not hold.
Original abstract
As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about -- summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that training summarization models to optimize for human preferences using reinforcement learning leads to significantly better summaries than supervised fine-tuning or human references. Specifically, on the TL;DR dataset, their RLHF models outperform human summaries and larger SFT models in human evaluations, and the approach transfers to CNN/DM news articles producing near-human quality summaries without domain-specific training. They also show the reward model generalizes and that RM optimization is preferred over ROUGE by humans.
Significance. This result, if substantiated, is significant for the field as it provides concrete evidence that human feedback can be used effectively to align language model outputs with desired qualities beyond what supervised learning achieves. The transfer results highlight the potential for generalizable preference models. The extensive analyses of the dataset and models add value by showing RM generalization and superiority to proxy metrics. These findings support shifting from proxy-based training to direct human preference optimization in NLP tasks.
major comments (2)
- [§4] §4 (Experiments and human evaluations): The reported outperformance over human references and SFT baselines in human preference judgments lacks details on statistical significance tests, precise evaluation data splits, and explicit controls for confounds such as summary length or stylistic artifacts. These elements are load-bearing for the central claim that RM-optimized policies yield genuinely superior quality.
- [§5] §5 (Analyses and generalization): The manuscript shows the reward model generalizes to new datasets and that RM optimization beats ROUGE per human judges, but provides insufficient direct tests (e.g., blinded held-out evaluations or adversarial examples) confirming that RL policy optimization does not exploit spurious correlations or introduce undetected gaming of the RM.
minor comments (2)
- [Abstract] The abstract could more explicitly preview the key analysis findings (RM generalization and ROUGE comparison) to improve standalone clarity.
- [Method] Notation in the reward model section would benefit from an explicit equation defining how comparison pairs are formatted as input.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback on our work. We address each major comment below in detail and indicate where revisions have been made to the manuscript.
Point-by-point responses
- Referee: [§4] §4 (Experiments and human evaluations): The reported outperformance over human references and SFT baselines in human preference judgments lacks details on statistical significance tests, precise evaluation data splits, and explicit controls for confounds such as summary length or stylistic artifacts. These elements are load-bearing for the central claim that RM-optimized policies yield genuinely superior quality.
  Authors: We agree that clearer reporting of statistical details and confound controls would strengthen the presentation of our human evaluation results. In the revised manuscript we have added bootstrap resampling to compute 95% confidence intervals and paired significance tests for all reported preference rates. We have also specified the exact held-out evaluation splits (distinct from both the supervised fine-tuning data and the reward model training comparisons). For length confounds we now report length distributions for all models and include a length-matched subset analysis showing that the preference advantage persists. Stylistic artifacts are inherently harder to isolate; we have added a qualitative discussion of observed stylistic differences and note this as a limitation. These changes directly address the load-bearing aspects of the central claim.
  Revision: yes
- Referee: [§5] §5 (Analyses and generalization): The manuscript shows the reward model generalizes to new datasets and that RM optimization beats ROUGE per human judges, but provides insufficient direct tests (e.g., blinded held-out evaluations or adversarial examples) confirming that RL policy optimization does not exploit spurious correlations or introduce undetected gaming of the RM.
  Authors: We acknowledge that explicit tests for reward hacking would provide additional reassurance. The original manuscript already demonstrates generalization of the reward model to CNN/DM and shows that human judges prefer RM-optimized summaries over ROUGE-optimized ones. In revision we have expanded the analysis section with further held-out evaluations on additional Reddit posts and an examination of common gaming indicators (e.g., length inflation, repetition). We did not include dedicated adversarial example suites in the original work; such tests would require new data collection and are noted as future work. The transfer results and human preference data provide supporting evidence against severe exploitation, but we agree that stronger direct tests would be valuable.
  Revision: partial
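The bootstrap confidence intervals mentioned in the first response can be sketched roughly as follows. This is a minimal percentile-bootstrap example over binary win/loss outcomes; the function name, the resample count, and the 61-out-of-100 data are made up for illustration and do not come from the paper or the rebuttal.

```python
import random

def bootstrap_preference_ci(wins, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the fraction of comparisons won.

    wins: list of 0/1 outcomes, 1 meaning the rater preferred the policy summary.
    Returns (observed_rate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(wins)
    rates = []
    for _ in range(n_resamples):
        # Resample comparisons with replacement and recompute the win rate.
        sample = [wins[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(sample) / n)
    rates.sort()
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(wins) / n, lower, upper

# Made-up illustrative data: 61 wins out of 100 comparisons.
outcomes = [1] * 61 + [0] * 39
rate, lower, upper = bootstrap_preference_ci(outcomes)
```

If the lower bound stays above 0.5, the preference for the policy summaries is unlikely to be a sampling artifact at the chosen level; a paired test across raters or posts would be a separate check.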
Circularity Check
No significant circularity in the derivation chain
The paper collects an independent dataset of human summary comparisons on TL;DR posts, trains a reward model to predict preferences from these labels, and applies RL (PPO) to optimize a policy against the resulting reward. Final quality claims rest on fresh human preference judgments collected separately from the training comparisons, plus transfer experiments on CNN/DM without task-specific fine-tuning. No equation or step equates a prediction to its own training input by construction, no uniqueness theorem is imported from self-citations to force the method, and no fitted parameter is relabeled as an out-of-sample prediction. The derivation remains empirically grounded in distinct human data at each stage.
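The final stage, optimizing a policy against the learned reward with PPO, is commonly paired in this line of work with a penalty that keeps the policy close to the supervised model. The sketch below assumes that formulation; the page above does not state the exact form, and the coefficient `beta`, the function name, and the numbers are hypothetical placeholders.

```python
def kl_penalized_reward(rm_score, logprob_policy, logprob_sft, beta=0.05):
    """Reward-model score minus a penalty for drifting from the supervised policy.

    logprob_policy / logprob_sft: log-probability of the sampled summary under
    the RL policy and under the supervised baseline. beta is illustrative only.
    """
    return rm_score - beta * (logprob_policy - logprob_sft)

# Hypothetical numbers: the policy assigns its own sample much higher probability
# than the supervised model does, so part of the reward-model score is removed.
shaped = kl_penalized_reward(rm_score=1.2, logprob_policy=-40.0, logprob_sft=-55.0)
```

A penalty of this kind is one way to limit the reward gaming discussed under the load-bearing premise, since summaries that score well only by moving far from the supervised distribution earn less.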
Axiom & Free-Parameter Ledger
free parameters (2)
- reward model weights
- RL policy parameters
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: optimizing our reward model results in better summaries than optimizing ROUGE according to humans
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- ORPO: Monolithic Preference Optimization without Reference Model
  ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
- Discovering Latent Knowledge in Language Models Without Supervision
  An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
- Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
  Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
- Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
  Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.
- Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
  Approximate Next Policy Sampling approximates the next policy's state distribution during training to enable larger safe policy updates in deep RL, demonstrated by SV-PPO matching or exceeding standard PPO on Atari an...
- The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
  An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
- Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
  COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...
- Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
  LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
- Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
  LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
- A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
  MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
- SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
  SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
- "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
  Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
- Aligning Text-to-Image Models using Human Feedback
  A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.
- Efficient Training of Language Models to Fill in the Middle
  Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
- Scaling Laws and Interpretability of Learning from Repeated Data
  Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- Scaling Laws for Transfer
  Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
- Failure Modes of Maximum Entropy RLHF
  Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.
- A Survey of Large Language Models
  This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.