Fine-Tuning Language Models from Human Preferences
Pith reviewed 2026-05-10 20:54 UTC · model grok-4.3
The pith
Language models can be fine-tuned via reinforcement learning on reward signals learned from human preference comparisons.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training a reward model on human pairwise comparisons of language-model outputs and then applying reinforcement learning with that reward model, pretrained language models can be fine-tuned to continue text in desired styles or to produce summaries that focus on relevant content from long documents.
What carries the argument
A reward model trained on human pairwise comparisons of model outputs, which supplies the scalar reward signal used by proximal policy optimization to update the language model parameters.
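A minimal sketch of how pairwise comparisons can become a training signal, assuming a Bradley-Terry style logistic loss. The text above does not state the exact loss the paper uses, so the loss form and the names `pairwise_preference_loss` and `mean_loss` are illustrative assumptions, not the authors' implementation.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred sample wins.

    Assumes a Bradley-Terry model: P(preferred wins) = sigmoid(r_pref - r_rej).
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_loss(score_pairs):
    """Average loss over (preferred_score, rejected_score) pairs."""
    return sum(pairwise_preference_loss(p, r) for p, r in score_pairs) / len(score_pairs)
```

With equal scores the loss is log 2; it shrinks as the preferred sample's score pulls ahead of the rejected one, which is the gradient signal that pushes the reward model toward the labelers' choices.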
If this is right
- Stylistic continuation tasks reach good performance with only a few thousand human comparisons.
- Summarization models learn to select and copy key sentences while discarding introductory material.
- Reward learning from preferences succeeds on real language tasks where hand-crafted rewards are difficult to define.
- The same pipeline can be reused for other tasks in which quality is best judged by humans rather than automatic metrics.
Where Pith is reading between the lines
- The approach may require additional safeguards if labelers consistently favor easy-to-detect patterns that do not reflect deeper quality.
- Scaling the number of comparisons or selecting them more efficiently could reduce the influence of any single heuristic in the learned reward model.
- The method provides a concrete route for aligning language models to subjective criteria across domains beyond the four tasks tested.
- Models trained this way might still need periodic re-training as human preferences shift over time or across populations.
Load-bearing premise
That human preference labels supply a consistent and generalizable measure of output quality rather than simply rewarding superficial patterns such as sentence length or verbatim copying.
What would settle it
If a model trained on the collected human preferences produces lower-quality outputs than a simple rule-based baseline (such as always copying the first few sentences) when both are evaluated by new human raters on held-out data, the claim that preferences provide a robust training signal would fail.
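The rule-based baseline named in this test can be sketched as a lead-k extractor. This is a hypothetical illustration: `lead_k_summary`, the default `k=3`, and the naive regex sentence splitter are assumptions, not something the paper specifies.

```python
import re


def lead_k_summary(document: str, k: int = 3) -> str:
    """Return the first k sentences of a document as its 'summary'.

    Sentences are split naively at whitespace that follows '.', '!' or '?'.
    This will mis-split abbreviations; it is only meant as a simple baseline.
    """
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(sentences[:k])
```

Outputs of this baseline and of the fine-tuned model would then be shown side by side to new raters on held-out documents.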
Original abstract
Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that reward learning from human preference comparisons can be used to fine-tune pretrained language models on natural language tasks. It reports good performance on stylistic text continuation using only 5,000 human comparisons. For summarization on TL;DR and CNN/Daily Mail, it reports reasonable ROUGE scores and strong human ratings with 60,000 comparisons, with models copying full sentences from the source while skipping preamble; the authors note this may exploit labeler heuristics rather than demonstrate genuine summarization skill.
Significance. If the results can be shown to reflect genuine preference-based learning rather than heuristic imitation, the work would be significant as an early demonstration that modest human feedback data can steer generative language models toward desired behaviors in open-ended tasks, supporting the broader goal of aligning language models with human values via RL.
major comments (2)
- [Abstract] The central claim that the method yields 'very good performance' on summarization is immediately qualified by the observation that models copy whole sentences from the input (omitting preamble) and that this 'may be exploiting the fact that labelers rely on simple heuristics.' If labelers reward sentence copying, the 60k comparisons do not establish that the reward model learns summarization skill; this directly undermines the paper's assertion that the approach works for complex language tasks.
- [Evaluation] No error bars, confidence intervals, or statistical tests are reported for the human judgments or ROUGE scores, and the manuscript provides insufficient detail on the exact protocol for collecting the 5k/60k comparisons or on how the reward model is trained and applied in RL fine-tuning. These omissions make it impossible to evaluate the reliability or reproducibility of the reported gains.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, agreeing where revisions are warranted to improve clarity and rigor.
- Referee: The abstract claims 'very good performance' on summarization but qualifies it by noting sentence copying that may exploit labeler heuristics, undermining the claim that the approach works for complex tasks.
  Authors: We agree the abstract phrasing risks overstating the summarization results. The observed behavior demonstrates that the reward model successfully captures human preferences (leading to high human ratings and reasonable ROUGE), but as noted in the paper, this may rely on heuristics rather than deep summarization skill. We will revise the abstract to remove the unqualified 'very good performance' claim, explicitly state the copying behavior, and clarify that the results validate preference-based steering even when preferences align with simple heuristics.
  Revision: yes
- Referee: No error bars or statistical tests for human judgments or ROUGE; insufficient details on comparison collection protocol, reward model training, and RL fine-tuning.
  Authors: We acknowledge these omissions reduce reproducibility. In revision we will add error bars and confidence intervals to all reported human evaluation and ROUGE results, include statistical significance tests where appropriate, and expand the methods sections with precise protocols for collecting the 5k/60k comparisons, reward model training details (including architecture, loss, and hyperparameters), and the exact RL fine-tuning procedure (PPO settings, KL coefficient, etc.).
  Revision: yes
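One way confidence intervals for a human-preference win rate could be computed is a percentile bootstrap. The sketch below is an assumption about a reasonable procedure, not the authors' protocol; `bootstrap_win_rate_ci`, the resample count, and the fixed seed are hypothetical choices.

```python
import random


def bootstrap_win_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a win rate.

    outcomes: list of 1 (model preferred by the rater) or 0 (baseline preferred).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    rates = []
    for _ in range(n_resamples):
        # Resample n judgments with replacement and record the win rate.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(resample) / n)
    rates.sort()
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper
```

This treats each judgment as independent; if the same rater or the same document contributes many comparisons, resampling whole raters or documents would be the more cautious choice.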
Circularity Check
No circularity: empirical pipeline grounded in independent human evaluations
The paper's core contribution is an empirical pipeline: collect human preference comparisons, train a reward model on them, then apply RL (with KL penalty) to fine-tune a pretrained LM. Results on stylistic continuation and summarization are reported via separate human labelers and ROUGE scores. No derivation, equation, or 'prediction' reduces to the training data by construction; the method does not rename a fit as a forecast or import uniqueness via self-citation chains. The paper itself notes the summarization heuristic risk, treating it as an empirical observation rather than a definitional loop. The derivation chain is therefore self-contained against external human judgments.
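The KL penalty in this pipeline can be sketched as a shaped reward: the reward model's score minus a multiple of the log-probability ratio between the fine-tuned policy and the original pretrained model. The form below is the common one for KL-regularized fine-tuning and is stated here as an assumption; the function name, the sequence-level log-probabilities, and the value `beta=0.1` are hypothetical.

```python
def kl_penalized_reward(reward_model_score: float,
                        logprob_policy: float,
                        logprob_reference: float,
                        beta: float = 0.1) -> float:
    """Reward passed to the RL step for one sampled continuation.

    reward_model_score: scalar from the learned reward model.
    logprob_policy:     log-probability of the sample under the policy being trained.
    logprob_reference:  log-probability of the same sample under the pretrained model.
    beta:               KL coefficient (hypothetical value, normally tuned).
    """
    # A sample the policy likes far more than the reference model is penalized,
    # which discourages drifting away from fluent pretrained behavior.
    return reward_model_score - beta * (logprob_policy - logprob_reference)
```

When the policy and the reference model assign the same log-probability, the penalty vanishes and the shaped reward equals the reward model's score.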
Axiom & Free-Parameter Ledger
free parameters (1)
- number of human comparisons
axioms (1)
- domain assumption: Human preferences over model outputs can be captured by a learned reward model that generalizes to new generations.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.LogicAsFunctionalEquationSatisfiesLawsOfLogic (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions."
- IndisputableMonolith.Foundation.LawOfExistencedefect_zero_iff_one (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
- Efficient Preference Poisoning Attack on Offline RLHF
  Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
- ORPO: Monolithic Preference Optimization without Reference Model
  ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
- Decision Transformer: Reinforcement Learning via Sequence Modeling
  Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
- Language Models are Few-Shot Learners
  GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
  DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.
- Measuring Safety Alignment Effects in Autonomous Security Agents
  A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security...
- Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
  PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generatio...
- From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery
  QuantEvolver applies reinforcement fine-tuning to evolve an LLM policy for generating executable alpha factor expressions, yielding higher-quality and more complementary factors than prompt-based baselines on market b...
- Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
  Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
- Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
  Reinforce Adjoint Matching derives a simple consistency loss for RL post-training of diffusion models by tilting the clean distribution toward higher-reward samples under KL regularization while keeping the noising pr...
- Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning
  Risk-sensitive preference games retain monotonicity via translation-invariant risk measures, enabling convergent self-play algorithms with stability bounds and empirical robustness across data strata.
- BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
  BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.
- Convex Optimization with Nested Evolving Feasible Sets
  For convex losses in nested evolving feasible sets, a lazy algorithm balances O(T^{1-β}) regret with O(T^β) movement for any β; for strongly convex or sharp losses, Frugal achieves zero regret with O(log T) movement, ...
- Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
  RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
- $f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses
  The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.
- Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
  Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.
- Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
  TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured uplift on a frozen executor, outperforming execution-only training on math and code b...
- Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
  Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...
- Multi-User Dueling Bandits: A Fair Approach using Nash Social Welfare
  The work establishes a regret lower bound of Ω(T^{2/3} min(K,D)^{1/3}) for fair multi-user dueling bandits with heterogeneous Condorcet winners and gives algorithms achieving matching upper bounds up to logs.
- Three Models of RLHF Annotation: Extension, Evidence, and Authority
  RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
- Interactive Episodic Memory with User Feedback
  Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.
- Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
  R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
- Autogenesis: A Self-Evolving Agent Protocol
  Autogenesis Protocol defines structured resource management and closed-loop self-evolution for multi-agent LLM systems, with the resulting AGS showing gains over baselines on long-horizon benchmarks.
- Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning
  Reinforcement learning with a multi-part reward teaches LLMs to output independent, meaning-preserving sentence edits that raise argument appropriateness close to full rewriting.
- E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning
  E2E-REME outperforms nine LLMs in accuracy and efficiency for end-to-end microservice remediation by using experience-simulation reinforcement fine-tuning on a new benchmark called MicroRemed.
- From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence
  Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.
- Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
  Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
- Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework
  Proposes a task taxonomy for functional diversity in LLM outputs, validates it via user study, introduces targeted sampling to boost diversity only where needed, and presents evidence that the diversity-quality tradeo...
- Incentivizing High-Quality Human Annotations with Golden Questions
  The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.
- Group-in-Group Policy Optimization for LLM Agent Training
  GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
- Improving LLM Unlearning Robustness via Random Perturbations
  LLM unlearning is reframed as inadvertently installing backdoor triggers on forget-tokens; Random Noise Augmentation is introduced as a defense that improves robustness with theoretical guarantees.
- KTO: Model Alignment as Prospect Theoretic Optimization
  KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
- Self-Rewarding Language Models
  Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
- Measuring Faithfulness in Chain-of-Thought Reasoning
  Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
- Towards Measuring the Representation of Subjective Global Opinions in Language Models
  LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliab...
- Let's Verify Step by Step
  Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
- Red Teaming Language Models with Language Models
  One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.
- Learning to summarize from human feedback
  Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
- Convex Optimization for Alignment and Preference Learning on a Single GPU
  COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models...
- Hierarchical Variational Policies for Reward-Guided Diffusion
  A hierarchical variational formulation amortizes test-time guidance in diffusion models to achieve strong quality-speed tradeoffs with significantly reduced inference compute.
- Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
  POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
- What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
  SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
- Reinforcement Learning Assisted Quantum Simulation of Many-Body Excited States and Real-Time Dynamics
  The work generalizes RL-CQE to excited states and time evolution via adaptive operator selection and a constant-scaling ansatz, reporting chemical accuracy on chemical systems with compact representations.
- Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment
  Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on...
- Evaluation Drift in LLM Personality Induction: Are We Moving the Goalpost?
  Fine-tuning LLMs on essays reduces variance in IPIP-NEO responses across models but does not raise full five-trait profile accuracy above near-chance levels from unguided text.
- Active Learning MPC Objective Functions from Preferences
  Active learning strategies for preference-based MPC objective learning achieve better closed-loop alignment with human preferences using fewer queries than random sampling in numerical tests.
- Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
  Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.
- Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
  Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
- Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
  MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
- When Vision Speaks for Sound
  Video MLLMs show an audio-visual Clever Hans effect relying on visual-acoustic correlations rather than audio verification; Thud interventions diagnose it and a 10K-sample preference alignment improves intervention pe...
- Driving Intents Amplify Planning-Oriented Reinforcement Learning
  DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).
- Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
  A new RL objective adapts trust-region and off-policy handling automatically via normalized effective sample size of batch policy ratios, matching tuned baselines without new hyperparameters.
- Discrete Flow Matching for Offline-to-Online Reinforcement Learning
  DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
- PriorZero: Bridging Language Priors and World Models for Decision Making
  PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
- Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
  Standard preference learning induces spurious feature reliance via mean bias and correlation leakage, creating irreducible distribution shift vulnerabilities that tie training mitigates without degrading causal learning.
- Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
  Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
- Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
  Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.