Fine-Tuning Language Models from Human Preferences

Alec Radford; Daniel M. Ziegler; Dario Amodei; Geoffrey Irving; Jeffrey Wu; Nisan Stiennon; Paul Christiano; Tom B. Brown

arXiv: 1909.08593 · v2 · submitted 2019-09-18 · cs.CL · cs.LG · stat.ML

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving

Pith reviewed 2026-05-10 20:54 UTC · model grok-4.3

classification: cs.CL · cs.LG · stat.ML

keywords: language models, human preferences, reward learning, reinforcement learning, fine-tuning, summarization, text generation, preference modeling

The pith

Language models can be fine-tuned via reinforcement learning on reward signals learned from human preference comparisons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to apply reward learning to language by collecting human judgments on pairs of model outputs, training a reward model from those judgments, and then using the reward model to guide reinforcement learning updates to a pretrained language model. For continuing text with a target style such as positive sentiment, the method produces good results after only 5,000 human comparisons. For summarization on the TL;DR and CNN/Daily Mail datasets, models trained with 60,000 comparisons copy whole sentences from the source while skipping irrelevant preamble, which yields reasonable ROUGE scores and high ratings from human labelers.

Core claim

By training a reward model on human pairwise comparisons of language-model outputs and then applying reinforcement learning with that reward model, pretrained language models can be fine-tuned to continue text in desired styles or to produce summaries that focus on relevant content from long documents.

What carries the argument

A reward model trained on human pairwise comparisons of model outputs, which supplies the scalar reward signal used by proximal policy optimization to update the language model parameters.
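As a rough illustration of the reward-model step, the sketch below computes the negative log-likelihood of a human's choice among scored candidates under a softmax choice model. The softmax form, the function name comparison_loss, and the example scores are assumptions made for illustration, not details taken from this page. With two candidates the loss reduces to the standard pairwise logistic (Bradley-Terry) loss.

```python
import math


def comparison_loss(rewards, chosen):
    """Negative log-likelihood that the labeler picks `chosen` among the candidates.

    rewards: list of scalar reward-model scores, one per candidate output.
    chosen:  index of the candidate the human labeler preferred.
    """
    # Numerically stable log-sum-exp over the candidate scores.
    m = max(rewards)
    log_z = m + math.log(sum(math.exp(r - m) for r in rewards))
    return log_z - rewards[chosen]


# Hypothetical scores for two candidate continuations.
loss_agree = comparison_loss([2.0, 0.0], chosen=0)     # model ranks the preferred output higher
loss_disagree = comparison_loss([2.0, 0.0], chosen=1)  # model ranks the preferred output lower
```

Minimizing this loss over the collected comparisons pushes the reward model to score human-preferred outputs higher; the resulting scalar score is what the policy optimizer then consumes.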

If this is right

Stylistic continuation tasks reach good performance with only 5,000 human comparisons.
Summarization models learn to select and copy key sentences while discarding introductory material.
Reward learning from preferences succeeds on real language tasks where hand-crafted rewards are difficult to define.
The same pipeline can be reused for other tasks in which quality is best judged by humans rather than automatic metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach may require additional safeguards if labelers consistently favor easy-to-detect patterns that do not reflect deeper quality.
Scaling the number of comparisons or selecting them more efficiently could reduce the influence of any single heuristic in the learned reward model.
The method provides a concrete route for aligning language models to subjective criteria across domains beyond the four tasks tested.
Models trained this way might still need periodic re-training as human preferences shift over time or across populations.

Load-bearing premise

That human preference labels supply a consistent and generalizable measure of output quality rather than simply rewarding superficial patterns such as sentence length or verbatim copying.

What would settle it

If a model trained on the collected human preferences produces lower-quality outputs than a simple rule-based baseline (such as always copying the first few sentences) when both are evaluated by new human raters on held-out data, the claim that preferences provide a robust training signal would fail.
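The rule-based baseline named in this test, and the copying behaviour it is meant to control for, can both be made concrete. The sketch below is illustrative only: split_sentences, lead_k, and copied_fraction are hypothetical helper names, the regex sentence splitter is a deliberately naive assumption, and the example document is invented.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def lead_k(document, k=3):
    """Rule-based baseline: return the first k sentences of the document."""
    return " ".join(split_sentences(document)[:k])


def copied_fraction(summary, document):
    """Fraction of summary sentences that appear verbatim in the document."""
    source = set(split_sentences(document))
    sentences = split_sentences(summary)
    if not sentences:
        return 0.0
    return sum(s in source for s in sentences) / len(sentences)


# Invented example text, not drawn from TL;DR or CNN/Daily Mail.
doc = "Hello there. The mayor resigned on Monday. No successor was named. Weather was mild."
baseline = lead_k(doc, k=2)
mixed = copied_fraction("The mayor resigned on Monday. He left quietly.", doc)
```

A copied fraction near 1.0 for model summaries would indicate mostly extractive behaviour; comparing human ratings of such summaries against the lead-k output is the kind of head-to-head evaluation this test describes.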

Original abstract

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This early RLHF paper shows human comparisons can steer pretrained models on style and summarization with modest data, but the summarization results likely exploit labeler shortcuts like sentence copying.

The punchline is that reward models trained on human preference comparisons can improve language model outputs on stylistic continuation with only 5,000 labels, and on summarization with 60,000. The work combines generative pretraining with this reward learning setup and reports human evaluations on TL;DR and CNN/DM tasks. That combination on concrete language tasks was not standard in the cited prior work at the time.

The paper does a solid job of giving specific numbers and noting that the summarization outputs copy full sentences while skipping preamble, which produces reasonable ROUGE and strong human scores. It is also direct about the risk that labelers are using simple heuristics. That honesty is useful.

The main soft spot is that the summarization claim rests on whether the reward signal captures real quality or just the copying pattern the labelers reward. The abstract itself flags this possibility, so the concern lands. Without full methods, training curves, or error bars it is hard to judge how much the 60k comparisons actually teach summarization skill versus imitation of a shallow cue. The stylistic results look more robust on the limited evidence given.

This paper is for readers who want to see the experimental origins of preference tuning on language models. Anyone working on alignment techniques or early RL applications to text will get concrete setup details and numbers to build from. It deserves a serious referee because the core method is grounded in external human labels and the authors surface the key limitation themselves rather than hiding it. I would send it to peer review so the methods and statistical details can be checked.

Referee Report

2 major / 0 minor

Summary. The paper claims that reward learning from human preference comparisons can be used to fine-tune pre-trained language models on natural language tasks. It reports good performance on stylistic text continuation using only 5,000 human comparisons and, for summarization on TL;DR and CNN/Daily Mail, reasonable ROUGE scores plus strong human ratings with 60,000 comparisons, where models copy full sentences from the source while skipping preamble; the authors note this may exploit labeler heuristics rather than demonstrate genuine summarization skill.

Significance. If the results can be shown to reflect genuine preference-based learning rather than heuristic imitation, the work would be significant as an early demonstration that modest human feedback data can steer generative language models toward desired behaviors in open-ended tasks, supporting the broader goal of aligning language models with human values via RL.

major comments (2)

[Abstract] The central claim that the method yields 'very good performance' on summarization is immediately qualified by the observation that models copy whole sentences from the input (omitting preamble) and that this 'may be exploiting the fact that labelers rely on simple heuristics.' If labelers reward sentence copying, the 60k comparisons do not establish that the reward model learns summarization skill; this directly undermines the paper's assertion that the approach works for complex language tasks.
[Evaluation] No error bars, confidence intervals, or statistical tests are reported for the human judgments or ROUGE scores, and the manuscript provides insufficient detail on the exact protocol for collecting the 5k/60k comparisons or on how the reward model is trained and applied in RL fine-tuning. These omissions make it impossible to evaluate the reliability or reproducibility of the reported gains.
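One way to attach the kind of uncertainty estimate requested here to a human-preference win rate is a Wilson score interval. The sketch below is an illustration only: the counts are hypothetical and are not results from the paper, and the Wilson interval is one reasonable choice among several (a bootstrap over comparisons would also work).

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical evaluation: the fine-tuned model is preferred in 70 of 100 comparisons.
low, high = wilson_interval(70, 100)
```

If the lower bound stays above 0.5, the preference for the fine-tuned model is unlikely to be a sampling artifact at that confidence level; with small evaluation sets the interval is often wide enough to change the reading of a headline win rate.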

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, agreeing where revisions are warranted to improve clarity and rigor.

Referee: The abstract claims 'very good performance' on summarization but qualifies it by noting sentence copying that may exploit labeler heuristics, undermining the claim that the approach works for complex tasks.

Authors: We agree the abstract phrasing risks overstating the summarization results. The observed behavior demonstrates that the reward model successfully captures human preferences (leading to high human ratings and reasonable ROUGE), but as noted in the paper, this may rely on heuristics rather than deep summarization skill. We will revise the abstract to remove the unqualified 'very good performance' claim, explicitly state the copying behavior, and clarify that the results validate preference-based steering even when preferences align with simple heuristics. revision: yes
Referee: No error bars or statistical tests for human judgments or ROUGE; insufficient details on comparison collection protocol, reward model training, and RL fine-tuning.

Authors: We acknowledge these omissions reduce reproducibility. In revision we will add error bars and confidence intervals to all reported human evaluation and ROUGE results, include statistical significance tests where appropriate, and expand the methods sections with precise protocols for collecting the 5k/60k comparisons, reward model training details (including architecture, loss, and hyperparameters), and the exact RL fine-tuning procedure (PPO settings, KL coefficient, etc.). revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline grounded in independent human evaluations

The paper's core contribution is an empirical pipeline: collect human preference comparisons, train a reward model on them, then apply RL (with KL penalty) to fine-tune a pretrained LM. Results on stylistic continuation and summarization are reported via separate human labelers and ROUGE scores. No derivation, equation, or 'prediction' reduces to the training data by construction; the method does not rename a fit as a forecast or import uniqueness via self-citation chains. The paper itself notes the summarization heuristic risk, treating it as an empirical observation rather than a definitional loop. The derivation chain is therefore self-contained against external human judgments.
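The KL penalty mentioned in this rationale can be sketched as follows: the reward-model score for a sampled text is reduced by a coefficient times the log-ratio between the fine-tuned policy and the original pretrained model, which discourages the policy from drifting far from its starting point. The function name, the coefficient value beta=0.1, and the token log-probabilities are assumptions chosen for illustration, not settings reported on this page.

```python
def penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus a KL-style penalty for one sampled text.

    policy_logprobs:    per-token log-probabilities under the fine-tuned policy.
    reference_logprobs: per-token log-probabilities under the pretrained model.
    beta:               penalty coefficient (hypothetical value).
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability lists must cover the same tokens")
    # log(pi(y|x) / rho(y|x)) for the sample; its expectation under pi is the KL divergence.
    log_ratio = sum(policy_logprobs) - sum(reference_logprobs)
    return reward - beta * log_ratio


# Hypothetical three-token sample that the policy finds more likely than the reference does.
value = penalized_reward(1.0, [-2.0, -3.0, -5.0], [-3.0, -5.0, -6.0])
```

When the policy assigns the sample a higher probability than the pretrained model, the log-ratio is positive and the effective reward shrinks; a larger beta keeps generations closer to the pretrained distribution at the cost of slower reward gains.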

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on the assumption that human pairwise preferences can be modeled as a scalar reward function that generalizes beyond the collected comparisons. No new physical entities or mathematical axioms beyond standard RL and supervised learning are introduced.

free parameters (1)

number of human comparisons
5,000 for stylistic tasks and 60,000 for summarization are chosen quantities that determine reported performance.

axioms (1)

Domain assumption: Human preferences over model outputs can be captured by a learned reward model that generalizes to new generations.
Invoked when training the reward model from comparisons and then optimizing the policy against it.
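One simple check on this assumption is to measure how often a trained reward model agrees with human choices on comparisons it did not see during training. The sketch below assumes each held-out example is a pair of (candidate scores, index chosen by the labeler); the function name agreement_rate and the example data are hypothetical.

```python
def agreement_rate(examples):
    """Fraction of comparisons where the top-scored candidate is the human's choice.

    examples: list of (rewards, chosen) pairs, where `rewards` holds one
              reward-model score per candidate and `chosen` is the human's pick.
    """
    if not examples:
        raise ValueError("need at least one held-out comparison")
    hits = 0
    for rewards, chosen in examples:
        # Index of the highest-scoring candidate (first one wins on ties).
        best = max(range(len(rewards)), key=rewards.__getitem__)
        hits += int(best == chosen)
    return hits / len(examples)


# Invented held-out comparisons, two candidates each.
held_out = [([0.2, 1.5], 1), ([0.9, 0.1], 0), ([0.3, 0.4], 0), ([1.0, -1.0], 0)]
rate = agreement_rate(held_out)
```

High held-out agreement is necessary but not sufficient for the assumption to hold: a reward model can agree with labelers precisely because both track the same shallow cue, such as verbatim copying.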

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LogicAsFunctionalEquation · SatisfiesLawsOfLogic · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Passage: "Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions."

IndisputableMonolith.Foundation.LawOfExistence · defect_zero_iff_one · echoes

Passage: "For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Preference Poisoning Attack on Offline RLHF
cs.LG 2026-05 unverdicted novelty 8.0

Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
ORPO: Monolithic Preference Optimization without Reference Model
cs.CL 2024-03 conditional novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Decision Transformer: Reinforcement Learning via Sequence Modeling
cs.LG 2021-06 accept novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
Language Models are Few-Shot Learners
cs.CL 2020-05 accept novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
cs.AI 2026-05 conditional novelty 7.0

DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.
Measuring Safety Alignment Effects in Autonomous Security Agents
cs.CR 2026-05 conditional novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security...
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
cs.AI 2026-05 unverdicted novelty 7.0

PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generatio...
From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery
cs.CE 2026-05 unverdicted novelty 7.0

QuantEvolver applies reinforcement fine-tuning to evolve an LLM policy for generating executable alpha factor expressions, yielding higher-quality and more complementary factors than prompt-based baselines on market b...
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
cs.LG 2026-05 unverdicted novelty 7.0

Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 7.0

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
cs.LG 2026-05 unverdicted novelty 7.0

Reinforce Adjoint Matching derives a simple consistency loss for RL post-training of diffusion models by tilting the clean distribution toward higher-reward samples under KL regularization while keeping the noising pr...
Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning
cs.GT 2026-05 unverdicted novelty 7.0

Risk-sensitive preference games retain monotonicity via translation-invariant risk measures, enabling convergent self-play algorithms with stability bounds and empirical robustness across data strata.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
cs.AI 2026-05 unverdicted novelty 7.0

BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.
Convex Optimization with Nested Evolving Feasible Sets
cs.LG 2026-05 unverdicted novelty 7.0

For convex losses in nested evolving feasible sets, a lazy algorithm balances O(T^{1-β}) regret with O(T^β) movement for any β; for strongly convex or sharp losses, Frugal achieves zero regret with O(log T) movement, ...
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
cs.CL 2026-05 unverdicted novelty 7.0

RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses
cs.LG 2026-05 unverdicted novelty 7.0

The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.
Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
cs.SE 2026-05 unverdicted novelty 7.0

Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
cs.AI 2026-05 unverdicted novelty 7.0

TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured uplift on a frozen executor, outperforming execution-only training on math and code b...
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
cs.LG 2026-05 unverdicted novelty 7.0

Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...
Multi-User Dueling Bandits: A Fair Approach using Nash Social Welfare
cs.LG 2026-05 unverdicted novelty 7.0

The work establishes a regret lower bound of Ω(T^{2/3} min(K,D)^{1/3}) for fair multi-user dueling bandits with heterogeneous Condorcet winners and gives algorithms achieving matching upper bounds up to logs.
Three Models of RLHF Annotation: Extension, Evidence, and Authority
cs.CY 2026-04 unverdicted novelty 7.0

RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
Interactive Episodic Memory with User Feedback
cs.CV 2026-04 unverdicted novelty 7.0

Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
cs.CL 2026-04 unverdicted novelty 7.0

R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
Autogenesis: A Self-Evolving Agent Protocol
cs.AI 2026-04 unverdicted novelty 7.0

Autogenesis Protocol defines structured resource management and closed-loop self-evolution for multi-agent LLM systems, with the resulting AGS showing gains over baselines on long-horizon benchmarks.
Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning
cs.CL 2026-04 unverdicted novelty 7.0

Reinforcement learning with a multi-part reward teaches LLMs to output independent, meaning-preserving sentence edits that raise argument appropriateness close to full rewriting.
E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning
cs.SE 2026-04 unverdicted novelty 7.0

E2E-REME outperforms nine LLMs in accuracy and efficiency for end-to-end microservice remediation by using experience-simulation reinforcement fine-tuning on a new benchmark called MicroRemed.
From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence
cs.SE 2026-04 conditional novelty 7.0

Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
cs.LG 2026-03 unverdicted novelty 7.0

Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework
cs.CL 2025-09 conditional novelty 7.0

Proposes a task taxonomy for functional diversity in LLM outputs, validates it via user study, introduces targeted sampling to boost diversity only where needed, and presents evidence that the diversity-quality tradeo...
Incentivizing High-Quality Human Annotations with Golden Questions
cs.GT 2025-05 unverdicted novelty 7.0

The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.
Group-in-Group Policy Optimization for LLM Agent Training
cs.LG 2025-05 unverdicted novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
Improving LLM Unlearning Robustness via Random Perturbations
cs.CL 2025-01 unverdicted novelty 7.0

LLM unlearning is reframed as inadvertently installing backdoor triggers on forget-tokens; Random Noise Augmentation is introduced as a defense that improves robustness with theoretical guarantees.
KTO: Model Alignment as Prospect Theoretic Optimization
cs.LG 2024-02 conditional novelty 7.0

KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Self-Rewarding Language Models
cs.CL 2024-01 conditional novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
Measuring Faithfulness in Chain-of-Thought Reasoning
cs.AI 2023-07 conditional novelty 7.0

Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
Towards Measuring the Representation of Subjective Global Opinions in Language Models
cs.CL 2023-06 conditional novelty 7.0

LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliab...
Let's Verify Step by Step
cs.LG 2023-05 accept novelty 7.0

Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
Red Teaming Language Models with Language Models
cs.CL 2022-02 conditional novelty 7.0

One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.
Learning to summarize from human feedback
cs.CL 2020-09 conditional novelty 7.0

Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
Convex Optimization for Alignment and Preference Learning on a Single GPU
cs.LG 2026-05 unverdicted novelty 6.0

COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models...
Hierarchical Variational Policies for Reward-Guided Diffusion
cs.LG 2026-05 conditional novelty 6.0

A hierarchical variational formulation amortizes test-time guidance in diffusion models to achieve strong quality-speed tradeoffs with significantly reduced inference compute.
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
cs.AI 2026-05 unverdicted novelty 6.0

POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
cs.AI 2026-05 unverdicted novelty 6.0

SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
Reinforcement Learning Assisted Quantum Simulation of Many-Body Excited States and Real-Time Dynamics
quant-ph 2026-05 unverdicted novelty 6.0

The work generalizes RL-CQE to excited states and time evolution via adaptive operator selection and a constant-scaling ansatz, reporting chemical accuracy on chemical systems with compact representations.
Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment
cs.CL 2026-05 unverdicted novelty 6.0

Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on...
Evaluation Drift in LLM Personality Induction: Are We Moving the Goalpost?
cs.CL 2026-05 unverdicted novelty 6.0

Fine-tuning LLMs on essays reduces variance in IPIP-NEO responses across models but does not raise full five-trait profile accuracy above near-chance levels from unguided text.
Active Learning MPC Objective Functions from Preferences
eess.SY 2026-05 unverdicted novelty 6.0

Active learning strategies for preference-based MPC objective learning achieve better closed-loop alignment with human preferences using fewer queries than random sampling in numerical tests.
Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
cs.CL 2026-05 unverdicted novelty 6.0

Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
cs.AI 2026-05 unverdicted novelty 6.0

MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
When Vision Speaks for Sound
cs.CV 2026-05 unverdicted novelty 6.0

Video MLLMs show an audio-visual Clever Hans effect relying on visual-acoustic correlations rather than audio verification; Thud interventions diagnose it and a 10K-sample preference alignment improves intervention pe...
Driving Intents Amplify Planning-Oriented Reinforcement Learning
cs.RO 2026-05 unverdicted novelty 6.0

DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).
Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
cs.LG 2026-05 unverdicted novelty 6.0

A new RL objective adapts trust-region and off-policy handling automatically via normalized effective sample size of batch policy ratios, matching tuned baselines without new hyperparameters.
Discrete Flow Matching for Offline-to-Online Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
PriorZero: Bridging Language Priors and World Models for Decision Making
cs.LG 2026-05 unverdicted novelty 6.0

PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 6.0

TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
cs.LG 2026-05 unverdicted novelty 6.0

Standard preference learning induces spurious feature reliance via mean bias and correlation leakage, creating irreducible distribution shift vulnerabilities that tie training mitigates without degrading causal learning.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV 2026-05 unverdicted novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
cs.LG 2026-05 unverdicted novelty 6.0

Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 162 Pith papers · 5 internal anchors

[1]

Deep batch active learning by diverse, uncertain gradient lower bounds

Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671,

work page arXiv 1906
[2]

Learning to understand goal specifications by modelling reward

Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Arian Hosseini, Pushmeet Kohli, and Edward Grefenstette. Learning to understand goal specifications by modelling reward. arXiv preprint arXiv:1806.01946,

work page arXiv
[3]

Supervising strong learners by amplifying weak experts

Paul Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575,

work page Pith review arXiv
[4]

Preference-based interactive multi-document summarisation

Yang Gao, Christian M Meyer, and Iryna Gurevych. Preference-based interactive multi-document summarisation. arXiv preprint arXiv:1906.02923, 2019a. Yang Gao, Christian M Meyer, Mohsen Mesgar, and Iryna Gurevych. Reward learning for efficient reinforcement learning in extractive document summarisation. arXiv preprint arXiv:1907.12894, 2019b. Sebastian Geh...

work page arXiv 1906
[5]

Discriminative Active Learning

Daniel Gissin and Shai Shalev-Shwartz. Discriminative active learning. arXiv preprint arXiv:1907.06347,

work page Pith review arXiv 1907
[6]

Learning from Dialogue after Deployment: Feed Yourself, Chatbot!

Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415,

work page Pith review arXiv 1901
[7]

Universal Language Model Fine-tuning for Text Classification

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146,

work page Pith review arXiv
[8]

Active Learning for Speech Recognition: the Power of Gradients

Jiaji Huang, Rewon Child, Vinay Rao, Hairong Liu, Sanjeev Satheesh, and Adam Coates. Active learning for speech recognition: the power of gradients. arXiv preprint arXiv:1612.03226,

work page Pith review arXiv
[9]

Reward learning from human preferences and demonstrations in Atari

URL https://arxiv.org/abs/1811.06521. Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate. arXiv preprint arXiv:1805.00899,

work page Pith review arXiv
[10]

AI safety via debate

URL https://arxiv.org/abs/1805.00899. Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato, Richard E Turner, and Douglas Eck. Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1645–1654. JMLR.org,

work page internal anchor Pith review arXiv
[11]

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456,

work page Pith review arXiv 1907
[12]

Sample efficient text summarization using a single pre-trained transformer

Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, and Lukasz Kaiser. Sample efficient text summarization using a single pre-trained transformer. arXiv preprint arXiv:1905.08836,

work page arXiv 1905
[13]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

Julia Kreutzer, Joshua Uyheng, and Stefan Riezler. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. arXiv preprint arXiv:1805.10627,

work page Pith review arXiv
[15]

Neural text summarization: A critical evaluation

Wojciech Kryściński, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Neural text summarization: A critical evaluation. arXiv preprint arXiv:1908.08960,

work page arXiv 1908
[16]

Scalable agent alignment via reward modeling: a research direction

Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871,

work page Pith review arXiv
[17]

Dialogue Learning With Human-In-The-Loop

Jiwei Li, Alexander H Miller, Sumit Chopra, Marc’Aurelio Ranzato, and Jason Weston. Dialogue learning with human-in-the-loop. arXiv preprint arXiv:1611.09823,

work page Pith review arXiv
[18]

Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback

Khanh Nguyen, Hal Daumé III, and Jordan Boyd-Graber. Reinforcement learning for bandit neural machine translation with simulated human feedback. arXiv preprint arXiv:1707.07402,

work page Pith review arXiv
[19]

A Deep Reinforced Model for Abstractive Summarization

Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304,

work page Pith review arXiv
[20]

Finding generalizable evidence by learning to convince Q&A models

Ethan Perez, Siddharth Karamcheti, Rob Fergus, Jason Weston, Douwe Kiela, and Kyunghyun Cho. Finding generalizable evidence by learning to convince Q&A models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, November

work page 2019
[21]

Deep contextualized word representations

Association for Computational Linguistics. Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365,

work page Pith review arXiv
[22]

Learning to Generate Reviews and Discovering Sentiment

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444,

work page Pith review arXiv
[23]

Sequence Level Training with Recurrent Neural Networks

URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732,

work page Pith review arXiv
[24]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Get To The Point: Summarization with Pointer-Generator Networks

Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368,

work page Pith review arXiv
[26]

Neural Machine Translation of Rare Words with Subword Units

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909,

work page internal anchor Pith review arXiv
[27]

Controllable neural story generation via reinforcement learning

Pradyumna Tambwekar, Murtaza Dhuliawala, Animesh Mehta, Lara J Martin, Brent Harrison, and Mark O Riedl. Controllable neural story generation via reinforcement learning. arXiv preprint arXiv:1809.10736,

work page arXiv
[28]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144,

work page internal anchor Pith review arXiv
[29]

Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators

Sanghyun Yi, Rahul Goel, Chandra Khatri, Tagyoung Chung, Behnam Hedayatnia, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tur. Towards coherent and engaging spoken dialog response generation using automatic conversation evaluators. arXiv preprint arXiv:1904.13015,

work page arXiv 1904
