Training language models to follow instructions with human feedback

Alex Ray; Amanda Askell; Carroll L. Wainwright; Chong Zhang; Diogo Almeida; Fraser Kelton; Jacob Hilton; Jan Leike; Jeff Wu; John Schulman

arXiv: 2203.02155 · v1 · submitted 2022-03-04 · cs.CL · cs.AI · cs.LG

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

Pith reviewed 2026-05-10 16:43 UTC · model grok-4.3

keywords: language models, human feedback, reinforcement learning, instruction following, model alignment, GPT-3, truthfulness, toxicity

The pith

Fine-tuning GPT-3 on human demonstrations and output rankings produces InstructGPT models that humans prefer over the original 175B GPT-3 even at 1.3B parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that language models can be aligned more closely with user intent by first training them on human-written examples of desired responses to prompts and then further adjusting them using human rankings of different model outputs. This two-step process applied to GPT-3 yields InstructGPT, which human evaluators rate higher than the base model on the authors' prompt set. The aligned models also produce more truthful text and fewer toxic outputs while showing only small drops on standard language benchmarks. A reader would care because the result indicates that careful use of human feedback can improve reliability without requiring ever-larger models.

Core claim

The authors collect labeler demonstrations of desired behavior on a mix of written prompts and API-submitted prompts, use them for supervised fine-tuning of GPT-3, then gather rankings of model outputs and apply reinforcement learning from human feedback to obtain InstructGPT. In human evaluations on their prompt distribution, the 1.3B InstructGPT is preferred to the 175B GPT-3, with gains in truthfulness, reductions in toxic generation, and minimal regressions on public NLP datasets.

What carries the argument

Two-stage fine-tuning that begins with supervised learning on human demonstrations of desired outputs and continues with reinforcement learning from human rankings of model responses.

If this is right

Smaller models aligned this way can outperform much larger unaligned models on human preference judgments.
The resulting models generate more truthful content and fewer toxic outputs.
Standard public NLP benchmarks show only minimal performance regressions after the alignment steps.
Fine-tuning with human feedback offers a practical route to making language models follow user instructions more reliably.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same collection and ranking process could be applied to other base models to test whether the preference gains hold beyond the GPT-3 family.
If human feedback can be gathered at scale for more complex or domain-specific prompts, the method might reduce reliance on raw parameter count for capability gains.
Extending the ranking step to capture longer-term user satisfaction rather than single-turn preferences could further tighten alignment.

Load-bearing premise

The preferences expressed by the human labelers on the prompts they saw accurately capture what a wide range of future users will want in real applications.

What would settle it

A new human evaluation on a fresh collection of prompts drawn from actual user interactions where InstructGPT outputs are not rated higher than those from the base GPT-3.

Original abstract

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper shows that supervised fine-tuning plus RLHF on human rankings can make a 1.3B model preferred over 175B GPT-3 on the authors' prompt distribution, with gains on truthfulness and toxicity.

The main result is straightforward: after supervised fine-tuning on labeler demonstrations and then RLHF on output rankings, the resulting InstructGPT models win human preference comparisons against the base GPT-3, even at 100x smaller scale. They also report better truthfulness scores and lower toxicity while staying close on standard NLP benchmarks. The two-stage pipeline is applied at GPT-3 size for the first time in this exact form, and the human evaluations are run on held-out prompts from their collection process. That is the concrete advance. The evidence for the preference claim comes from direct human judgments rather than derived metrics, and the safety-related improvements are measured separately. The paper presents the numbers clearly enough that the central comparison holds on the tested distribution.

The main limitations are practical rather than conceptual. The training data and prompt sources are not released, so exact reproduction requires comparable labeler access and resources. Some win-rate figures lack error bars, which makes it harder to assess how noisy the human ratings are. The assumption that these labelers' preferences will match future users is stated but not tested across different populations. Those are real constraints on how far the findings travel, but they do not undermine the reported results on the authors' own setup.

This is useful reading for anyone building or studying instruction-tuned models and alignment methods. It gives a working recipe with measurable human preference gains at scale. The work is coherent on its own terms and shows clear engagement with the practical problem of making large models follow intent. It deserves peer review because the empirical comparison is new at this scale and the evaluation design is direct enough to be worth referee scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper introduces InstructGPT models obtained by first performing supervised fine-tuning of GPT-3 on a dataset of human-written demonstrations of desired behavior, then further training via reinforcement learning from human feedback (RLHF) using a reward model trained on human preference rankings of model outputs. On a held-out set of prompts drawn from the same distribution (labeler-written and API-submitted), human evaluators prefer outputs from the 1.3B InstructGPT over those from the 175B GPT-3; the aligned models also exhibit higher truthfulness and lower toxicity with only small regressions on public NLP benchmarks.

Significance. If the reported human-preference results hold, the work supplies direct empirical evidence that RLHF can produce substantial alignment gains on instruction-following tasks, including the striking result that a 100x smaller model can be preferred to its much larger base model. The approach is grounded in independent human evaluations rather than circular derivations, and the public benchmarks provide a useful check against capability regression. This strengthens the case for human feedback as a practical alignment technique beyond pure scaling.

major comments (2)

§4 (Human evaluations): The central preference comparison (1.3B InstructGPT preferred to 175B GPT-3) is reported without confidence intervals, sample sizes per comparison, or inter-rater agreement statistics. Because the main claim rests entirely on these human judgments, the absence of uncertainty quantification leaves open the possibility that the observed win rates are sensitive to sampling variability or labeler idiosyncrasies.
§3.3 (RLHF stage): The reward model and PPO training both involve multiple free hyperparameters (learning rates, KL coefficient, etc.). While the paper lists the chosen values, it provides no ablation or sensitivity analysis showing that the reported preference gains are robust to reasonable changes in these choices; this weakens the claim that the gains are attributable to the RLHF procedure itself rather than a narrow hyperparameter sweet spot.
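
The confidence intervals requested in the first major comment could be obtained with a percentile bootstrap over per-comparison outcomes. The following is a minimal sketch, not the authors' analysis: the function name `bootstrap_win_rate_ci`, the 0/1 outcome encoding, and the synthetic data are all assumptions made for illustration.

```python
import random

def bootstrap_win_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a win rate.

    outcomes: list of 0/1 values, where 1 means the candidate model's
    output was preferred in that pairwise comparison (hypothetical encoding).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    rates = []
    for _ in range(n_resamples):
        # Resample n comparisons with replacement and recompute the win rate.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(resample) / n)
    rates.sort()
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper

# Synthetic data for illustration only: 120 wins out of 200 comparisons.
outcomes = [1] * 120 + [0] * 80
point, lower, upper = bootstrap_win_rate_ci(outcomes)
print(f"win rate = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Resampling individual comparisons treats them as independent. If several comparisons share a prompt or a labeler, resampling whole prompts or labelers instead would likely give a more honest interval.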

minor comments (2)

Table 2 and Figure 3: the public-benchmark regressions are described as “minimal,” but the absolute deltas (e.g., on SQuAD, DROP, or HellaSwag) should be stated numerically in the text for quick assessment.
§2.2: the prompt distribution is described only at a high level (“labeler-written and API-submitted”); a short appendix table characterizing prompt length, topic diversity, or task type would aid readers in judging external validity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the work. We address each major comment below, proposing revisions where they strengthen the manuscript without requiring new large-scale experiments.

Referee: §4 (Human evaluations): The central preference comparison (1.3B InstructGPT preferred to 175B GPT-3) is reported without confidence intervals, sample sizes per comparison, or inter-rater agreement statistics. Because the main claim rests entirely on these human judgments, the absence of uncertainty quantification leaves open the possibility that the observed win rates are sensitive to sampling variability or labeler idiosyncrasies.

Authors: We agree that uncertainty quantification would improve the reporting of the human preference results. The evaluations were performed on a held-out set of prompts with multiple labelers, and we have the underlying data to compute bootstrap confidence intervals, exact sample sizes (prompts and pairwise comparisons), and inter-rater agreement (e.g., Fleiss' kappa). We will add these statistics to Section 4 and the appendix in the revised manuscript.
revision: yes

Referee: §3.3 (RLHF stage): The reward model and PPO training both involve multiple free hyperparameters (learning rates, KL coefficient, etc.). While the paper lists the chosen values, it provides no ablation or sensitivity analysis showing that the reported preference gains are robust to reasonable changes in these choices; this weakens the claim that the gains are attributable to the RLHF procedure itself rather than a narrow hyperparameter sweet spot.

Authors: The manuscript does not contain ablations on the RLHF hyperparameters; values were chosen via small-scale preliminary tuning informed by prior RLHF literature. We cannot conduct full sensitivity analyses without substantial new compute and human data collection. In revision we will expand Section 3.3 to better motivate the selected values, note the limitation, and point out that preference gains were observed consistently across model scales (1.3B, 6B, and 175B InstructGPT).
revision: partial
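
The inter-rater agreement statistic named in the first response, Fleiss' kappa, can be computed from a table of per-item category counts. This is a minimal sketch under stated assumptions: every item is rated by the same number of labelers, the two categories stand for "output A preferred" and "output B preferred", and the toy tables are invented for illustration. The function name `fleiss_kappa` is hypothetical and not taken from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] = number of raters who assigned item i to category j.
    Assumes every item received the same number of ratings (at least 2)
    and that ratings are not all in one category (otherwise 1 - P_e is 0).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fall in each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement for each item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Toy data: 4 comparisons, 3 raters each, categories [A preferred, B preferred].
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
print(fleiss_kappa(unanimous))  # 1.0: raters always agree
print(fleiss_kappa(split))      # about -0.333: less agreement than chance
```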

Circularity Check

0 steps flagged

No significant circularity in the empirical results or method

The paper presents an empirical pipeline—collecting labeler demonstrations for supervised fine-tuning of GPT-3, followed by collecting output rankings for reinforcement learning from human feedback—whose final performance claims rest on separate human preference evaluations conducted on held-out prompts from the authors' distribution. These evaluations directly compare the resulting 1.3B InstructGPT model against the 175B GPT-3 baseline and are not derived from or equivalent to the training objective itself. No equations, fitted parameters, or self-citations are invoked in a manner that reduces the reported preference gains, truthfulness improvements, or toxicity reductions to the input data by construction. The central result is therefore an independent measurement rather than a renaming or tautological restatement of the training process.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that human rankings can be modeled as a reward function and that the collected prompts and labelers are representative; no new physical entities are introduced.

free parameters (2)

reward model training hyperparameters
Architecture size, learning rate, and batch size for the reward model are chosen and fitted to the ranking data.
PPO hyperparameters
Clip range, learning rate, and KL coefficient in the reinforcement learning stage are tuned on the reward model.
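
To make the roles of the KL coefficient and clip range concrete, here is a sketch of the two standard formulas they usually appear in: a reward penalized by the log-ratio between the tuned policy and a reference policy, and the PPO clipped surrogate term. It is a generic illustration, not the authors' implementation; the function names and all numeric values are assumptions chosen for the example.

```python
def kl_penalized_reward(reward, logp_policy, logp_reference, beta):
    """Reward model score minus beta times a sampled KL estimate.

    logp_policy / logp_reference: per-token log-probabilities of the same
    sampled response under the tuned policy and the reference model.
    beta: the KL coefficient (illustrative value below, not the paper's).
    """
    log_ratio = sum(lp - lr for lp, lr in zip(logp_policy, logp_reference))
    return reward - beta * log_ratio

def ppo_clipped_term(ratio, advantage, clip_range):
    """PPO clipped surrogate for one sample.

    ratio: new-policy probability divided by old-policy probability.
    The clip keeps a single update from moving the policy too far.
    """
    clipped = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return min(ratio * advantage, clipped * advantage)

# Illustrative numbers only.
shaped = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
print(shaped)                              # 1.0 - 0.1 * 1.0 = 0.9
print(ppo_clipped_term(1.5, 2.0, 0.2))     # ratio clipped to 1.2, giving 2.4
print(ppo_clipped_term(0.5, -1.0, 0.2))    # ratio clipped to 0.8, giving -0.8
```

A larger `beta` keeps the tuned policy closer to the reference model at the cost of a lower reward model score, which is one reason the referee's request for a sensitivity analysis is relevant.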

axioms (1)

domain assumption: Human preferences over text outputs can be accurately represented by a scalar reward function trained on pairwise rankings.
Invoked when the reward model is trained and then used as the objective in PPO.
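
The axiom above can be made concrete with a pairwise logistic (Bradley-Terry style) loss, a common way to fit a scalar reward to rankings. This is a sketch under that assumption; this page does not state the exact loss the authors used, and the function names and reward values are hypothetical.

```python
import math

def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), computed stably.

    The loss is small when the preferred output scores well above the
    rejected one and large when the order is reversed.
    """
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def ranking_loss(rewards_best_to_worst):
    """Mean pairwise loss over every pair implied by a full ranking.

    rewards_best_to_worst: scalar rewards for the outputs of one prompt,
    ordered from the labeler's most preferred to least preferred.
    """
    k = len(rewards_best_to_worst)
    losses = [
        pairwise_ranking_loss(rewards_best_to_worst[i], rewards_best_to_worst[j])
        for i in range(k)
        for j in range(i + 1, k)
    ]
    return sum(losses) / len(losses)

# Rewards that agree with the labeler's order give a lower loss
# than rewards that reverse it.
print(ranking_loss([2.0, 1.0, 0.0]))
print(ranking_loss([0.0, 1.0, 2.0]))
```

The loss depends only on reward differences, so the learned reward is defined up to an additive constant.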

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

LawOfExistence defect_zero_iff_one echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback... outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation

DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 203 Pith papers

[1]

Hey, what are you doing there?

Theo 3. brainstorming Tell me a list of topics related to: - interior design - sustainable ecosystems - fake plants brainstorming Name some rare gems classification This is a tweet sentiment classifier. {tweet} Sentiment: negative === {tweet} Sentiment: neutral === {tweet} Sentiment: classification The following is a list of products and the kind of product ...

work page 2048
[2]

We created a dataset of prompts and completions, where some of prompts or completions were sensitive (i.e

Agreement on sensitive speech flagging. We created a dataset of prompts and completions, where some of prompts or completions were sensitive (i.e. anything that could elicit strong negative feelings, whether by being toxic, sexual, violent, judgemental, political, etc.). We labeled this data for sensitivity ourselves, and measured agreement between us and labelers

work page
[3]

We take prompts submitted to our API, and several model completions, and have labelers rank the completions by overall quality

Agreement on rankings. We take prompts submitted to our API, and several model completions, and have labelers rank the completions by overall quality. We measure their agreement with researcher labels

work page
[4]

demonstration score

Sensitive demonstration writing. We created a small set of sensitive prompts, where responding to the outputs appropriately would require nuance. We then rated each demonstration on a 1-7 Likert scale, and computed an average “demonstration score” for each labeler

work page
[5]

For what topics or cultural groups are you comfortable identifying sensitive speech?

Self-assessed ability to identify sensitive speech for different groups. We wanted to select a team of labelers that were collectively able to identify sensitive content in a broad range of areas. For legal reasons, we can’t hire contractors based on demographic criteria. Thus, we had labelers answer the question: “For what topics or cultural groups a...

work page 2020
[6]

Who announced Ginsburg’s death?

work page
[7]

What was Ginsburg’s statement to her granddaughter?

work page
[8]

Why will Ginsburg’s death have profound consequences for the court and the country? Labeler demonstration

work page
[9]

The Supreme Court announced her death

work page
[11]

Ruth Bader Ginsburg was the leader of the liberal wing, and because the court is about to open a new term, the chief justice no longer holds the controlling vote in contested cases GPT-3 175B completion:

work page
[12]

Chief Justice John Roberts

work page
[13]

My most fervent wish is that I will not be replaced until a new president is installed

"My most fervent wish is that I will not be replaced until a new president is installed."

work page
[14]

InstructGPT 175B completion:

Because of the Supreme Court vacancy Ginsburg’s death creates. InstructGPT 175B completion:

work page
[15]

The Supreme Court announced Ginsburg’s death

work page
[16]

Ginsburg’s statement was a plea for the president not to replace her before his term ends

work page
[17]

anxiety lump

Ginsburg’s death means there will no longer be a clear majority on the court, throwing upcoming decisions into doubt. Figure 48: Labeler-written prompt from our dataset, along with the human-written demonstration, and completions from GPT-3 175B and InstructGPT 175B. Prompt is lightly cherry-picked (5 selected from 15 to show a diverse range of tasks), and...

work page
