Training language models to follow instructions with human feedback

Carroll L. Wainwright, Diogo Almeida, Jeff Wu, Long Ouyang, Pamela Mishkin, Xu Jiang · 2022 · cs.CL · arXiv 2203.02155

Canonical reference. 93% of citing Pith papers cite this work as background.

217 Pith papers citing it

Background 93% of classified citations

abstract

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
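The abstract says that rankings of model outputs drive the reinforcement-learning stage, but it does not say how a ranking becomes a training signal. Below is a minimal sketch, assuming a pairwise logistic (Bradley-Terry style) loss over scalar reward scores, which is a common choice for reward modeling. The function name and the plain-float scores are hypothetical stand-ins for the outputs of a learned reward model.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(rewards_best_to_worst):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over all ranked pairs.

    `rewards_best_to_worst` holds one scalar score per model output,
    ordered as the labeler ranked them (most preferred first).
    This is an illustrative assumption, not a detail taken from the abstract.
    """
    # combinations() keeps input order, so the first element of each pair
    # is always the output the labeler ranked higher.
    pairs = list(combinations(rewards_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked outputs")

    total = 0.0
    for r_preferred, r_rejected in pairs:
        margin = r_preferred - r_rejected
        # -log(sigmoid(m)) == log(1 + exp(-m)), written in a numerically stable form.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(pairs)


if __name__ == "__main__":
    # Scores that agree with the labeler's ranking give a lower loss
    # than scores that reverse it.
    print(pairwise_ranking_loss([2.0, 1.0, 0.0]))
    print(pairwise_ranking_loss([0.0, 1.0, 2.0]))
```

The loss shrinks as the scores of preferred outputs pull ahead of the scores of rejected ones, so minimizing it pushes a reward model to reproduce the labelers' ordering.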

citation-role summary

background 54 · method 1 · other 1

citation-polarity summary

background 52 · unclear 3 · use method 1
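The 93% background figure quoted near the top of this page can be checked against the two sets of counts. Below is a small sketch, assuming the share is the number of background citations divided by all classified citations. The page does not state its formula, and the helper name is hypothetical.

```python
# Counts copied from the citation-role and citation-polarity summaries.
role_counts = {"background": 54, "method": 1, "other": 1}
polarity_counts = {"background": 52, "unclear": 3, "use method": 1}


def background_share(counts):
    """Rounded percentage of classified citations labeled 'background'."""
    total = sum(counts.values())
    return round(100 * counts["background"] / total)


if __name__ == "__main__":
    print(background_share(role_counts))      # 54 / 56 -> 96
    print(background_share(polarity_counts))  # 52 / 56 -> 93
```

Under that assumption the polarity counts reproduce the 93% headline figure, while the role counts would give 96%.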

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

cs.SE · 2026-05-20 · conditional · novelty 8.0

RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0 · 2 refs

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Generative Agents: Interactive Simulacra of Human Behavior

cs.HC · 2023-04-07 · accept · novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Code as Policies: Language Model Programs for Embodied Control

cs.RO · 2022-09-16 · accept · novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

cond-mat.stat-mech · 2026-05-11 · unverdicted · novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.

Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

math.OC · 2026-05-09 · unverdicted · novelty 7.0

Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends on intrinsic manifold dimension.

Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reasoning tasks.

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor

cs.AI · 2026-04-30 · conditional · novelty 7.0

Political bias audits of LLMs largely capture sycophantic accommodation to the inferred political identity of the asker rather than any fixed model ideology.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Latent Space Probing for Adult Content Detection in Video Generative Models

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

citing papers explorer

Showing 32 of 32 citing papers after filters.

ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation

cs.AI · 2025-10-14 · unverdicted · none · ref 20 · internal anchor

ContractEval benchmark on 364 tasks shows code LLMs achieve 75-82% functional pass@1 but 0% contract satisfaction under standard prompting, rising only to 23-41% with explicit contracts.

EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention

cs.SE · 2025-08-22 · unverdicted · none · ref 39 · internal anchor

EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · none · ref 121 · internal anchor

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

cs.CL · 2025-12-08 · unverdicted · none · ref 34 · internal anchor

Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.

SAM 3D: 3Dfy Anything in Images

cs.CV · 2025-11-20 · unverdicted · none · ref 28 · internal anchor

SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

cs.CL · 2025-11-09 · conditional · none · ref 47 · internal anchor

TAQ estimates per-layer importance from hidden representations and output sensitivity on task calibration data to allocate mixed precision in a training-free PTQ setting, outperforming task-agnostic baselines on accuracy-memory ratio across benchmarks.

SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

cs.CV · 2025-10-18 · unverdicted · none · ref 43 · internal anchor

SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

cs.LG · 2025-10-09 · unverdicted · none · ref 14 · internal anchor

The paper defines a Gradient Gap for RLVR policy gradients and proves a sharp step-size threshold below which training converges and above which it collapses, with predictions for length and success-rate scaling validated in simulations and on Qwen2.5-Math-7B.

Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models

cs.CL · 2025-10-09 · unverdicted · none · ref 20 · internal anchor

GTD generates task-adaptive, sparse communication topologies for multi-LLM agents via guided iterative graph diffusion steered by a proxy model predicting accuracy, utility, and cost.

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

cs.CL · 2025-09-30 · unverdicted · none · ref 28 · internal anchor

KG-R1 trains a single RL agent to retrieve from and reason over knowledge graphs in one loop, achieving higher accuracy with fewer tokens than multi-module baselines and transferring to unseen graphs.

Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework

cs.CV · 2025-09-27 · unverdicted · none · ref 22 · internal anchor

DRP decouples reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

cs.RO · 2025-05-24 · conditional · none · ref 56 · internal anchor

VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

cs.CL · 2025-04-15 · unverdicted · none · ref 14 · internal anchor

ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.

Preference Learning Unlocks LLMs' Psycho-Counseling Skills

cs.CL · 2025-02-27 · conditional · none · ref 26 · internal anchor

A new expert-principle preference dataset enables an 8B LLM to reach 87% win rate vs GPT-4o on counseling responses through standard preference optimization.

Towards an AI co-scientist

cs.AI · 2025-02-26 · unverdicted · none · ref 137 · internal anchor

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

cs.CL · 2025-01-31 · conditional · none · ref 2 · internal anchor

Constitutional Classifiers trained on synthetic data from natural language constitutions defend LLMs against universal jailbreaks, with no successful bypass found in over 3000 hours of red teaming and only minor deployment overhead.

Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

cs.LG · 2025-12-12 · unverdicted · none · ref 28 · internal anchor

Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.

ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs

cs.CR · 2025-11-04 · unverdicted · none · ref 36 · internal anchor

ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.

Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs

cs.IT · 2025-11-03 · unverdicted · none · ref 32 · internal anchor

Proposes a semantic information theory for LLMs that substitutes the token for the bit as the atomic carrier of meaning, recasts the Transformer as an energy-based model, and derives directed rate-distortion and rate-reward functions using Massey's directed information.

LLM4Delay: Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation

cs.LG · 2025-10-24 · unverdicted · none · ref 33 · internal anchor

LLM4Delay improves flight delay prediction accuracy by using instance-level projection to adapt LLMs for integrating textual aeronautical information with multiple aircraft trajectories.

Polychromic Objectives for Reinforcement Learning

cs.LG · 2025-09-29 · unverdicted · none · ref 26 · internal anchor

Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.

Failure Modes of Maximum Entropy RLHF

cs.LG · 2025-09-24 · unverdicted · none · ref 38 · internal anchor

Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.

Confident, Calibrated, or Complicit: Safety Alignment and Ideological Bias in LLM Hate Speech Detection

cs.CL · 2025-08-31 · unverdicted · none · ref 7 · internal anchor

Censored LLMs achieve 69.0% strict accuracy in hate speech detection versus 64.1% for uncensored models and resist persona-based ideological influence better, but all exhibit overconfidence, irony failures, and group fairness disparities.

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

cs.LG · 2025-08-19 · unverdicted · none · ref 11 · internal anchor

DARS adaptively increases rollouts on hard problems in RLVR to improve Pass@K, and when paired with batch scaling for breadth, achieves gains in both Pass@K and Pass@1 by treating depth and breadth as complementary exploration dimensions.

MemOS: A Memory OS for AI System

cs.CL · 2025-07-04 · unverdicted · none · ref 28 · internal anchor

MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.

Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning

cs.LG · 2025-06-09 · unverdicted · none · ref 32 · internal anchor

Proposes token-significance and dynamic length rewards in RL to reduce LLM response length while preserving or improving reasoning correctness across benchmarks.

MIRROR: Converging Cognitive Principles as Computational Mechanisms for AI Reasoning

cs.AI · 2025-05-31 · unverdicted · none · ref 38 · internal anchor

MIRROR applies cognitive principles of parallel processing, reconstructive synthesis, and complementary learning to AI, yielding 21% relative gains on multi-turn constraint-maintenance tasks across seven models with supporting ablations.

Emerging Properties in Unified Multimodal Pretraining

cs.CV · 2025-05-20 · unverdicted · none · ref 56 · internal anchor

BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

Preference Learning for AI Alignment: a Causal Perspective

cs.AI · 2025-06-06 · unverdicted · none · ref 9 · internal anchor

Advocates applying causal inference to preference learning for LLM alignment to diagnose generalization failures and guide better data practices.

Seed1.5-VL Technical Report

cs.CV · 2025-05-11 · unverdicted · none · ref 101 · internal anchor

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project

cs.DC · 2025-04-14 · unverdicted · none · ref 9 · internal anchor

Engineering report detailing HPC infrastructure, software choices, and performance measurements for training a 7B LLM using 3D parallelism on JUWELS Booster.

LLM Harms: A Taxonomy and Discussion

cs.CY · 2025-12-05 · unreviewed · ref 20 · internal anchor
