
Training language models to follow instructions with human feedback

Carroll L. Wainwright, Diogo Almeida, Jeff Wu, Long Ouyang, Pamela Mishkin, Xu Jiang · 2022 · cs.CL · arXiv 2203.02155

Canonical reference. 93% of citing Pith papers cite this work as background.

262 Pith papers citing it

Background: 93% of classified citations


abstract

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
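
The abstract says that rankings of model outputs drive the second fine-tuning stage, but does not say how a ranking becomes a training signal. One common choice in RLHF pipelines is to train a reward model with a pairwise ranking loss, where every pair drawn from a ranking contributes -log(sigmoid(r_preferred - r_rejected)). The sketch below illustrates that loss on plain floats. The function names and the input convention (scores listed from best-ranked to worst-ranked) are assumptions made for illustration, not the paper's code.

```python
import math
from itertools import combinations


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ranking_loss(rewards_best_to_worst):
    """Average pairwise loss for one labeler ranking.

    `rewards_best_to_worst` holds hypothetical reward-model scores for the
    outputs of a single prompt, ordered from the labeler's most preferred
    output to the least preferred one.
    """
    # combinations() keeps input order, so the first element of each pair
    # is always the output the labeler ranked higher.
    pairs = list(combinations(rewards_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked outputs")
    total = 0.0
    for r_preferred, r_rejected in pairs:
        margin = r_preferred - r_rejected
        # -log(sigmoid(margin)) equals softplus(-margin)
        total += softplus(-margin)
    return total / len(pairs)


# Scores that agree with the ranking give a lower loss than scores that
# contradict it.
agreeing = pairwise_ranking_loss([2.0, 1.0, 0.0])
contradicting = pairwise_ranking_loss([0.0, 1.0, 2.0])
print(agreeing < contradicting)  # True
```

A reward model trained this way would then supply the scalar reward for the reinforcement learning step; that step is not shown here.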


citation-role summary

background: 54 · method: 1 · other: 1

citation-polarity summary

background: 52 · unclear: 3 · use method: 1
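
Both summaries cover the same 56 classified citations. As a quick arithmetic check, the sketch below turns the counts into rounded percentage shares. The page does not say which tally the 93% headline figure comes from; the polarity counts reproduce it (52 of 56), while the role counts give 96%. The helper name `shares` is hypothetical.

```python
def shares(counts):
    """Return each label's share of the total as a rounded percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


# Counts copied from the two summaries on this page.
role_counts = {"background": 54, "method": 1, "other": 1}
polarity_counts = {"background": 52, "unclear": 3, "use method": 1}

print(shares(role_counts)["background"])      # 96
print(shares(polarity_counts)["background"])  # 93
```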

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

cs.LG · 2026-06-01 · unverdicted · novelty 8.0

KV cache quantization silently erodes LLM safety alignment via vulnerable low-dimensional subspaces, diagnosed by Per-Channel Reduction into three failure modes and mitigated training-free with up to 97% recovery.

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

cs.SE · 2026-05-20 · conditional · novelty 8.0

RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0 · 2 refs

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Generative Agents: Interactive Simulacra of Human Behavior

cs.HC · 2023-04-07 · accept · novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Code as Policies: Language Model Programs for Embodied Control

cs.RO · 2022-09-16 · accept · novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

What Drives Interactive Improvement from Feedback?

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

cs.AI · 2026-06-08 · conditional · novelty 7.0

A reliable-to-expressive curriculum with dynamic rubrics trains a 12B safety judge to achieve 94%+ accuracy with only 0.76 cross-rubric variance on three different rubric prompts.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.

AIP: A Graph Representation for Learning and Governing Agent Skills

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

AIP models skills as graphs of discrete steps connected by typed I/O edges under a validated schema, raising agent mean reward from 0.60 to 0.71 and pass rate from 53% to 67% on 27 SkillsBench tasks while enabling node-level fixes.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

Sampling 20,000 stories shows 11 words dominate LLM outputs across models, linked to preference data and demonstrating alignment's disproportionate effect on diversity.

BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

BC Protocol uses dual-expert structured dialogue to elicit more natural CoT than solo expert writing, demonstrated by large gains in naturalness ratings in a controlled fiction-domain experiment.

Large Language Model Selection with Limited Annotations

cs.CL · 2026-05-24 · unverdicted · novelty 7.0

SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.

citing papers explorer

Showing 38 of 38 citing papers after filters.

What Drives Interactive Improvement from Feedback?

cs.AI · 2026-06-29 · unverdicted · none · ref 14 · internal anchor

Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · none · ref 14 · internal anchor

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

cs.AI · 2026-06-08 · conditional · none · ref 8 · internal anchor

A reliable-to-expressive curriculum with dynamic rubrics trains a 12B safety judge to achieve 94%+ accuracy with only 0.76 cross-rubric variance on three different rubric prompts.

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

cs.AI · 2026-06-03 · unverdicted · none · ref 60 · internal anchor

LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

AIP: A Graph Representation for Learning and Governing Agent Skills

cs.AI · 2026-06-03 · unverdicted · none · ref 17 · internal anchor

AIP models skills as graphs of discrete steps connected by typed I/O edges under a validated schema, raising agent mean reward from 0.60 to 0.71 and pass rate from 53% to 67% on 27 SkillsBench tasks while enabling node-level fixes.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · none · ref 4 · internal anchor

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · none · ref 29 · internal anchor

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor

cs.AI · 2026-04-30 · conditional · none · ref 10 · internal anchor

Political bias audits of LLMs largely capture sycophantic accommodation to the inferred political identity of the asker rather than any fixed model ideology.

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

cs.AI · 2026-04-22 · unverdicted · none · ref 10 · internal anchor

HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.

Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

cs.AI · 2026-04-06 · unverdicted · none · ref 5 · internal anchor

Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent self-diagnosed bugs and maintained cross-channel context.

ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation

cs.AI · 2025-10-14 · unverdicted · none · ref 20 · internal anchor

ContractEval benchmark on 364 tasks shows code LLMs achieve 75-82% functional pass@1 but 0% contract satisfaction under standard prompting, rising only to 23-41% with explicit contracts.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · none · ref 244 · internal anchor

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

A Generalist Agent

cs.AI · 2022-05-12 · accept · none · ref 44 · internal anchor

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

cs.AI · 2026-05-19 · unverdicted · none · ref 26 · internal anchor

POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.

Towards Human-Level Book-Writing Capability

cs.AI · 2026-05-16 · unverdicted · none · ref 4 · 2 links · internal anchor

A supervised fine-tuning approach using inverted multi-resolution planning scaffolds from public-domain novels trains models to generate book-length stories with more human-like literary qualities than standard instruction-tuned LLMs.

Iterative Finetuning is Mostly Idempotent

cs.AI · 2026-05-01 · unverdicted · none · ref 7 · internal anchor

Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.

Pioneer Agent: Continual Improvement of Small Language Models in Production

cs.AI · 2026-04-10 · unverdicted · none · ref 69 · internal anchor

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.

Teaching an Agent to Sketch One Part at a Time

cs.AI · 2026-03-19 · unverdicted · none · ref 22 · internal anchor

A multi-modal LM agent is trained to produce vector sketches part-by-part via supervised fine-tuning and process-reward RL on the new ControlSketch-Part dataset with automatic part annotations.

Towards an AI co-scientist

cs.AI · 2025-02-26 · unverdicted · none · ref 137 · internal anchor

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

cs.AI · 2024-08-01 · conditional · none · ref 14 · internal anchor

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · none · ref 251 · internal anchor

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

cs.AI · 2023-02-03 · conditional · none · ref 34 · internal anchor

DEPS combines LLM-based interactive planning with a trainable goal selector to create a zero-shot multi-task agent that completes 70+ Minecraft tasks and nearly doubles prior performance.

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

cs.AI · 2023-01-31 · conditional · none · ref 43 · internal anchor

The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.

AI Native Games: A Survey and Roadmap

cs.AI · 2026-07-01 · unverdicted · none · ref 63 · internal anchor

The paper proposes a counterfactual definition of AI-native games, screens 53 examples, introduces a G/N taxonomy, and outlines a research roadmap for the field.

POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation

cs.AI · 2026-06-22 · unverdicted · none · ref 33 · internal anchor

POTracker fine-tunes an LLM with POTrackerLoss combining textual and structural similarity, achieving up to 86.47% structural accuracy on 1,000 power outage reports and outperforming baselines by up to 51%.

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

cs.AI · 2026-06-06 · unverdicted · none · ref 34 · internal anchor

CICL scores and compresses context evidence for LLM agents via action-shift and outcome-uplift metrics, lifting hit@1 from 0.58 to 0.78 on 50 SWE-bench retrieval tasks.

Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

cs.AI · 2026-05-26 · unverdicted · none · ref 4 · internal anchor

Introduces the Gumbel Machine and β-Hindsight control algorithm to generate rubric-consistent counterfactual student writing similar to a given reference text.

ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking

cs.AI · 2026-05-21 · unverdicted · none · ref 9 · internal anchor

ECPO is a listwise policy optimization method that couples ranking utility with span-level evidence certificate validity and a deterministic verifier reward on MAVEN-ERE and RAMS datasets.

Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

cs.AI · 2026-05-10 · unverdicted · none · ref 14 · internal anchor

Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.

Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis

cs.AI · 2026-04-12 · unverdicted · none · ref 16 · internal anchor

Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.

MIRROR: Converging Cognitive Principles as Computational Mechanisms for AI Reasoning

cs.AI · 2025-05-31 · unverdicted · none · ref 38 · internal anchor

MIRROR applies cognitive principles of parallel processing, reconstructive synthesis, and complementary learning to AI, yielding 21% relative gains on multi-turn constraint-maintenance tasks across seven models with supporting ablations.

A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI

cs.AI · 2026-05-29 · unverdicted · none · ref 5 · internal anchor

Proposes a state-space constrained emulation framework for pluralistic AI evaluation using synthetic cognitive profiles and reports instability in persona coherence under sequential and perturbed inference.

Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

cs.AI · 2026-05-04 · unverdicted · none · ref 10 · internal anchor

A fine-tuned 4B model matches or exceeds frontier LLMs in terminal execution subagent tasks for coding agents, reducing main agent token usage by 30% with no performance loss.

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

cs.AI · 2026-05-01 · unverdicted · none · ref 21 · 2 links · internal anchor

AEM adaptively modulates response-level entropy in agentic RL to improve credit assignment and exploration-exploitation balance, yielding gains on ALFWorld, WebShop, and SWE-bench.

Preference Learning for AI Alignment: a Causal Perspective

cs.AI · 2025-06-06 · unverdicted · none · ref 9 · internal anchor

Advocates applying causal inference to preference learning for LLM alignment to diagnose generalization failures and guide better data practices.

Position: Hippocampal Explicit Memory Is the Cornerstone for AGI

cs.AI · 2026-06-05 · unverdicted · none · ref 62 · internal anchor

Explicit memory modeled on the hippocampus is the cornerstone needed to advance LLMs to AGI because their implicit statistical learning cannot produce higher cognitive functions.

From Pixels to Prompts: Vision-Language Models

cs.AI · 2026-05-08 · unverdicted · none · ref 7 · internal anchor

An explanatory book that supplies a clear mental map and intuition for how Vision-Language Models combine vision and language capabilities.

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

cs.AI · 2026-04-09 · unreviewed · ref 25 · internal anchor
