Training language models to follow instructions with human feedback

Carroll L. Wainwright, Diogo Almeida, Jeff Wu, Long Ouyang, Pamela Mishkin, Xu Jiang · 2022 · cs.CL · arXiv 2203.02155

Canonical reference. 93% of citing Pith papers cite this work as background.

291 Pith papers citing it

Background 93% of classified citations

abstract

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

citation-role summary

background: 55 · method: 1 · other: 1

citation-polarity summary

background: 53 · unclear: 3 · use method: 1

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

cs.LG · 2026-06-01 · unverdicted · novelty 8.0

KV cache quantization silently erodes LLM safety alignment via vulnerable low-dimensional subspaces, diagnosed by Per-Channel Reduction into three failure modes and mitigated training-free with up to 97% recovery.

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

cs.SE · 2026-05-20 · conditional · novelty 8.0

RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0 · 2 refs

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Generative Agents: Interactive Simulacra of Human Behavior

cs.HC · 2023-04-07 · accept · novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Code as Policies: Language Model Programs for Embodied Control

cs.RO · 2022-09-16 · accept · novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

What Drives Interactive Improvement from Feedback?

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

Sculpting NeRF Geometry: Human-Preference Fine-Tuning of a 3D-Aware Face GAN

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

Fine-tunes EG3D using a human-preference reward on NeRF density to improve face geometry, achieving 74.4% user preference in pairwise tests with FID rising from 4.09 to 6.66.

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

LLM-as-an-Investigator improves diagnostic accuracy over direct prompting by using an evidence-first protocol of hypothesis generation, clarification questions, and iterative probability updates in technical problem solving.

Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

cs.AI · 2026-06-08 · conditional · novelty 7.0

A reliable-to-expressive curriculum with dynamic rubrics trains a 12B safety judge to achieve 94%+ accuracy with only 0.76 cross-rubric variance on three different rubric prompts.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.

AIP: A Graph Representation for Learning and Governing Agent Skills

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

AIP models skills as graphs of discrete steps connected by typed I/O edges under a validated schema, raising agent mean reward from 0.60 to 0.71 and pass rate from 53% to 67% on 27 SkillsBench tasks while enabling node-level fixes.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

Sampling 20,000 stories shows 11 words dominate LLM outputs across models, linked to preference data and demonstrating alignment's disproportionate effect on diversity.

BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

BC Protocol uses dual-expert structured dialogue to elicit more natural CoT than solo expert writing, demonstrated by large gains in naturalness ratings in a controlled fiction-domain experiment.

citing papers explorer

Showing 39 of 39 citing papers after filters.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures
cs.CL · 2026-05-31 · conditional · none · ref 1 · internal anchor
Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories
cs.CL · 2026-05-26 · unverdicted · none · ref 3 · internal anchor
Sampling 20,000 stories shows 11 words dominate LLM outputs across models, linked to preference data and demonstrating alignment's disproportionate effect on diversity.

BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data
cs.CL · 2026-05-25 · unverdicted · none · ref 1 · internal anchor
BC Protocol uses dual-expert structured dialogue to elicit more natural CoT than solo expert writing, demonstrated by large gains in naturalness ratings in a controlled fiction-domain experiment.

Large Language Model Selection with Limited Annotations
cs.CL · 2026-05-24 · unverdicted · none · ref 87 · internal anchor
SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
cs.CL · 2026-05-18 · unverdicted · none · ref 22 · internal anchor
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL · 2026-05-16 · unverdicted · none · ref 246 · internal anchor
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
cs.CL · 2026-05-04 · unverdicted · none · ref 6 · internal anchor
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
cs.CL · 2026-04-12 · unverdicted · none · ref 35 · internal anchor
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation
cs.CL · 2026-04-10 · accept · none · ref 23 · internal anchor
SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.

STEER: Structured Event Evidence for Video Reasoning via Multi-Objective Reinforcement Learning
cs.CL · 2026-04-06 · unverdicted · none · ref 1 · internal anchor
STEER represents videos as time-ordered event schemas and uses Pareto-Frontier guided Advantage Balancing in RL to train a 4B model that matches 7B baselines on video tasks with half the frames.

Alignment midtraining for animals
cs.CL · 2026-03-21 · unverdicted · none · ref 4 · internal anchor
Midtraining on 3000 synthetic animal compassion documents raises compassionate reasoning scores to 77% on ANIMA benchmark versus 40% for instruction tuning, with generalization to human compassion but degradation after additional tuning.

The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?
cs.CL · 2026-03-11 · unverdicted · none · ref 17 · internal anchor
The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs
cs.CL · 2026-06-30 · unverdicted · none · ref 77 · internal anchor
RLMF uses quality of model self-judgments to refine RL rankings and select training data, achieving SOTA faithful calibration while preserving accuracy and outperforming standard RL by up to 63%.

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation
cs.CL · 2026-06-11 · unverdicted · none · ref 39 · internal anchor
A fluency-aware optimization framework is introduced to minimize inter-chunk silences in simultaneous speech-to-speech translation by leveraging model-internal signals including linguistic diversity and temporal variability.

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework
cs.CL · 2026-06-10 · unverdicted · none · ref 61 · internal anchor
User memory in LLMs factors into three orthogonal axes where parametric adapters and retrieval show opposite strengths, with causal evidence from attention interventions and an alignment tax on RLHF models.

What Do People Actually Want From AI? Mapping Preference Plurality
cs.CL · 2026-06-04 · unverdicted · none · ref 66 · internal anchor
Open-ended preference data reveals substantial plurality in what people want from AI and divergent interpretations of shared values such as truthfulness.

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning
cs.CL · 2026-06-04 · unverdicted · none · ref 10 · internal anchor
CHASE uses co-evolutionary RL with GRPO to harden LLMs against black-box prompt-rewriting attacks, cutting mean StrongREJECT scores by 43.2% on held-out families while keeping zero false refusals on benign prompts.

Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair
cs.CL · 2026-06-03 · unverdicted · none · ref 16 · internal anchor
TRI trains LLMs on goal-conditioned fill-in-the-middle tasks via PSM token rearrangement and symbolic verification to surgically repair erroneous CoT segments.

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game
cs.CL · 2026-06-03 · unverdicted · none · ref 7 · internal anchor
LLMs produce human-like finite bids in the St. Petersburg game but shift toward rational behavior under controlled prompt changes, indicating surface-level outcome resemblance without mechanism-level alignment.

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
cs.CL · 2026-05-21 · unverdicted · none · ref 20 · internal anchor
Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggregate differences.

Token-weighted Direct Preference Optimization with Attention
cs.CL · 2026-05-21 · unverdicted · none · ref 24 · internal anchor
AttentionPO weights tokens in DPO using LLM attention as a pairwise judge, yielding better results on AlpacaEval, MT-Bench, and ArenaHard than prior preference optimization methods.

EmbGen: Teaching with Reassembled Corpora
cs.CL · 2026-05-19 · unverdicted · none · ref 18 · internal anchor
EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.

RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
cs.CL · 2026-05-06 · unverdicted · none · ref 1 · 3 links · internal anchor
RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
cs.CL · 2026-04-13 · unverdicted · none · ref 15 · internal anchor
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.

Unmasking Hallucinations: A Causal Graph-Attention Perspective on Factual Reliability in Large Language Models
cs.CL · 2026-04-05 · unverdicted · none · ref 2 · internal anchor
GCAN cuts LLM hallucination rates by 27.8% and raises factual accuracy by 16.4% on TruthfulQA and HotpotQA by using causal token graphs and a new Causal Contribution Score.

MemFactory: Unified Inference & Training Framework for Agent Memory
cs.CL · 2026-03-31 · unverdicted · none · ref 9 · internal anchor
MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization
cs.CL · 2026-06-29 · unverdicted · none · ref 40 · internal anchor
A single LLM rewrite of skill descriptions using false positive and negative cases matches manual optimization performance in production, with most other pipeline components adding little value.

Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models
cs.CL · 2026-06-03 · unverdicted · none · ref 69 · internal anchor
DIA is a training-free method that dynamically adjusts anchor positions in diffusion LLMs to improve format compliance and accuracy on reasoning benchmarks like GSM8K and MATH.

Pairwise Reference Alignment as a Model-Level Ordinal Observable
cs.CL · 2026-05-29 · unverdicted · none · ref 14 · internal anchor
Pairwise reference alignment is formulated as an ordinal observable equal to the probability that a model score agrees with reference preferences on triples (x, y+, y-), with centered statistics, margin extensions, estimators, and concentration bounds.

KARMA: Karma-Aligned Reward Model Adaptation
cs.CL · 2026-05-26 · unverdicted · none · ref 17 · internal anchor
KARMA adapts reward models from Reddit karma data to align LLMs with conversational pragmatics, finding that context-only rewards outperform karma-predictive ones downstream while reducing factuality across conditions.

When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift
cs.CL · 2026-05-25 · unverdicted · none · ref 4 · internal anchor
Weak-to-strong reward models succeed in-distribution but fail to transfer under preference shift due to source-domain feature pull; Representation Anchoring regularizer improves OOD performance.

Reducing Political Manipulation with Consistency Training
cs.CL · 2026-05-21 · unverdicted · none · ref 23 · 2 links · internal anchor
PCT is a reinforcement learning approach that trains LLMs for symmetric sentiment and helpfulness across paired opposing political prompts, reducing covert bias while preserving general performance.

Less Back-and-Forth: A Comparative Study of Structured Prompting
cs.CL · 2026-05-19 · unverdicted · none · ref 20 · internal anchor
Checklist-improved prompts achieve the highest mean rubric score (7.50/8) and best quality-effort tradeoff compared to raw prompts (5.67) and clarifying-question prompts (6.67) across four task types and three LLMs.

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning
cs.CL · 2026-05-16 · conditional · none · ref 148 · internal anchor
Fine-tuned transformers with multi-task learning recover substantial wording-derived signal for item difficulty at small sample sizes typical in applied testing.

Cross-Lingual Jailbreak Detection via Semantic Codebooks
cs.CL · 2026-04-28 · unverdicted · none · ref 14 · internal anchor
Semantic similarity to an English jailbreak codebook detects cross-lingual attacks with high accuracy on curated benchmarks but shows poor separability on diverse unsafe prompts.

DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining
cs.CL · 2026-04-24 · unverdicted · none · ref 18 · internal anchor
DeepImagine trains LLMs on counterfactual pairs from clinical trials using supervised fine-tuning and reinforcement learning to improve outcome prediction by approximating causal mechanisms.

Multilingual Refusal Alignment for Safer Large Language Models
cs.CL · 2026-04-24 · conditional · none · ref 57 · internal anchor
English-only safety alignment fails to transfer cross-lingually, while multilingual DPO training on the new RefusEU dataset improves safety across 12 European languages without degrading Global MMLU performance.

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward
cs.CL · 2026-04-23 · conditional · none · ref 17 · internal anchor
ProcessThinker assigns step-level rewards in GRPO by sampling continuations from each step prefix and using empirical success rates, improving video reasoning benchmarks without training a separate PRM.

Language-Specific Sentiment Polarity Biases in Encoder and Large Language Model Classification of Product Reviews
cs.CL · 2026-06-22 · unverdicted · none · ref 13 · internal anchor
LLMs show negative polarity bias in French and encoder models show positive bias in Japanese when classifying product review sentiment.
