pith. sign in

super hub Canonical reference

Training language models to follow instructions with human feedback

Canonical reference. 93% of citing Pith papers cite this work as background.

291 Pith papers citing it
Background 93% of classified citations
abstract

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

hub tools

citation-role summary

background 55 method 1 other 1

citation-polarity summary

claims ledger

  • abstract Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we u

authors

co-cited works

clear filters

representative citing papers

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

What Drives Interactive Improvement from Feedback?

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

Alpha-RTL: Test-Time Training for RTL Hardware Optimization

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.

AIP: A Graph Representation for Learning and Governing Agent Skills

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

AIP models skills as graphs of discrete steps connected by typed I/O edges under a validated schema, raising agent mean reward from 0.60 to 0.71 and pass rate from 53% to 67% on 27 SkillsBench tasks while enabling node-level fixes.

citing papers explorer

Showing 17 of 17 citing papers after filters.

  • DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines cs.CL · 2023-10-05 · conditional · none · ref 38 · internal anchor

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  • Llemma: An Open Language Model For Mathematics cs.CL · 2023-10-16 · unverdicted · none · ref 165 · internal anchor

    Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.

  • Reinforced Self-Training (ReST) for Language Modeling cs.CL · 2023-08-17 · unverdicted · none · ref 17 · internal anchor

    ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

  • Simple synthetic data reduces sycophancy in large language models cs.CL · 2023-08-07 · unverdicted · none · ref 28 · internal anchor

    Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

  • Orca: Progressive Learning from Complex Explanation Traces of GPT-4 cs.CL · 2023-06-05 · conditional · none · ref 5 · internal anchor

    A 13B model called Orca learns detailed reasoning from GPT-4 explanation traces and reaches parity with ChatGPT on Big-Bench Hard while outperforming other 13B models.

  • Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 92 · internal anchor

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  • The False Promise of Imitating Proprietary LLMs cs.CL · 2023-05-25 · conditional · none · ref 65 · internal anchor

    Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.

  • Improving Factuality and Reasoning in Language Models through Multiagent Debate cs.CL · 2023-05-23 · unverdicted · none · ref 22 · internal anchor

    Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.

  • Enhancing Chat Language Models by Scaling High-quality Instructional Conversations cs.CL · 2023-05-23 · conditional · none · ref 251 · internal anchor

    UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

  • Towards Expert-Level Medical Question Answering with Large Language Models cs.CL · 2023-05-16 · unverdicted · none · ref 55 · internal anchor

    Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.

  • mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality cs.CL · 2023-04-27 · unverdicted · none · ref 9 · internal anchor

    mPLUG-Owl introduces a two-stage modular training paradigm that aligns images with text in LLMs via frozen visual modules followed by LoRA fine-tuning, achieving strong multimodal instruction following.

  • HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face cs.CL · 2023-03-30 · unverdicted · none · ref 2 · internal anchor

    HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.

  • ART: Automatic multi-step reasoning and tool-use for large language models cs.CL · 2023-03-16 · unverdicted · none · ref 157 · internal anchor

    ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

  • REPLUG: Retrieval-Augmented Black-Box Language Models cs.CL · 2023-01-30 · conditional · none · ref 55 · internal anchor

    REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.

  • Secrets of RLHF in Large Language Models Part I: PPO cs.CL · 2023-07-11 · unverdicted · none · ref 16 · internal anchor

    Policy constraints are the critical factor for stable PPO training in RLHF, and the proposed PPO-max variant improves stability for large language model alignment.

  • Self-Refine: Iterative Refinement with Self-Feedback cs.CL · 2023-03-30 · unverdicted · none · ref 30 · internal anchor

    Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.

  • A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 68 · internal anchor

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.