pith. sign in

super hub Canonical reference

Training language models to follow instructions with human feedback

Canonical reference. 93% of citing Pith papers cite this work as background.

218 Pith papers citing it
Background 93% of classified citations
abstract

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

hub tools

citation-role summary

background 54 method 1 other 1

citation-polarity summary

claims ledger

  • abstract Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we u

authors

co-cited works

clear filters

representative citing papers

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

citing papers explorer

Showing 15 of 15 citing papers after filters.

  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models cs.CL · 2022-01-28 · accept · none · ref 45 · internal anchor

    Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

  • Generative Agents: Interactive Simulacra of Human Behavior cs.HC · 2023-04-07 · accept · none · ref 85 · internal anchor

    Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

  • Code as Policies: Language Model Programs for Embodied Control cs.RO · 2022-09-16 · accept · none · ref 22 · internal anchor

    Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

  • SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation cs.CL · 2026-04-10 · accept · none · ref 23 · internal anchor

    SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.

  • Let's Verify Step by Step cs.LG · 2023-05-31 · accept · none · ref 15 · internal anchor

    Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

  • Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models cs.CV · 2023-03-08 · accept · none · ref 29 · internal anchor

    Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

  • A Generalist Agent cs.AI · 2022-05-12 · accept · none · ref 44 · internal anchor

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  • Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code cs.SE · 2026-05-06 · accept · none · ref 95 · internal anchor

    A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

  • Zephyr: Direct Distillation of LM Alignment cs.LG · 2023-10-25 · accept · none · ref 14 · internal anchor

    Zephyr-7B achieves state-of-the-art chat benchmark results among 7B models by distilling alignment via dDPO on AI feedback preferences, surpassing the 70B Llama-2-Chat model on MT-Bench with no human data required.

  • SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks cs.LG · 2023-10-05 · accept · none · ref 6 · internal anchor

    SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

  • Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them cs.CL · 2022-10-17 · accept · none · ref 20 · internal anchor

    Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.

  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned cs.CL · 2022-08-23 · accept · none · ref 42 · internal anchor

    RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.

  • Scaling Laws and Interpretability of Learning from Repeated Data cs.LG · 2022-05-21 · accept · none · ref 4 · internal anchor

    Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.

  • A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 230 · internal anchor

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

  • A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 68 · internal anchor

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.