Training language models to follow instructions with human feedback
Pith reviewed 2026-05-10 16:43 UTC · model grok-4.3
The pith
Fine-tuning GPT-3 on human demonstrations and output rankings produces InstructGPT models that humans prefer over the original 175B GPT-3 even at 1.3B parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors collect labeler demonstrations of desired behavior on a mix of written prompts and API-submitted prompts, use them for supervised fine-tuning of GPT-3, then gather rankings of model outputs and apply reinforcement learning from human feedback to obtain InstructGPT. In human evaluations on their prompt distribution, the 1.3B InstructGPT is preferred to the 175B GPT-3, with gains in truthfulness, reductions in toxic generation, and minimal regressions on public NLP datasets.
What carries the argument
Two-stage fine-tuning that begins with supervised learning on human demonstrations of desired outputs and continues with reinforcement learning from human rankings of model responses.
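The second stage starts from rankings rather than scalar labels. In the InstructGPT setup, a labeler's ranking of K outputs for one prompt is expanded into all K(K-1)/2 pairwise comparisons. A reward model is then trained so that the preferred output in each pair scores higher. The sketch below assumes the pairwise logistic loss -log sigmoid(r_w - r_l), averaged over the pairs from one ranking. The output IDs and scores are hypothetical placeholders for a real reward model's predictions.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def ranking_to_pairs(ranked: Sequence[str]) -> List[Tuple[str, str]]:
    """Expand a best-to-worst ranking of K outputs into all C(K, 2) (winner, loser) pairs."""
    # itertools.combinations keeps input order, so the first element of every
    # pair is the output the labeler ranked higher.
    return list(combinations(ranked, 2))


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ranking_loss(ranked: Sequence[str], reward: Dict[str, float]) -> float:
    """Mean of -log sigmoid(r_winner - r_loser) over every pair implied by one ranking."""
    pairs = ranking_to_pairs(ranked)
    return sum(-log_sigmoid(reward[w] - reward[l]) for w, l in pairs) / len(pairs)


# Hypothetical example: a labeler ranks four outputs, best first.
ranking = ["c", "a", "d", "b"]
scores = {"a": 0.4, "b": -1.2, "c": 1.5, "d": 0.1}  # made-up reward-model outputs
print(len(ranking_to_pairs(ranking)))  # 6 comparisons from one ranking of 4 outputs
print(ranking_loss(ranking, scores))   # below log(2) because the scores agree with the ranking
```

Training on all pairs from a ranking extracts more signal per labeler judgment than asking for one pairwise choice at a time.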
If this is right
- Smaller models aligned this way can outperform much larger unaligned models on human preference judgments.
- The resulting models generate more truthful content and fewer toxic outputs.
- Standard public NLP benchmarks show only minimal performance regressions after the alignment steps.
- Fine-tuning with human feedback offers a practical route to making language models follow user instructions more reliably.
Where Pith is reading between the lines
- The same collection and ranking process could be applied to other base models to test whether the preference gains hold beyond the GPT-3 family.
- If human feedback can be gathered at scale for more complex or domain-specific prompts, the method might reduce reliance on raw parameter count for capability gains.
- Extending the ranking step to capture longer-term user satisfaction rather than single-turn preferences could further tighten alignment.
Load-bearing premise
The preferences expressed by the human labelers on the prompts they saw accurately capture what a wide range of future users will want in real applications.
What would settle it
A new human evaluation on a fresh collection of prompts drawn from actual user interactions where InstructGPT outputs are not rated higher than those from the base GPT-3.
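One way to operationalize this test is an exact one-sided binomial test of the InstructGPT win rate against 50%. This assumes each fresh prompt yields one pairwise judgment and that ties are discarded beforehand. The function name and the counts are hypothetical.

```python
from math import comb


def one_sided_binomial_p(wins: int, n: int, p0: float = 0.5) -> float:
    """Exact P(X >= wins) for X ~ Binomial(n, p0).

    With p0 = 0.5 this is the probability of seeing at least `wins` preferences
    for InstructGPT out of `n` non-tied comparisons if raters had no real preference.
    """
    if not 0 <= wins <= n:
        raise ValueError("need 0 <= wins <= n")
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(wins, n + 1))


# Hypothetical counts from a fresh evaluation (ties already removed).
wins, n = 58, 100
p_value = one_sided_binomial_p(wins, n)
print(f"win rate {wins / n:.2f}, one-sided p = {p_value:.4f}")
```

A win rate at or below 0.5, or one that cannot be distinguished from 0.5 at a reasonable sample size, on real user prompts is the kind of outcome that would count against the headline claim.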
Original abstract
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InstructGPT models obtained by first performing supervised fine-tuning of GPT-3 on a dataset of human-written demonstrations of desired behavior, then further training via reinforcement learning from human feedback (RLHF) using a reward model trained on human preference rankings of model outputs. On a held-out set of prompts drawn from the same distribution (labeler-written and API-submitted), human evaluators prefer outputs from the 1.3B InstructGPT over those from the 175B GPT-3; the aligned models also exhibit higher truthfulness and lower toxicity with only small regressions on public NLP benchmarks.
Significance. If the reported human-preference results hold, the work supplies direct empirical evidence that RLHF can produce substantial alignment gains on instruction-following tasks, including the striking result that a 100x smaller model can be preferred to its much larger base model. The approach is grounded in independent human evaluations rather than circular derivations, and the public benchmarks provide a useful check against capability regression. This strengthens the case for human feedback as a practical alignment technique beyond pure scaling.
major comments (2)
- §4 (Human evaluations): The central preference comparison (1.3B InstructGPT preferred to 175B GPT-3) is reported without confidence intervals, sample sizes per comparison, or inter-rater agreement statistics. Because the main claim rests entirely on these human judgments, the absence of uncertainty quantification leaves open the possibility that the observed win rates are sensitive to sampling variability or labeler idiosyncrasies.
- §3.3 (RLHF stage): The reward model and PPO training both involve multiple free hyperparameters (learning rates, KL coefficient, etc.). While the paper lists the chosen values, it provides no ablation or sensitivity analysis showing that the reported preference gains are robust to reasonable changes in these choices. This weakens the claim that the gains are attributable to the RLHF procedure itself rather than to a narrow hyperparameter sweet spot.
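The KL coefficient raised in the §3.3 comment limits how far PPO may move the policy away from the supervised (SFT) model. In the InstructGPT paper, the RL reward is the reward-model score combined with a per-token KL penalty against the SFT model. The sketch below assumes the equivalent sequence-level form r(x, y) - beta * log[pi_RL(y|x) / pi_SFT(y|x)]. It omits the pretraining-mix term used in the PPO-ptx variant. The function name and all numbers are hypothetical.

```python
from typing import Sequence


def kl_penalized_reward(rm_score: float, policy_logprobs: Sequence[float],
                        sft_logprobs: Sequence[float], beta: float) -> float:
    """Reward-model score minus beta * log[pi_RL(y|x) / pi_SFT(y|x)] for one sampled response y.

    Summing per-token log-prob differences gives the sequence-level log-ratio,
    a single-sample estimate of KL(pi_RL || pi_SFT) for this prompt.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("both log-prob lists must score the same response tokens")
    log_ratio = sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * log_ratio


# Hypothetical numbers: the RL policy now prefers tokens the SFT model found less likely.
policy = [-0.5, -0.2, -1.0]
sft = [-1.0, -0.9, -1.4]
for beta in (0.0, 0.02, 0.1, 0.5):
    print(beta, round(kl_penalized_reward(2.0, policy, sft, beta), 3))
```

Sweeping beta, as the loop does, is the simplest form of the sensitivity analysis the comment asks for. A larger beta trades reward-model score for staying close to the SFT model.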
minor comments (2)
- Table 2 and Figure 3: the public-benchmark regressions are described as “minimal,” but the absolute deltas (e.g., on MMLU or TruthfulQA) should be stated numerically in the text for quick assessment.
- §2.2: the prompt distribution is described only at a high level (“labeler-written and API-submitted”); a short appendix table characterizing prompt length, topic diversity, or task type would aid readers in judging external validity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive assessment of the work. We address each major comment below, proposing revisions where they strengthen the manuscript without requiring new large-scale experiments.
Point-by-point responses
- Referee: §4 (Human evaluations): The central preference comparison (1.3B InstructGPT preferred to 175B GPT-3) is reported without confidence intervals, sample sizes per comparison, or inter-rater agreement statistics. Because the main claim rests entirely on these human judgments, the absence of uncertainty quantification leaves open the possibility that the observed win rates are sensitive to sampling variability or labeler idiosyncrasies.
Authors: We agree that uncertainty quantification would improve the reporting of the human preference results. The evaluations were performed on a held-out set of prompts with multiple labelers, and we have the underlying data to compute bootstrap confidence intervals, exact sample sizes (prompts and pairwise comparisons), and inter-rater agreement (e.g., Fleiss' kappa). We will add these statistics to Section 4 and the appendix in the revised manuscript. Revision: yes.
- Referee: §3.3 (RLHF stage): The reward model and PPO training both involve multiple free hyperparameters (learning rates, KL coefficient, etc.). While the paper lists the chosen values, it provides no ablation or sensitivity analysis showing that the reported preference gains are robust to reasonable changes in these choices. This weakens the claim that the gains are attributable to the RLHF procedure itself rather than to a narrow hyperparameter sweet spot.
Authors: The manuscript does not contain ablations on the RLHF hyperparameters; values were chosen via small-scale preliminary tuning informed by prior RLHF literature. We cannot conduct full sensitivity analyses without substantial new compute and human data collection. In revision we will expand Section 3.3 to better motivate the selected values, note the limitation, and point out that preference gains were observed consistently across model scales (1.3B, 6B, and 175B InstructGPT). Revision: partial.
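The statistics the authors offer to add are standard. The sketch below makes three assumptions. Each pairwise comparison is coded 1 when the InstructGPT output is preferred and 0 otherwise, with ties dropped. Agreement is measured with a fixed number of labelers per item. The function names and data are hypothetical. Resampling individual comparisons ignores the clustering of judgments by prompt and labeler, so a prompt-level bootstrap would be the more conservative choice on real data.

```python
import random
from typing import Sequence, Tuple


def bootstrap_win_rate_ci(outcomes: Sequence[int], n_boot: int = 10_000,
                          alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for a win rate; outcomes are 1 (InstructGPT preferred) or 0."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample comparisons with replacement and recompute the win rate each time.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    return rates[int(alpha / 2 * n_boot)], rates[int((1 - alpha / 2) * n_boot) - 1]


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa. counts[i][j] = number of labelers who put item i in category j."""
    m = sum(counts[0])  # labelers per item; must be the same for every item
    if m < 2 or any(sum(row) != m for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_items, n_cats = len(counts), len(counts[0])
    # Observed agreement: share of agreeing labeler pairs, averaged over items.
    p_bar = sum((sum(c * c for c in row) - m) / (m * (m - 1)) for row in counts) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [sum(row[j] for row in counts) / (n_items * m) for j in range(n_cats)]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)  # undefined if every rating falls in one category


outcomes = [1] * 57 + [0] * 43  # hypothetical: 57 of 100 comparisons favor InstructGPT
print(bootstrap_win_rate_ci(outcomes))
# Hypothetical: 3 labelers each pick "A better" or "B better" on 4 prompts.
print(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]))
```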
Circularity Check
No significant circularity in the empirical results or method
Full rationale
The paper presents an empirical pipeline—collecting labeler demonstrations for supervised fine-tuning of GPT-3, followed by collecting output rankings for reinforcement learning from human feedback—whose final performance claims rest on separate human preference evaluations conducted on held-out prompts from the authors' distribution. These evaluations directly compare the resulting 1.3B InstructGPT model against the 175B GPT-3 baseline and are not derived from or equivalent to the training objective itself. No equations, fitted parameters, or self-citations are invoked in a manner that reduces the reported preference gains, truthfulness improvements, or toxicity reductions to the input data by construction. The central result is therefore an independent measurement rather than a renaming or tautological restatement of the training process.
Axiom & Free-Parameter Ledger
free parameters (2)
- reward model training hyperparameters
- PPO hyperparameters
axioms (1)
- Domain assumption: human preferences over text outputs can be accurately represented by a scalar reward function trained on pairwise rankings.
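This axiom is, in effect, a Bradley-Terry assumption. Each output has one latent score s, and P(a preferred to b) = sigmoid(s_a - s_b). The sketch below recovers such scores from hypothetical comparison data by gradient ascent on the log-likelihood. A real reward model predicts the score from the prompt and response text. This toy instead learns one free score per output, so it illustrates the assumption rather than the paper's training code. The function name and data are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def fit_bradley_terry(comparisons: Sequence[Tuple[str, str]],
                      lr: float = 0.5, steps: int = 2000) -> Dict[str, float]:
    """Fit one score per item so that P(winner beats loser) = sigmoid(s[winner] - s[loser])."""
    items = sorted({x for pair in comparisons for x in pair})
    s = {x: 0.0 for x in items}
    for _ in range(steps):
        grad = {x: 0.0 for x in items}
        for w, l in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(s[w] - s[l])))
            # Gradient of log sigmoid(s_w - s_l): +(1 - p_win) for w, -(1 - p_win) for l.
            grad[w] += 1.0 - p_win
            grad[l] -= 1.0 - p_win
        for x in items:
            s[x] += lr * grad[x] / len(comparisons)
    # Only score differences are identified, so center the scores at zero.
    mean = sum(s.values()) / len(s)
    return {x: v - mean for x, v in s.items()}


# Hypothetical (winner, loser) judgments over three responses to one prompt.
data = [("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 3 + [("C", "B")] + [("A", "C")] * 2
scores = fit_bradley_terry(data)
print({k: round(v, 2) for k, v in sorted(scores.items())})
```

Some preference patterns cannot be captured by any single score per output, for example cyclic preferences where A beats B, B beats C, and C beats A. On such data the fitted scores predict comparisons poorly. That failure mode is exactly where the scalar-reward assumption would strain.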
Lean theorems connected to this paper
- LawOfExistencedefect_zero_iff_one (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback... outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
-
RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts
RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.
-
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
-
ORPO: Monolithic Preference Optimization without Reference Model
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
-
Generative Agents: Interactive Simulacra of Human Behavior
Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
-
Discovering Latent Knowledge in Language Models Without Supervision
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
-
Code as Policies: Language Model Programs for Embodied Control
Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
-
Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression
Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
-
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
-
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...
-
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
-
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
-
Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets
Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends...
-
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
-
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
-
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
-
Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor
Political bias audits of LLMs largely capture sycophantic accommodation to the inferred political identity of the asker rather than any fixed model ideology.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
Rates of forgetting for the sequentially Markov coalescent
SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.
-
R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
R2IF improves LLM function-calling accuracy by up to 34.62% on BFCL using a composite reward system with CER and SMV components optimized via GRPO, while increasing interpretability through positive CoT effectiveness.
-
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
-
Discrete Tilt Matching
Discrete Tilt Matching recasts dLLM fine-tuning as state-level matching of tilted local unmasking posteriors, producing a stable weighted cross-entropy loss that improves Sudoku and Countdown performance when applied ...
-
Discrete Tilt Matching
DTM recasts dLLM fine-tuning as weighted cross-entropy matching of tilted local posteriors, with demonstrated gains on Sudoku and math tasks.
-
S-GRPO: Unified Post-Training for Large Vision-Language Models
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
-
Reinforcement Learning via Value Gradient Flow
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
-
SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation
SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
-
MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis for Model Context Protocol Security
MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.
-
Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception
Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...
-
STEER: Structured Event Evidence for Video Reasoning via Multi-Objective Reinforcement Learning
STEER represents videos as time-ordered event schemas and uses Pareto-Frontier guided Advantage Balancing in RL to train a 4B model that matches 7B baselines on video tasks with half the frames.
-
Alignment midtraining for animals
Midtraining on 3000 synthetic animal compassion documents raises compassionate reasoning scores to 77% on ANIMA benchmark versus 40% for instruction tuning, with generalization to human compassion but degradation afte...
-
LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.
-
Fast Single Nitrogen-Vacancy Center Ramsey Characterization using a Physics-Informed Neural Network
NVRNet uses pretrained simulation-based U-Nets with attention and parameter-efficient adapters, followed by a transformer estimator, to reconstruct clean Ramsey waveforms and infer hyperfine parameters from minimal-sw...
-
The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?
The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.
-
CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training
CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.
-
"Tab, Tab, Bug": Security Pitfalls of Next Edit Suggestions in AI-Integrated IDEs
NES systems in AI IDEs expand attack surfaces via context poisoning from imperceptible actions and global codebase retrieval, with professional developers largely unaware of the risks.
-
ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation
ContractEval benchmark on 364 tasks shows code LLMs achieve 75-82% functional pass@1 but 0% contract satisfaction under standard prompting, rising only to 23-41% with explicit contracts.
-
EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention
EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
-
Massive Activations in Large Language Models
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Let's Verify Step by Step
Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
InCoder: A Generative Model for Code Infilling and Synthesis
InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on t...
-
Understanding Goal Generalisation in Sequential Reinforcement Learning
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable...
-
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggrega...
-
Token-weighted Direct Preference Optimization with Attention
AttentionPO weights tokens in Direct Preference Optimization using self-attention from pairwise judgments, claiming better results than prior PO methods on AlpacaEval, MT-Bench, and ArenaHard.
-
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME...
-
Preference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning
PRISM weights target examples by the current model's preference to build a better representation for influence-function scoring of training samples in efficient LLM fine-tuning.
-
TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health
TimeSRL uses semantic abstractions from time-series data optimized via reinforcement learning to achieve better cross-dataset generalization than standard ML or LLM baselines in mental health prediction.
-
Reinforcing Human Behavior Simulation via Verbal Feedback
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.