pith. sign in

super hub Canonical reference

Constitutional AI: Harmlessness from AI Feedback

Canonical reference. 83% of citing Pith papers cite this work as background.

523 Pith papers citing it
Background 83% of classified citations
abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

hub tools

citation-role summary

background 86 method 4 baseline 3 dataset 1 other 1

citation-polarity summary

claims ledger

  • abstract As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised

authors

co-cited works

clear filters

representative citing papers

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

Revocable Learned State via Process Sidecars

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Process sidecars use a secant-based two-parameter edit to achieve second-order accurate memory revocation after safety training, outperforming scalar task arithmetic on refusal tasks across three models.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

citing papers explorer

Showing 18 of 18 citing papers after filters.

  • Convex Optimization with Nested Evolving Feasible Sets cs.LG · 2026-05-08 · unverdicted · none · ref 6 · internal anchor

    For convex losses in nested evolving feasible sets, a lazy algorithm balances O(T^{1-β}) regret with O(T^β) movement for any β; for strongly convex or sharp losses, Frugal achieves zero regret with O(log T) movement, shown optimal by matching lower bound.

  • Theoretical Limits of Language Model Alignment cs.LG · 2026-05-08 · unverdicted · none · ref 5 · internal anchor

    The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

  • Reinforcement Learning via Value Gradient Flow cs.LG · 2026-04-15 · unverdicted · none · ref 5 · internal anchor

    VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

  • QLoRA: Efficient Finetuning of Quantized LLMs cs.LG · 2023-05-23 · conditional · none · ref 5 · internal anchor

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  • Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning cs.LG · 2026-05-11 · unverdicted · none · ref 17 · internal anchor

    METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.

  • Trustworthy AI: Ensuring Reliability and Accountability from Models to Agents cs.LG · 2026-05-09 · unverdicted · none · ref 8 · internal anchor

    The thesis presents a kernel method for multiaccuracy across overlooked subpopulations, information-theoretic optimal watermarking for LLMs, and a simulator showing LLM agents outperforming humans in supply chains while creating tail risks.

  • Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph cs.LG · 2026-05-08 · unverdicted · none · ref 1 · internal anchor

    GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.

  • Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 6 · internal anchor

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.

  • Why Does Agentic Safety Fail to Generalize Across Tasks? cs.LG · 2026-05-07 · conditional · none · ref 9 · internal anchor

    Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.

  • Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs cs.LG · 2026-04-12 · unverdicted · none · ref 3 · internal anchor

    LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.

  • ExecTune: Effective Steering of Black-Box LLMs with Guide Models cs.LG · 2026-04-09 · unverdicted · none · ref 5 · internal anchor

    ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on math and code tasks.

  • Autolearn: Learn by Surprise, Commit by Proof cs.LG · 2026-04-02 · unverdicted · none · ref 22 · internal anchor

    Autolearn uses high-loss passages and self-generated Q&A training to drive the perturbation gap below baseline, improving novel fact acquisition while suppressing memorization in language models.

  • Jailbroken: How Does LLM Safety Training Fail? cs.LG · 2023-07-05 · unverdicted · none · ref 7 · internal anchor

    LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.

  • ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data cs.LG · 2026-04-19 · unverdicted · none · ref 7 · internal anchor

    ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.

  • Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning cs.LG · 2026-04-17 · unverdicted · none · ref 32 · internal anchor

    CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-critical settings.

  • Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 5 · internal anchor

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

  • LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems cs.LG · 2026-01-20 · unverdicted · none · ref 12 · internal anchor

    A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.

  • Reinforcement Learning from Human Feedback cs.LG · 2025-04-16 · unreviewed · ref 24 · internal anchor