Constitutional AI: Harmlessness from AI Feedback

Bai Y, Kadavath S, Kundu S · 2022 · cs.CL · arXiv 2212.08073

Canonical reference. 83% of citing Pith papers cite this work as background.

428 Pith papers citing it

Background 83% of classified citations

abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
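The two phases described in the abstract can be sketched as data-building loops. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` callable, the prompt wording, the principle text, and the toy stand-ins (`toy_sl_model`, `make_toy_policy`, `toy_judge`) are all hypothetical, and the actual finetuning, preference-model training, and RL steps are omitted.

```python
import itertools
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: prompt in, completion out.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principle: str) -> str:
    """Supervised phase, one example: sample, self-critique, then revise."""
    response = model(prompt)
    critique = model(
        f"{prompt}\n{response}\nCritique the response using this principle: {principle}"
    )
    return model(
        f"{prompt}\n{response}\n{critique}\nRewrite the response to address the critique."
    )


def build_sl_dataset(
    model: Model, prompts: List[str], principles: List[str], rng: random.Random
) -> List[Tuple[str, str]]:
    """(prompt, revised response) pairs on which the original model would be finetuned."""
    return [(p, critique_and_revise(model, p, rng.choice(principles))) for p in prompts]


def build_preference_dataset(
    policy: Model, judge: Model, prompts: List[str], principles: List[str], rng: random.Random
) -> List[Tuple[str, str, str]]:
    """RL phase data: (prompt, chosen, rejected) triples labeled by a model, not a human."""
    data = []
    for p in prompts:
        a, b = policy(p), policy(p)  # two samples from the finetuned model
        principle = rng.choice(principles)
        verdict = judge(
            f"{p}\n(A) {a}\n(B) {b}\nWhich is better under this principle: {principle} Answer A or B."
        )
        chosen, rejected = (a, b) if verdict.strip().upper().startswith("A") else (b, a)
        data.append((p, chosen, rejected))
    return data


# Toy stand-ins so the sketch runs without a real model (assumptions, not real APIs).
def toy_sl_model(prompt: str) -> str:
    if "Rewrite the response" in prompt:
        return "revised answer"
    if "Critique the response" in prompt:
        return "critique: too blunt"
    return "draft answer"


def make_toy_policy() -> Model:
    samples = itertools.cycle(
        ["Sure, here is how.", "I won't help, because it could hurt someone."]
    )
    return lambda prompt: next(samples)


def toy_judge(prompt: str) -> str:
    # Toy rule: prefer option A only if it explains an objection.
    option_a = next(line for line in prompt.splitlines() if line.startswith("(A)"))
    return "A" if "because" in option_a else "B"


if __name__ == "__main__":
    rng = random.Random(0)
    principles = ["Prefer the response that is least harmful."]  # hypothetical wording
    print(build_sl_dataset(toy_sl_model, ["q1"], principles, rng))
    print(build_preference_dataset(make_toy_policy(), toy_judge, ["q1"], principles, rng))
```

In a real pipeline, the triples from `build_preference_dataset` would train a preference model whose score then serves as the RL reward, which is the RLAIF step the abstract names.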

citation-role summary

background 86 · method 4 · baseline 3 · dataset 1 · other 1

citation-polarity summary

background 79 · unclear 5 · use method 4 · baseline 3 · support 3 · use dataset 1
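As a quick arithmetic check, the headline "Background 83%" figure appears to line up with the polarity counts rather than the role counts. The sketch below assumes each share is computed over all classified citations in its summary, which the page does not state. The helper name `shares` is hypothetical.

```python
# Counts copied from the two summaries on this page.
role_counts = {"background": 86, "method": 4, "baseline": 3, "dataset": 1, "other": 1}
polarity_counts = {
    "background": 79, "unclear": 5, "use method": 4,
    "baseline": 3, "support": 3, "use dataset": 1,
}


def shares(counts: dict) -> dict:
    """Percentage share of each label, rounded to the nearest whole percent."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


print(sum(role_counts.values()), sum(polarity_counts.values()))  # 95 95
print(shares(polarity_counts)["background"])  # 83
print(shares(role_counts)["background"])  # 91
```

Both summaries total 95 classified citations, so 79 of 95 gives the 83% shown in the header under this assumption.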

authors

Bai Y, Kadavath S, Kundu S, et al.

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

cs.SE · 2026-05-20 · conditional · novelty 8.0

RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

The Statistical Cost of Adaptation in Multi-Source Transfer Learning

math.ST · 2026-05-10 · unverdicted · novelty 8.0

Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations

cs.CY · 2026-04-19 · unverdicted · novelty 8.0

LLM mental health simulations produce individually plausible patients but systematically misrepresent real population distributions, with reduced variance, unstable diagnoses, and demographic biases.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.

Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

cs.AI · 2026-03-17 · unverdicted · novelty 8.0

Invisible orchestrators raise collective dissociation in LLM agent groups, suppress protective actions, and produce internal risks undetectable by output-based checks.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

Revocable Learned State via Process Sidecars

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Process sidecars use a secant-based two-parameter edit to achieve second-order accurate memory revocation after safety training, outperforming scalar task arithmetic on refusal tasks across three models.

CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

CRAFT is a three-pillar credit assignment scheme that uses counterfactual token importance from GRPO sibling rollouts to provide signed per-token distillation signals in self-distilled agentic RL.

Self-Stigma Is Not a Monolith, but Generic Empathy Is: Persona-Conditioned LLM Support for People Who Use Drugs

cs.CL · 2026-06-22 · unverdicted · novelty 7.0 · 2 refs

Four self-stigma personas identified via LPA on 1,174 Reddit users; persona-conditioned LLMs achieve targeted shifts but experts prefer generic empathy baselines.

DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

cs.SE · 2026-06-02 · unverdicted · novelty 7.0

DDOR is a delta-debugging framework that localizes minimal refusal-triggering fragments for explainable overrefusal testing and targeted prompt repair in black-box LLMs.

Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

LLMs in a pre-specified cheap-talk benchmark over-reveal by 1.8-4.2x relative to the most-informative equilibrium, producing NMI of 0.78-0.94 against oracle values of 0.18-0.53 and exhibiting bias-tracking exaggeration rather than strategic coarsening.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

DirectorBench is a profile-aware diagnostic benchmark that localizes bottlenecks in long-form video generation workflows using structured checkpoints and multi-agent evaluation.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

citing papers explorer

Showing 33 of 33 citing papers after filters.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · none · ref 2 · internal anchor

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · none · ref 6 · internal anchor

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

cs.CR · 2026-04-07 · unverdicted · full · ref 3 · internal anchor

No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.

Measuring Safety Alignment Effects in Autonomous Security Agents

cs.CR · 2026-05-19 · conditional · none · ref 6 · internal anchor

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

Do Coding Agents Understand Least-Privilege Authorization?

cs.CR · 2026-05-14 · unverdicted · none · ref 52 · internal anchor

Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.

Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

cs.CR · 2026-05-12 · unverdicted · none · ref 2 · internal anchor

Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.

Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

cs.CR · 2026-05-09 · unverdicted · none · ref 37 · internal anchor

Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utility trade-off when trying to eliminate them.

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

cs.CR · 2026-05-04 · unverdicted · none · ref 11 · internal anchor

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · none · ref 29 · internal anchor

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails

cs.CR · 2026-05-17 · conditional · none · ref 2 · internal anchor

LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

cs.CR · 2026-05-15 · unverdicted · none · ref 30 · internal anchor

Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three LLMs.

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

cs.CR · 2026-05-11 · unverdicted · none · ref 85 · internal anchor

DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

The Authorization-Execution Gap Is a Major Safety and Security Problem in Open-World Agents

cs.CR · 2026-05-10 · conditional · none · ref 4 · internal anchor

Open-world agents suffer from an Authorization-Execution Gap arising from delegation incompleteness, channel corruption, and composition fragmentation, requiring dynamic runtime integrity checks instead of only upfront filters or post-hoc audits.

You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

cs.CR · 2026-05-06 · unverdicted · none · ref 14 · internal anchor

NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.

Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

cs.CR · 2026-05-05 · unverdicted · none · ref 21 · internal anchor

Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.

LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning

cs.CR · 2026-05-01 · unverdicted · none · ref 3 · internal anchor

Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

cs.CR · 2026-04-23 · unverdicted · none · ref 6 · internal anchor

Transient Turn Injection is a new attack that evades LLM moderation by spreading harmful intent over multiple isolated turns using automated agents.

SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

cs.CR · 2026-04-22 · unverdicted · none · ref 1 · internal anchor

SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.

Parallax: Why AI Agents That Think Must Never Act

cs.CR · 2026-04-14 · unverdicted · none · ref 23 · internal anchor

Parallax enforces structural separation between AI thinking and acting via independent multi-tier validation, information flow control, and state rollback, blocking 98.9% of 280 adversarial attacks with zero false positives even when the reasoning system is fully compromised.

Agent-Sentry: Bounding LLM Agents via Execution Provenance

cs.CR · 2026-03-24 · unverdicted · none · ref 3 · internal anchor

Agent-Sentry bounds LLM agent executions via structural provenance classification, sensitive-value allowlists, and selective LLM judgment, blocking 94.3% of injections while allowing 95.1% of benign actions on AgentDojo and AgentDyn.

SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models

cs.CR · 2025-12-10 · unverdicted · none · ref 25 · internal anchor

SCOUT uses token saliency analysis to detect both standard and contextually-plausible backdoor attacks in language models while maintaining clean accuracy.

Defending Against Indirect Prompt Injection Attacks With Spotlighting

cs.CR · 2024-03-20 · unverdicted · none · ref 12 · internal anchor

Spotlighting prompt transformations cut indirect prompt injection success rates from >50% to <2% on GPT models while preserving task performance.

Rethinking Fraud Safety Evaluation: Multi-Round Attacks Reveal Safety-Utility Tradeoffs in Graph-Context LLM Defenders

cs.CR · 2026-05-20 · unverdicted · none · ref 19 · internal anchor

Graph-context LLM fraud defenders improve early refusal under replay and adaptive multi-round attacks compared to text baselines but increase benign over-refusal, with the cost localized to how the LLM consumes structured graph fields rather than encoder quality.

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

cs.CR · 2026-05-17 · unverdicted · none · ref 5 · internal anchor

Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.

SoK: Robustness in Large Language Models against Jailbreak Attacks

cs.CR · 2026-05-06 · accept · none · ref 5 · internal anchor

The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

Position: No Retroactive Cure for Infringement during Training

cs.CR · 2026-04-20 · unverdicted · none · ref 14 · internal anchor

Post-hoc mitigation cannot retroactively cure infringement that occurred during unauthorized data ingestion and training because liability attaches to data lineage and retained expressive value in model weights.

ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs

cs.CR · 2025-11-04 · unverdicted · none · ref 5 · internal anchor

ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.

Mitigating Watermark Forgery in Generative Models via Randomized Key Selection

cs.CR · 2025-07-10 · unverdicted · none · ref 5 · internal anchor

Randomized per-query key selection with single-key detection acceptance bounds forgery success rate independently of collected samples while preserving model utility.

ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction

cs.CR · 2025-06-02 · unverdicted · none · ref 21 · internal anchor

ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.

LLM-Safety Evaluations Lack Robustness

cs.CR · 2025-03-04 · unverdicted · none · ref 8 · internal anchor

LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.

From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI

cs.CR · 2026-05-15 · unverdicted · none · ref 11 · internal anchor

The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.

When the Agent Is the Adversary: Architectural Requirements for Agentic AI Containment After the April 2026 Frontier Model Escape

cs.CR · 2026-04-25 · unverdicted · none · ref 7 · internal anchor

A reported 2026 frontier model escape shows that alignment training, sandboxing, tool interception, and audits fail against adversarial agentic AI, requiring five new architectural requirements for durable containment.

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

cs.CR · 2025-02-02 · unverdicted · none · ref 188 · internal anchor

A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.
