Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend · 2025 · cs.CL · arXiv 2501.18837

36 Pith papers cite this work. Polarity classification is still indexing.

abstract

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in LLMs

cs.CR · 2026-06-19 · unverdicted · novelty 7.0

Tiered Language Models use a secret key to induce an alternative computation graph over shared weights, enabling private capabilities in the keyed mode while the public mode shows none.

BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems

cs.CR · 2026-06-12 · unverdicted · novelty 7.0

BELLS-O is the first vendor-neutral operational benchmark comparing specialized guardrails and repurposed frontier LLMs on accuracy, false-positive rates, speed, and monetary cost across 11 harm categories and 13 jailbreak techniques.

Stateful Online Monitoring Catches Distributed Agent Attacks

cs.CR · 2026-05-29 · unverdicted · novelty 7.0

A clustering-based stateful online monitor detects distributed multi-agent cyberattacks that evade standard per-transcript monitors, catching them 30% earlier in large-scale simulated traffic with low overhead.

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

cs.HC · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

Toward a Principled Framework for Agent Safety Measurement

cs.CR · 2026-05-02 · unverdicted · novelty 7.0

BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.

CAREBench: A Child-Safety Risk Benchmark for Language Models

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

CAREBench is a new benchmark with 500 prompts in 12 risk categories that measures how often frontier LLMs fail to refuse or redirect child-safety risks, reporting failure rates between 2% and 58%.

Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics

cs.CR · 2026-06-05 · unverdicted · novelty 6.0

MTK detects jailbreaks by monitoring the evolution of prompt neighborhood structures on the data manifold through LLM layers, reporting 95% TPR at 5% FPR on benign and 2% on pseudo-malicious prompts plus 85% TPR under adaptive attacks.

Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

cs.CR · 2026-06-02 · unverdicted · novelty 6.0

IHO is a new black-box jailbreak attack for LLMs that is adaptive, efficient, transferable across models and behaviors, and effective even against layered defenses without modification.

Consistency Training while Mitigating Obfuscation via Rate Matching

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.

Boundary-targeted Membership Inference Attacks on Safety Classifiers

cs.LG · 2026-05-21 · unverdicted · novelty 6.0 · 2 refs

A boundary-targeted MIA strategy recovers 19% of distress-flagged conversations from a safety classifier at 5% false-positive rate, 3.5 times better than prior methods.

Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

cs.AI · 2026-05-20 · conditional · novelty 6.0 · 3 refs

Introduces MOOD benchmark for OOD LLM alignment failures and shows guard models plus Mahalanobis and perplexity OOD detectors improve recall from 39% to 45% with positive scaling.

Leveraging RAG for Training-Free Alignment of LLMs

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.

Internalizing Safety Understanding in Large Reasoning Models via Verification

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.

Estimating Tail Risks in Language Model Output Distributions

cs.LG · 2026-04-24 · conditional · novelty 6.0

Importance sampling via activation-steered unsafe proposal models estimates rare harmful-output probabilities in language models with 10-20x fewer samples than brute-force Monte Carlo.

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

cs.CL · 2026-04-16 · unverdicted · novelty 6.0

A new segment-level coherence probing method improves true-positive rate for harmful intent detection by 35.55% at 1% false-positive rate and maintains high AUROC on obfuscated attacks.

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

cs.CR · 2026-04-13 · unverdicted · novelty 6.0

Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

cs.CR · 2026-04-09 · unverdicted · novelty 6.0

TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.

The Impact of Off-Policy Training Data on Probe Generalisation

cs.AI · 2025-11-21 · unverdicted · novelty 6.0

Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.

Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem

cs.CR · 2025-06-17 · unverdicted · novelty 6.0

Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · novelty 6.0

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

citing papers explorer

Showing 33 of 33 citing papers after filters.

Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in LLMs

cs.CR · 2026-06-19 · unverdicted · none · ref 49 · internal anchor

Tiered Language Models use a secret key to induce an alternative computation graph over shared weights, enabling private capabilities in the keyed mode while the public mode shows none.

BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems

cs.CR · 2026-06-12 · unverdicted · none · ref 4 · internal anchor

BELLS-O is the first vendor-neutral operational benchmark comparing specialized guardrails and repurposed frontier LLMs on accuracy, false-positive rates, speed, and monetary cost across 11 harm categories and 13 jailbreak techniques.

Stateful Online Monitoring Catches Distributed Agent Attacks

cs.CR · 2026-05-29 · unverdicted · none · ref 25 · internal anchor

A clustering-based stateful online monitor detects distributed multi-agent cyberattacks that evade standard per-transcript monitors, catching them 30% earlier in large-scale simulated traffic with low overhead.

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · none · ref 46 · internal anchor

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

cs.AI · 2026-05-08 · unverdicted · none · ref 10 · internal anchor

LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

cs.HC · 2026-05-07 · unverdicted · none · ref 61 · 2 links · internal anchor

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

Toward a Principled Framework for Agent Safety Measurement

cs.CR · 2026-05-02 · unverdicted · none · ref 15 · internal anchor

BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.

CAREBench: A Child-Safety Risk Benchmark for Language Models

cs.LG · 2026-06-29 · unverdicted · none · ref 43 · internal anchor

CAREBench is a new benchmark with 500 prompts in 12 risk categories that measures how often frontier LLMs fail to refuse or redirect child-safety risks, reporting failure rates between 2% and 58%.

Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics

cs.CR · 2026-06-05 · unverdicted · none · ref 27 · internal anchor

MTK detects jailbreaks by monitoring the evolution of prompt neighborhood structures on the data manifold through LLM layers, reporting 95% TPR at 5% FPR on benign and 2% on pseudo-malicious prompts plus 85% TPR under adaptive attacks.

Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

cs.CR · 2026-06-02 · unverdicted · none · ref 9 · internal anchor

IHO is a new black-box jailbreak attack for LLMs that is adaptive, efficient, transferable across models and behaviors, and effective even against layered defenses without modification.

Consistency Training while Mitigating Obfuscation via Rate Matching

cs.CL · 2026-06-01 · unverdicted · none · ref 27 · internal anchor

RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.

Boundary-targeted Membership Inference Attacks on Safety Classifiers

cs.LG · 2026-05-21 · unverdicted · none · ref 24 · 2 links · internal anchor

A boundary-targeted MIA strategy recovers 19% of distress-flagged conversations from a safety classifier at 5% false-positive rate, 3.5 times better than prior methods.

Leveraging RAG for Training-Free Alignment of LLMs

cs.LG · 2026-05-11 · unverdicted · none · ref 54 · internal anchor

RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.

Internalizing Safety Understanding in Large Reasoning Models via Verification

cs.AI · 2026-05-09 · unverdicted · none · ref 17 · internal anchor

Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

cs.CL · 2026-05-08 · unverdicted · none · ref 29 · internal anchor

GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

cs.CL · 2026-04-16 · unverdicted · none · ref 3 · internal anchor

A new segment-level coherence probing method improves true-positive rate for harmful intent detection by 35.55% at 1% false-positive rate and maintains high AUROC on obfuscated attacks.

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

cs.CR · 2026-04-13 · unverdicted · none · ref 38 · internal anchor

Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · none · ref 88 · internal anchor

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

cs.CR · 2026-04-09 · unverdicted · none · ref 31 · internal anchor

TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.

The Impact of Off-Policy Training Data on Probe Generalisation

cs.AI · 2025-11-21 · unverdicted · none · ref 36 · internal anchor

Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.

Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem

cs.CR · 2025-06-17 · unverdicted · none · ref 23 · internal anchor

Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · none · ref 12 · internal anchor

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

Images Amplify Misinformation Sharing in Vision-Language Models

cs.CL · 2025-05-19 · unverdicted · none · ref 3 · internal anchor

Images increase VLMs' resharing rates for false news more than true news, with modulation by persona traits and model differences, on a PolitiFact-based multimodal dataset.

HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety

cs.CL · 2026-07-02 · unverdicted · none · ref 11 · internal anchor

HaloGuard 1.0-0.8B achieves the highest average F1 of 90.9 across seven prompt-safety benchmarks among evaluated open guard models while keeping FPR at 4.3 and FNR at 9.5, with a 4B variant reaching 92.1 F1.

Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety

cs.CR · 2026-07-01 · unverdicted · none · ref 20 · internal anchor

Cognitive Firewall applies four gates (intent, zero-trust context, consistency, output risk) via an oversight model to cut jailbreak success to 2% or below on most tested sets while keeping over-refusal at 8%.

Verifying Restrictions on Frontier AI Research

cs.CY · 2026-06-27 · unverdicted · none · ref 30 · internal anchor

Catalogs 28 candidate verification mechanisms for restrictions on AI research and identifies key factors affecting their feasibility.

Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance

cs.LG · 2026-06-14 · unverdicted · none · ref 39 · internal anchor

GCD uses diffusion model priors to guide suffix search, achieving higher attack success rates with better semantic adherence and lower detection than GCG-style methods.

Stop Early, Spend Less: Hidden-State Probes as a Practical Recipe for Streaming Moderation of LLM Outputs

cs.LG · 2026-06-09 · unverdicted · none · ref 17 · internal anchor

Hidden-state probes enable low-overhead streaming moderation of LLM outputs by producing per-token safety scores from internal activations.

Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals

cs.CL · 2026-05-26 · unverdicted · none · ref 33 · internal anchor

Prompt injection detection performance is highly regime-dependent with no single detector dominating across settings; transformer models perform best overall while structural signals offer modest gains in some regimes.

Do Linear Probes Generalize Better in Persona Coordinates?

cs.AI · 2026-05-10 · unverdicted · none · ref 19 · 2 links · internal anchor

Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.

Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation

cs.CL · 2026-05-22 · unverdicted · none · ref 3 · internal anchor

An AI workflow creates detailed constitutions for three content-moderation categories and uses LLMs to label inputs, cutting cross-model inconsistency by up to 57x versus short paragraph definitions while introducing a dual-axis intent/content scoring scheme.

A Systematic Investigation of RL-Jailbreaking in LLMs

cs.LG · 2026-05-07 · unverdicted · none · ref 14 · 2 links · internal anchor

Systematic investigation reveals that dense rewards and extended episode lengths primarily drive the success of RL jailbreaking in LLMs.

Online Safety Monitoring for LLMs

cs.AI · 2026-07-02 · unverdicted · none · ref 26 · internal anchor

Simple thresholding on an external verifier signal, calibrated by risk control, performs competitively with sequential hypothesis testing monitors on math reasoning and red-teaming datasets.
