WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert · 2024 · cs.CL · arXiv 2406.18495

24 Pith papers cite this work. Polarity classification is still indexing.

abstract

We introduce WildGuard -- an open, light-weight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate. Together, WildGuard serves the increasing needs for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure for evaluating safety behaviors in model responses. To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix is a combination of WildGuardTrain, the training data of WildGuard, and WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all the three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.
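The three judgments named in the abstract, and the rates derived from them, can be sketched in a few lines of Python. This is an illustrative sketch only: the `ModerationLabel` container, its field names, and the definition of attack success rate used here (harmful responses divided by harmful prompts) are assumptions made for the example, not WildGuard's actual interface or the paper's exact metric.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModerationLabel:
    """Hypothetical record holding the three judgments from the abstract."""
    prompt_harmful: bool    # task 1: malicious intent in the user prompt
    response_harmful: bool  # task 2: safety risk in the model response
    response_refusal: bool  # task 3: whether the model refused


def refusal_rate(labels: List[ModerationLabel]) -> float:
    """Fraction of all interactions in which the model refused."""
    if not labels:
        raise ValueError("need at least one label")
    return sum(label.response_refusal for label in labels) / len(labels)


def attack_success_rate(labels: List[ModerationLabel]) -> float:
    """Assumed definition: share of harmful prompts that drew a harmful response."""
    harmful = [label for label in labels if label.prompt_harmful]
    if not harmful:
        raise ValueError("need at least one harmful prompt")
    return sum(label.response_harmful for label in harmful) / len(harmful)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


# Toy labels, invented for illustration.
labels = [
    ModerationLabel(True, True, False),    # jailbreak succeeded
    ModerationLabel(True, False, True),    # harmful prompt, refused
    ModerationLabel(False, False, False),  # benign prompt, answered
    ModerationLabel(True, False, True),    # harmful prompt, refused
]
print(f"refusal rate: {refusal_rate(labels):.2f}")
print(f"attack success rate: {attack_success_rate(labels):.2f}")
print(f"relative reduction: {relative_reduction(79.8, 2.4):.3f}")
```

Applied to the figures quoted in the abstract, a drop from 79.8% to 2.4% is an absolute reduction of 77.4 percentage points and a relative reduction of about 97%.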

citation-role summary

background 1 baseline 1 dataset 1

citation-polarity summary

background 1 baseline 1 use dataset 1

representative citing papers

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)

cs.CR · 2026-05-19 · accept · novelty 7.0

Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.

Self-Mined Hardness for Safety Fine-Tuning

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

Self-mined hardness from model rollouts reduces WildJailbreak attack success rates to 1-3% on Llama models but increases over-refusal on benign prompts, which mixing with adversarially-framed benign prompts partially mitigates.

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.

Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives

cs.CR · 2026-04-18 · unverdicted · novelty 7.0

Governed MCP implements kernel-level governance for MCP tool calls in AI agents through a 6-layer pipeline including ProbeLogits semantic verification, with an ablation showing F1 drop from 0.773 to 0.327 without it and structural prevention of userspace bypasses.

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.

LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails

cs.CR · 2026-05-17 · conditional · novelty 6.0

LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.

Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

cs.CR · 2026-05-13 · unverdicted · novelty 6.0

Survival analysis applied to repeated jailbreak attacks on three LLMs shows one model degrades rapidly while the others maintain moderate vulnerability on HarmBench prompts.

Bayesian Model Merging

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Bayesian Model Merging introduces a bi-level optimization framework that merges task-specific models via closed-form Bayesian regression with an anchor prior and global hyperparameter search, outperforming baselines and nearly matching expert averages on up to 20-task vision and 5-task language Merg

Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data

cs.CR · 2026-05-11 · conditional · novelty 6.0

Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.

How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

LLM OOD detectors are length-confounded; a two-pathway embedding-plus-trajectory framework detects covert OOD inputs at 0.721 average AUROC and 0.850 on jailbreaks.

Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious

cs.CL · 2026-04-15 · accept · novelty 6.0

42% of significant turn-level associations in LLM conversation analysis are spurious due to unaccounted autocorrelation, with a validated two-stage correction framework improving replication.

ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems

cs.OS · 2026-04-13 · unverdicted · novelty 6.0

ProbeLogits performs single-pass logit reading inside the kernel to classify LLM agent actions as safe or dangerous, reaching 97-99% block rates on HarmBench and F1 parity or better than Llama Guard 3 at 2.5x lower latency.

Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

cs.CL · 2026-04-10 · unverdicted · novelty 6.0

LLMs display systematic, architecture-dependent gaps between their self-stated safety policies and observed behavior on harmful prompts, with absolute refusal claims frequently violated.

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 percentage point drop in safety-critical action hit rates.

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

cs.AI · 2026-04-03 · unverdicted · novelty 6.0

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

BarrierSteer: LLM Safety via Learning Barrier Steering

cs.LG · 2026-02-23 · unverdicted · novelty 6.0

BarrierSteer applies control barrier functions to LLM latent states for constraint-guided steering that reduces unsafe generations while preserving utility.

Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment

cs.LG · 2025-05-30 · unverdicted · novelty 6.0

Disentangled Safety Adapters decouple safety computations from task-optimized LLMs via lightweight adapters, yielding up to 53% better AUC on safety tasks and dynamic inference-time alignment with reduced performance trade-offs.

Peering Behind the Shield: Guardrail Identification in Large Language Models

cs.CR · 2025-02-03 · unverdicted · novelty 6.0

AP-Test identifies deployed guardrails in LLMs via adversarial prompt testing and a match score metric, reporting perfect accuracy on four open-source guardrails.

Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

cs.LG · 2026-04-08 · unverdicted · novelty 5.0

Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.

GLiNER Guard: Unified Encoder Family for Production LLM Safety and Privacy

cs.CR · 2026-05-06 · unverdicted · novelty 4.0

GLiNER Guard provides unified encoder variants for LLM safety and PII detection in a single pass, with high throughput on A100 hardware and a new PII-Bench benchmark.

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

cs.AI · 2024-10-24 · unverdicted · novelty 4.0

Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.

ShieldGemma: Generative AI Content Moderation Based on Gemma

cs.CL · 2024-07-31 · unverdicted · novelty 4.0

ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildGuard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.
