Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Brian Fuller; Davide Testuggine; Hakan Inan; Jianfeng Chi; Kartikeya Upasani; Krithika Iyer; Madian Khabsa; Michael Tontchev; Qing Hu; Rashi Rungta

arxiv: 2312.06674 · v1 · submitted 2023-12-07 · 💻 cs.CL · cs.AI

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa

Pith reviewed 2026-05-10 18:54 UTC · model grok-4.3

keywords: Llama Guard · LLM safeguard · safety risk taxonomy · prompt classification · response classification · content moderation · instruction tuning · AI safety

The pith

Llama Guard is a Llama2-7b model instruction-tuned on a safety-risk dataset that classifies risks in both user prompts and generated responses at levels matching or exceeding existing moderation tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Llama Guard as an LLM-based safeguard tailored to human-AI conversations. It defines a safety risk taxonomy that is used to classify risks in user prompts (prompt classification) and in the responses LLMs generate to those prompts (response classification). A compact, high-quality dataset built around this taxonomy is used to instruction-tune a Llama2-7b model. On established benchmarks the tuned model performs at or above current content-moderation systems while also supporting task customization and flexible output formats. The authors release the model weights so that others can adapt the approach to changing safety requirements.

Core claim

Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats, enabling the adjustment of taxonomy categories to align with specific use cases and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input.
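One way to read "binary decision scores" from a generative classifier is to look at the probability mass on its first output token. The sketch below is illustrative only: it assumes the decision is carried by a first token of either `safe` or `unsafe` (the text above does not specify this), and the logit values are invented.

```python
import math


def unsafe_score(first_token_logits: dict[str, float]) -> float:
    """Return P(unsafe), renormalized over the two decision tokens.

    Assumption: the first generated token is "safe" or "unsafe", so a
    two-way softmax over their logits gives a binary decision score.
    """
    safe = first_token_logits["safe"]
    unsafe = first_token_logits["unsafe"]
    m = max(safe, unsafe)  # subtract the max for numerical stability
    e_safe = math.exp(safe - m)
    e_unsafe = math.exp(unsafe - m)
    return e_unsafe / (e_safe + e_unsafe)


def decide(score: float, threshold: float = 0.5) -> str:
    """Turn the continuous score into a hard label at a chosen threshold."""
    return "unsafe" if score >= threshold else "safe"


logits = {"safe": 1.2, "unsafe": 3.4}  # hypothetical values, not model output
score = unsafe_score(logits)
print(decide(score), round(score, 3))
```

Keeping the score continuous lets a deployment pick its own threshold, trading false positives against false negatives, rather than relying on the generated label alone.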

What carries the argument

The safety risk taxonomy that categorizes risks in LLM prompts for prompt classification and in generated responses for response classification, paired with instruction-tuning of Llama2-7b on the collected dataset to produce multi-class labels and binary safety decisions.
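Because the taxonomy is supplied as part of the instruction, swapping categories can amount to rebuilding the prompt. The following is a minimal sketch under stated assumptions: the `Category` type, the `build_prompt` helper, the tag names, and the example categories are all hypothetical and are not the template used by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    code: str         # short label the model is asked to emit (hypothetical)
    name: str
    description: str


def build_prompt(categories: list[Category],
                 conversation: list[tuple[str, str]],
                 target: str) -> str:
    """Assemble a classification instruction for the given taxonomy.

    target selects prompt classification ("user") or response
    classification ("assistant").
    """
    if target not in ("user", "assistant"):
        raise ValueError("target must be 'user' or 'assistant'")
    taxonomy = "\n".join(
        f"{c.code}: {c.name}. {c.description}" for c in categories
    )
    turns = "\n".join(f"{role}: {text}" for role, text in conversation)
    return (
        f"Task: check whether the last '{target}' message is unsafe "
        "under the categories below.\n\n"
        f"<categories>\n{taxonomy}\n</categories>\n\n"
        f"<conversation>\n{turns}\n</conversation>\n\n"
        "Answer 'safe' or 'unsafe'; if unsafe, list the violated category codes."
    )


# Illustrative taxonomy; replace with the categories of a given use case.
taxonomy = [
    Category("C1", "Example category A", "Content that falls under policy A."),
    Category("C2", "Example category B", "Content that falls under policy B."),
]
print(build_prompt(taxonomy, [("user", "Hello there")], target="user"))
```

The same builder covers both tasks: passing a conversation that ends with an assistant turn and `target="assistant"` asks for response classification instead.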

Load-bearing premise

The high-quality dataset collected around the safety risk taxonomy is representative of real-world risks in human-AI conversations and benchmark performance will translate to effective safeguarding in deployed systems.

What would settle it

Running Llama Guard on a large set of live human-AI conversations that contain known harmful outcomes and measuring whether its classifications align with independent human judgments of those same interactions.
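The comparison described above reduces to agreement statistics between two label sequences. The sketch below computes precision, recall, F1, and Cohen's kappa for binary unsafe/safe labels; the function name and the sample labels are made up for illustration, and these are not necessarily the metrics the paper reports.

```python
import math


def agreement(model: list[bool], human: list[bool]) -> dict[str, float]:
    """Compare model flags with human judgments (True means "unsafe")."""
    if len(model) != len(human) or not model:
        raise ValueError("need two non-empty label lists of equal length")
    tp = sum(m and h for m, h in zip(model, human))
    fp = sum(m and not h for m, h in zip(model, human))
    fn = sum(not m and h for m, h in zip(model, human))
    tn = sum(not m and not h for m, h in zip(model, human))
    n = len(model)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = ((p_observed - p_chance) / (1 - p_chance)
             if p_chance < 1 else math.nan)  # undefined if only one class occurs
    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}


# Hypothetical labels for six conversations.
model_flags = [True, True, False, False, True, False]
human_flags = [True, False, False, False, True, True]
print(agreement(model_flags, human_flags))
```

Kappa is included because raw agreement can look high when most live conversations are benign; it discounts the agreement expected by chance.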

Original abstract

We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have meticulously gathered a dataset of high quality. Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats. This feature enhances the model's capabilities, such as enabling the adjustment of taxonomy categories to align with specific use cases, and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input. We are making Llama Guard model weights available and we encourage researchers to further develop and adapt them to meet the evolving needs of the community for AI safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Llama Guard releases an open 7B safety classifier for LLM conversations, instruction-tuned on a custom taxonomy, that is reported to match or exceed existing moderation tools on public benchmarks, but the writeup is light on numbers and dataset details.

The core contribution is an open weights release of a Llama 2 7B model fine-tuned to classify both prompts and responses for safety risks using a custom taxonomy. The model supports multi-class decisions plus binary scores, and the instruction tuning lets users swap taxonomies or output formats without retraining. They report it matches or exceeds existing moderation tools on the OpenAI Moderation Evaluation set and ToxicChat after training on their collected data, which they note is low-volume but high-quality.

The open release and customization angle are the practical wins here; anyone running conversational systems can drop this in as a filter without depending on closed APIs, and the joint input-output framing fits real chat flows better than prompt-only or response-only tools. The taxonomy itself is a reusable piece that others can build on or adapt.

The soft spots are straightforward. The abstract supplies no concrete metrics, no dataset size or construction details beyond the volume note, no error analysis, and no ablation on how the taxonomy was derived or how labels were aligned. That makes the performance claim hard to stress-test from the text alone, and it leaves open whether the benchmarks capture the distribution of risks that actually show up in deployed human-AI conversations. The generalization step from benchmark scores to production safeguarding is assumed rather than demonstrated.

This paper is aimed at practitioners who need an open, adaptable safety layer for chat models and at researchers who want a starting point for further tuning or taxonomy work. A reader focused on responsible deployment or open-source moderation tooling will get immediate value from the weights and the taxonomy description. It deserves a serious referee because the model is released, the claims are tied to public benchmarks, and the gaps are fixable with added numbers and methods rather than fundamental flaws in the approach.

Send it to review; the open release makes engagement worthwhile even if the current draft needs more evidence to stand on its own.
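The "drop this in as a filter" idea with joint input-output checking can be sketched as a thin wrapper around a chat model. Everything here is an assumption for illustration: `guarded_reply`, the refusal text, and the toy keyword classifier stand in for a real safeguard call and a real generator.

```python
from typing import Callable

Conversation = list[tuple[str, str]]          # (role, text) pairs
Classifier = Callable[[Conversation], bool]   # True means "unsafe"
Generator = Callable[[Conversation], str]

REFUSAL = "Sorry, I can't help with that."    # placeholder refusal message


def guarded_reply(history: Conversation, user_msg: str,
                  generate: Generator, is_unsafe: Classifier) -> str:
    """Check the prompt, generate, then check the response."""
    convo = history + [("user", user_msg)]
    if is_unsafe(convo):                            # prompt classification
        return REFUSAL
    reply = generate(convo)
    if is_unsafe(convo + [("assistant", reply)]):   # response classification
        return REFUSAL
    return reply


# Toy stand-ins so the sketch runs without any model.
def toy_classifier(convo: Conversation) -> bool:
    return "forbidden" in convo[-1][1].lower()


def toy_generator(convo: Conversation) -> str:
    return "You said: " + convo[-1][1]


print(guarded_reply([], "hello", toy_generator, toy_classifier))
print(guarded_reply([], "something forbidden", toy_generator, toy_classifier))
```

Passing the classifier the whole conversation, rather than the last message alone, lets the response check see the prompt that produced the reply.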

Referee Report

2 major / 1 minor

Summary. The paper introduces Llama Guard, a Llama2-7B model instruction-tuned on a collected high-quality (but low-volume) dataset for prompt and response classification in human-AI conversations. It defines a safety risk taxonomy to categorize risks and uses this for both input classification and output moderation. The central claim is that the resulting model matches or exceeds existing content moderation tools on the OpenAI Moderation Evaluation dataset and ToxicChat benchmark; the model supports multi-class classification with binary decision scores, allows taxonomy customization via instruction tuning, and the weights are released publicly.

Significance. If the benchmark results hold under detailed scrutiny, the work supplies a practical, open-weight, instruction-tunable safeguard that can be adapted to new taxonomies or use cases. The public release of weights is a concrete contribution to reproducibility and community experimentation in conversational AI safety.

major comments (2)

[Abstract] Abstract: the assertion that Llama Guard 'demonstrates strong performance' and 'matches or exceeds' existing tools on the OpenAI Moderation Evaluation dataset and ToxicChat is stated without any quantitative metrics (accuracy, F1, precision/recall, or comparison tables), dataset statistics, or error analysis. This information is load-bearing for the central empirical claim.
[Dataset and Experiments sections] Dataset construction and experimental sections: no details are supplied on the size of the collected dataset, label distribution, how the safety risk taxonomy was operationalized into labels, training hyperparameters, or exact benchmark numbers. These omissions prevent verification that the low-volume dataset supports the reported generalization.

minor comments (1)

[Abstract] The description of 'multi-class classification and generating binary decision scores' would benefit from an explicit example of the output format (e.g., a sample prompt and expected response).
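To make the referee's request concrete, here is one possible shape such an output could take and how a caller might parse it. The format is an assumption, not taken from the text above: a first line of `safe` or `unsafe`, optionally followed by a line of comma-separated category codes. The `Verdict` type and the codes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    unsafe: bool
    categories: tuple[str, ...]


def parse_verdict(text: str) -> Verdict:
    """Parse an assumed two-line verdict: label, then optional codes."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() not in ("safe", "unsafe"):
        raise ValueError(f"unrecognized verdict: {text!r}")
    if lines[0].lower() == "safe":
        return Verdict(unsafe=False, categories=())
    codes: tuple[str, ...] = ()
    if len(lines) > 1:
        codes = tuple(c.strip() for c in lines[1].split(",") if c.strip())
    return Verdict(unsafe=True, categories=codes)


print(parse_verdict("safe"))
print(parse_verdict("unsafe\nC1, C3"))
```

Rejecting anything other than the two expected labels keeps a malformed generation from being silently treated as safe.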

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve transparency and verifiability of our claims.

Referee: [Abstract] Abstract: the assertion that Llama Guard 'demonstrates strong performance' and 'matches or exceeds' existing tools on the OpenAI Moderation Evaluation dataset and ToxicChat is stated without any quantitative metrics (accuracy, F1, precision/recall, or comparison tables), dataset statistics, or error analysis. This information is load-bearing for the central empirical claim.

Authors: We agree that the abstract would benefit from explicit quantitative support for the performance claims. In the revised manuscript, we will add key metrics (e.g., accuracy and F1 scores) along with direct comparisons to existing moderation tools on both the OpenAI Moderation Evaluation dataset and ToxicChat. A brief reference to the error analysis and dataset statistics already detailed in the Experiments section will also be included in the abstract.

revision: yes

Referee: [Dataset and Experiments sections] Dataset construction and experimental sections: no details are supplied on the size of the collected dataset, label distribution, how the safety risk taxonomy was operationalized into labels, training hyperparameters, or exact benchmark numbers. These omissions prevent verification that the low-volume dataset supports the reported generalization.

Authors: We acknowledge the need for greater specificity in these sections to enable verification. The revised manuscript will expand the Dataset section to report the exact dataset size, label distributions, and the mapping from the safety risk taxonomy to classification labels. The Experiments section will include training hyperparameters and precise benchmark numbers with comparison tables. These additions will clarify how the high-quality, low-volume dataset supports the observed generalization.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces an empirically trained safeguard model (Llama Guard) via instruction-tuning on a custom safety dataset, then reports performance on independent external benchmarks (OpenAI Moderation Evaluation dataset and ToxicChat). No mathematical derivations, equations, or first-principles predictions exist in the text; the central claims rest on standard train-then-evaluate results against public test sets rather than any self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation chain. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim rests on the representativeness of the collected dataset and the assumption that instruction tuning on a modest amount of data yields reliable safety classification. No explicit free parameters beyond standard model training are mentioned.

axioms (1)

Domain assumption: Instruction tuning of LLMs on a modest dataset can produce effective multi-class safety classifiers.
Invoked to support performance claims despite low data volume.

invented entities (1)

Safety risk taxonomy (no independent evidence)
Purpose: Categorize risks in prompts and responses for classification.
New taxonomy introduced by the authors for this task.

pith-pipeline@v0.9.0 · 5587 in / 1267 out tokens · 87669 ms · 2026-05-10T18:54:29.249408+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Who Owns This Agent? Tracing AI Agents Back to Their Owners
cs.CR 2026-05 unverdicted novelty 8.0

A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
cs.CR 2026-05 conditional novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
cs.CR 2026-04 unverdicted novelty 8.0 full

No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
Towards Cross-lingual Values Judgment: A Consensus-Pluralism Perspective
cs.CL 2026-02 unverdicted novelty 8.0

X-Value is the first cross-lingual values judgment benchmark that reveals limitations and performance gaps in LLMs across languages and issue categories.
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
cs.AI 2026-05 unverdicted novelty 7.0

Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues
cs.SD 2026-05 unverdicted novelty 7.0

ToxiAlert-Bench dataset and dual-head neural network detect toxic speech by distinguishing textual versus paralinguistic sources, reporting 21.1% Macro-F1 and 13% accuracy gains over baselines.
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
cs.CR 2026-05 unverdicted novelty 7.0

Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
cs.CR 2026-05 unverdicted novelty 7.0

FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
cs.AI 2026-05 unverdicted novelty 7.0

Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
cs.CR 2026-05 conditional novelty 7.0

A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation
cs.LG 2026-05 unverdicted novelty 7.0

PropGuard is a propagation-aware framework for LLM-MAS that constructs dual-view spatio-temporal graphs, employs a GE-GRPO inspector to recover suspicious subgraphs, and applies source-guided remediation to lower atta...
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
cs.LG 2026-05 unverdicted novelty 7.0

DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
cs.CR 2026-05 unverdicted novelty 7.0

Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
Self-Mined Hardness for Safety Fine-Tuning
cs.LG 2026-05 unverdicted novelty 7.0

Self-mined hardness from model rollouts reduces WildJailbreak attack success rates to 1-3% on Llama models but increases over-refusal on benign prompts, which mixing with adversarially-framed benign prompts partially ...
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
cs.CR 2026-05 unverdicted novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG 2026-05 unverdicted novelty 7.0

MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
Social Bias in LLM-Generated Code: Benchmark and Mitigation
cs.SE 2026-05 unverdicted novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Jailbroken Frontier Models Retain Their Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Latent Space Probing for Adult Content Detection in Video Generative Models
cs.CV 2026-04 unverdicted novelty 7.0

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
cs.LG 2026-04 unverdicted novelty 7.0

SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning perform...
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
cs.CR 2026-04 unverdicted novelty 7.0

Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking
cs.CR 2026-04 unverdicted novelty 7.0

HarmChip is a new benchmark exposing an alignment paradox where LLMs refuse legitimate hardware security queries but comply with semantically disguised malicious requests.
Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives
cs.CR 2026-04 unverdicted novelty 7.0

Governed MCP implements kernel-level governance for MCP tool calls in AI agents through a 6-layer pipeline including ProbeLogits semantic verification, with an ablation showing F1 drop from 0.773 to 0.327 without it a...
Conjunctive Prompt Attacks in Multi-Agent LLM Systems
cs.MA 2026-04 unverdicted novelty 7.0

Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.
Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
cs.CV 2026-04 unverdicted novelty 7.0

Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.
LogAct: Enabling Agentic Reliability via Shared Logs
cs.DC 2026-04 unverdicted novelty 7.0

LogAct is a shared-log abstraction for LLM agents that makes actions visible before execution, allows decoupled stopping, enables consistent recovery, and supports LLM-driven introspection for reliability.
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
cs.CR 2026-04 unverdicted novelty 7.0

SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26...
Formal Policy Enforcement for Real-World Agentic Systems
cs.CR 2026-02 unverdicted novelty 7.0

FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.
AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments?
cs.CR 2026-02 accept novelty 7.0

AgentDyn benchmark demonstrates that current AI agent defenses against prompt injection fail to handle dynamic real-world conditions.
When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models
cs.CR 2025-10 unverdicted novelty 7.0

CREST-Search is a red-teaming framework that crafts seemingly benign search queries to induce unsafe citations from web-augmented LLMs, backed by a new WebSearch-Harm dataset for fine-tuning a specialized attacker model.
Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
cs.LG 2025-08 unverdicted novelty 7.0

TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.
Prompt Overflow: What the Guardrail Inspects Is Not What the Model Infers
cs.CR 2026-05 unverdicted novelty 6.0

Introduces Prompt Overflow Attack that fragments malicious instructions in overlength prompts to evade guardrail segmentation while remaining actionable to LLMs with larger context windows.
Boundary-targeted Membership Inference Attacks on Safety Classifiers
cs.LG 2026-05 unverdicted novelty 6.0

A boundary-targeted MIA on safety classifiers recovers 19% of distress-flagged conversations at 5% false-positive rate, 3.5 times higher than standard MIA baselines.
Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
cs.AI 2026-05 unverdicted novelty 6.0

MOOD benchmark shows guard models fail to generalize to OOD alignment failures in LLMs, but combining them with Mahalanobis and perplexity OOD detectors improves recall from 39% to 45% with better scaling than larger ...
LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection
cs.CR 2026-05 unverdicted novelty 6.0

LivePI benchmark shows indirect prompt injection attack success rates of 10.7% to 29.6% across five AI models in live test environments covering seven input surfaces and multiple malicious goals.
Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry
cs.LG 2026-05 unverdicted novelty 6.0

Geometry-Lite decomposes LLM safety detection into layer-wise margin geometries and finds that persistent boundary positions, not layer-to-layer drift, drive most detection performance across nine models and seven benchmarks.
VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 6.0

VerifyMAS improves failure attribution in LLM multi-agent systems via hypothesis verification on full trajectories, error taxonomy-based data construction, and fine-tuned verifier models, outperforming prior direct-pr...
LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
cs.CR 2026-05 conditional novelty 6.0

LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.
SLIP & ETHICS: Graduated Intervention for AI Emotional Companions
cs.HC 2026-05 conditional novelty 6.0

SLIP and ETHICS introduce a staged intervention system for AI emotional companions using qualitative affect and narrative signals, with small-scale deployment and synthetic tests showing zero false positives for norma...
The Great Pretender: A Stochasticity Problem in LLM Jailbreak
cs.CR 2026-05 conditional novelty 6.0

ASR metrics for LLM jailbreaks are inflated by stochasticity; CAS-eval reveals up to 30pp drops under multi-attempt criteria while CAS-gen recovers the performance loss.
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
cs.LG 2026-05 conditional novelty 6.0

On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
cs.AI 2026-05 unverdicted novelty 6.0

FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data
cs.CR 2026-05 conditional novelty 6.0

Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
cs.CV 2026-05 unverdicted novelty 6.0

UJEM-KL improves cross-model transferability of untargeted jailbreaks on vision-language models by maximizing entropy at decision tokens instead of forcing specific outputs.
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
cs.CR 2026-05 unverdicted novelty 6.0

DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
Positive Alignment: Artificial Intelligence for Human Flourishing
cs.AI 2026-05 unverdicted novelty 6.0

Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
cs.CR 2026-05 unverdicted novelty 6.0

MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
cs.CR 2026-05 unverdicted novelty 6.0

A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
cs.AI 2026-05 unverdicted novelty 6.0

Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
Internalizing Safety Understanding in Large Reasoning Models via Verification
cs.AI 2026-05 unverdicted novelty 6.0

Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment tha...
GLiGuard: Schema-Conditioned Classification for LLM Safeguard
cs.CL 2026-05 unverdicted novelty 6.0

GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
cs.CL 2026-05 unverdicted novelty 6.0

TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
cs.CL 2026-05 unverdicted novelty 6.0

TurnGate uses a new multi-turn intent dataset to detect the harm-enabling closure point in dialogues, outperforming baselines with low over-refusal and generalizing across domains.
Understanding Annotator Safety Policy with Interpretability
cs.AI 2026-05 unverdicted novelty 6.0

Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
cs.CR 2026-05 unverdicted novelty 6.0

NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...
Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
cs.CR 2026-05 unverdicted novelty 6.0

Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
cs.CR 2026-05 unverdicted novelty 6.0

ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
cs.CR 2026-05 accept novelty 6.0

JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 135 Pith papers · 4 internal anchors

[1]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403 ,

work page internal anchor Pith review arXiv
[2]

SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minne...

work page 2019
[3]

doi: 10.18653/v1/S19-2007

Association for Computational Linguistics. doi: 10.18653/v1/S19-2007. https://www.aclweb.org/anthology/S19-2007. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing sy...

work page doi:10.18653/v1/s19-2007 2007
[4]

doi: 10.18653/v1/W18-0802

Association for Computational Linguistics. doi: 10.18653/v1/W18-0802. https://aclanthology.org/W18-0802. Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models,

work page doi:10.18653/v1/w18-0802
[5]

doi: 10.18653/v1/W18-5102

Association for Computational Linguistics. doi: 10.18653/v1/W18-5102. https://www.aclweb.org/anthology/W18-5102. Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. Build it break it fix it for dialogue safety: Robustness from adversarial human attack,

work page doi:10.18653/v1/w18-5102
[6]

doi: 10.18653/v1/2021.acl-long.210

Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.210. https://aclanthology.org/2021.acl-long.210. Alon Halevy, Cristian Canton-Ferrer, Hao Ma, Umut Ozertem, Patrick Pantel, Marzieh Saeidi, Fabrizio Silvestri, and Ves Stoyanov. Preserving integrity in online social networks. Communications of the ACM, 65(2):92–98,

work page doi:10.18653/v1/2021.acl-long.210 2021
[7]

Exploring social bias in chatbots using stereotype knowledge

Nayeon Lee, Andrea Madotto, and Pascale Fung. Exploring social bias in chatbots using stereotype knowledge. In Amittai Axelrod, Diyi Yang, Rossana Cunha, Samira Shaikh, and Zeerak Waseem, editors, Proceedings of the 2019 Workshop on Widening NLP, pages 177–180, Florence, Italy, August

work page 2019
[8]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761,

work page internal anchor Pith review arXiv
[9]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Adva...

work page internal anchor Pith review arXiv
[11]

SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval)

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In Jonathan May, Ekaterina Shutova, Aurelie Herbelot, Xiaodan Zhu, Marianna Apidianaki, and Saif M. Mohammad, editors, Proceedings of the 13th International Works...

work page 2019
[12]

Zampieri, S

Association for Computational Linguistics. doi: 10.18653/v1/S19-2010. https://aclanthology.org/S19-2010. Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. Lima: Less is more for alignment,

work page doi:10.18653/v1/s19-2010 2010
