Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

2026 · cs.CR · arXiv 2605.24834

6 Pith papers cite this work. Polarity classification is still indexing.


abstract

Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable to adversarial jailbreak attacks that disguise malicious intent through role-play scenarios, fictional framing, and indirect requests. We present Reflect-Guard, a method that augments LLM-based safety classifiers with chain-of-thought self-reflection capabilities through parameter-efficient fine-tuning. Our approach distills analytical reasoning from GPT-4o-mini into structured reflection annotations, then trains Llama-Guard-3-8B via QLoRA to generate logical self-reflections before issuing safety verdicts. Using only 1000 training examples and updating just 0.5% of model parameters (~42M), Reflect-Guard achieves substantial improvements on two challenging benchmarks. On WildGuardTest, F1 score improves from 0.770 to 0.842 (+7.2 pp), with recall on adversarial prompts increasing from 0.513 to 0.921 (+40.8 pp). On JailbreakBench, the attack success rate drops from 10.3% to 1.8%, representing an 82.5% relative reduction. These gains are especially pronounced on adversarial inputs, where the explicit reasoning step enables the model to see through obfuscation techniques that defeat standard pattern-matching approaches. Our results demonstrate that teaching safety classifiers to reason about adversarial intent, rather than simply classify surface patterns, is a promising direction for robust LLM safety.
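The deltas quoted in the abstract can be re-derived from the before/after figures it reports. The following is a minimal sketch of that arithmetic only; the helper names `pp_gain` and `relative_reduction` are hypothetical and do not come from the paper, and the inputs are copied from the abstract rather than recomputed from any benchmark data.

```python
# Sanity check of the improvements stated in the abstract.
# All input figures are taken verbatim from the abstract; the helper
# names below are illustrative assumptions, not part of the paper.

def pp_gain(before: float, after: float) -> float:
    """Absolute change in percentage points for scores on a 0-1 scale."""
    return round((after - before) * 100, 1)

def relative_reduction(before: float, after: float) -> float:
    """Relative reduction, in percent, of a rate that decreased."""
    return round((before - after) / before * 100, 1)

# WildGuardTest: overall F1 and recall on adversarial prompts.
f1_gain = pp_gain(0.770, 0.842)
adv_recall_gain = pp_gain(0.513, 0.921)

# JailbreakBench: attack success rate, given in percent.
asr_reduction = relative_reduction(10.3, 1.8)

print(f"F1 gain: +{f1_gain} pp")
print(f"Adversarial recall gain: +{adv_recall_gain} pp")
print(f"ASR relative reduction: {asr_reduction}%")
```

Note the difference in units: the F1 and recall gains are absolute differences in percentage points, while the attack-success-rate figure is a relative reduction (8.5 points removed from a 10.3% baseline).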

representative citing papers

Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

FORCEBENCH shows model judges often violate expected ordering on evidence-calibrated vs force-raised claim pairs, with standard support prompting yielding 47.2% MVR and explicit warrant prompting reducing it to 24.5%.

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

cs.LG · 2026-06-13 · unverdicted · novelty 6.0

GPT-4o and Claude Sonnet 4 show similar susceptibility to bias on GSM8K (1.3% vs 1.2%) but differ sharply in acknowledgment rates (13% vs 75%) under a rubric-defined metric.

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

cs.SE · 2026-05-30 · unverdicted · novelty 6.0

About 18.2% of structurally flagged skill pairs represent genuine compositional safety risks in agent skill registries, with exploitation gated by host model behavior.

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

MedSynapse-V proposes meta-query prior memorization, causal counterfactual refinement via RL, and dual-branch memory transition to evolve implicit diagnostic memories in medical VLMs and boost accuracy over chain-of-thought baselines.

MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment

cs.LG · 2026-06-27 · unverdicted · novelty 5.0

MOSAIC combines frozen-LLM semantic embeddings with hierarchical consistency objectives to report up to 3.4% AUC gains on knowledge-tracing benchmarks including a new MOOC dataset.

DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation

cs.CL · 2026-06-27 · unverdicted · novelty 5.0

DriftGuard introduces multi-monitor safety-aware drift detection paired with hard-mix selective adaptation, reporting toxic recall gains to 0.8777 on Civil Comments and 0.8523 on DynaHate under temporal and cross-dataset shifts.

citing papers explorer


Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG
cs.AI · 2026-05-27 · unverdicted · none · ref 1 · internal anchor

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation
cs.LG · 2026-06-13 · unverdicted · none · ref 9 · internal anchor

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems
cs.SE · 2026-05-30 · unverdicted · none · ref 33 · internal anchor

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
cs.CV · 2026-04-29 · unverdicted · none · ref 70 · internal anchor

MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment
cs.LG · 2026-06-27 · unverdicted · none · ref 44 · internal anchor

DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation
cs.CL · 2026-06-27 · unverdicted · none · ref 3 · internal anchor
