Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection
6 Pith papers cite this work. Polarity classification is still indexing.
abstract
Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable to adversarial jailbreak attacks that disguise malicious intent through role-play scenarios, fictional framing, and indirect requests. We present Reflect-Guard, a method that augments LLM-based safety classifiers with chain-of-thought self-reflection capabilities through parameter-efficient fine-tuning. Our approach distills analytical reasoning from GPT-4o-mini into structured reflection annotations, then trains Llama-Guard-3-8B via QLoRA to generate logical self-reflections before issuing safety verdicts. Using only 1000 training examples and updating just 0.5% of model parameters (~42M), Reflect-Guard achieves substantial improvements on two challenging benchmarks. On WildGuardTest, F1 score improves from 0.770 to 0.842 (+7.2 pp), with recall on adversarial prompts increasing from 0.513 to 0.921 (+40.8 pp). On JailbreakBench, the attack success rate drops from 10.3% to 1.8%, representing an 82.5% relative reduction. These gains are especially pronounced on adversarial inputs, where the explicit reasoning step enables the model to see through obfuscation techniques that defeat standard pattern-matching approaches. Our results demonstrate that teaching safety classifiers to reason about adversarial intent, rather than simply classify surface patterns, is a promising direction for robust LLM safety.
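The headline figures in the abstract mix two kinds of change: percentage-point gains (F1, adversarial recall) and a relative reduction (attack success rate). The short sketch below recomputes them from the quoted before/after values. The function names are illustrative and do not come from the paper.

```python
def percentage_point_gain(before: float, after: float) -> float:
    """Absolute change between two rates given as fractions, in percentage points."""
    return (after - before) * 100.0


def relative_reduction(before: float, after: float) -> float:
    """Drop relative to the starting value, expressed as a percentage."""
    if before == 0:
        raise ValueError("relative reduction is undefined when the baseline is zero")
    return (before - after) / before * 100.0


# Before/after values as quoted in the abstract.
f1_gain = percentage_point_gain(0.770, 0.842)       # WildGuardTest F1
recall_gain = percentage_point_gain(0.513, 0.921)   # recall on adversarial prompts
asr_drop = relative_reduction(10.3, 1.8)            # JailbreakBench ASR, in percent

print(f"F1 gain: {f1_gain:.1f} pp")
print(f"Adversarial recall gain: {recall_gain:.1f} pp")
print(f"ASR relative reduction: {asr_drop:.1f}%")
```

The two measures are not interchangeable: the attack success rate falls by 8.5 percentage points in absolute terms, which corresponds to the 82.5% relative reduction reported.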
years
2026: 6
verdicts
UNVERDICTED: 6
citing papers explorer
- Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG
  FORCEBENCH shows model judges often violate expected ordering on evidence-calibrated vs force-raised claim pairs, with standard support prompting yielding 47.2% MVR and explicit warrant prompting reducing it to 24.5%.
- Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation
  GPT-4o and Claude Sonnet 4 show similar susceptibility to bias on GSM8K (1.3% vs 1.2%) but differ sharply in acknowledgment rates (13% vs 75%) under a rubric-defined metric.
- When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems
  About 18.2% of structurally flagged skill pairs represent genuine compositional safety risks in agent skill registries, with exploitation gated by host model behavior.
- MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
  MedSynapse-V proposes meta-query prior memorization, causal counterfactual refinement via RL, and dual-branch memory transition to evolve implicit diagnostic memories in medical VLMs and boost accuracy over chain-of-thought baselines.
- MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment
  MOSAIC combines frozen-LLM semantic embeddings with hierarchical consistency objectives to report up to 3.4% AUC gains on knowledge-tracing benchmarks including a new MOOC dataset.
- DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation
  DriftGuard introduces multi-monitor safety-aware drift detection paired with hard-mix selective adaptation, reporting toxic recall gains to 0.8777 on Civil Comments and 0.8523 on DynaHate under temporal and cross-dataset shifts.