MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors at over 88% success rate.
hub
Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
17 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.
HPAA uses typographic manipulations to create text that humans flag as harmful at 86%+ rates while LLM moderation systems detect it below 1% with only three queries.
SentGuard achieves 90.5% detection of unsafe cases within two sentences at 7.41% false positive rate by operating at sentence boundaries during LLM streaming generation.
DSR decomposes harmful intents into benign textual and visual primitives that MLLMs fuse into harmful outputs, achieving high attack success with low input toxicity.
DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.
AP-Test identifies deployed guardrails in LLMs via adversarial prompt testing and a match score metric, reporting perfect accuracy on four open-source guardrails.
Jailbreak attacks suppress Adversarially Compromised Heads in early layers but leave Safety-Aligned Heads active in mid-layers, producing robust harmful features usable for competitive training-free detection.
Yuvion VL is a multimodal LLM family using adversarial-aware data construction, three-stage training, and contrastive fine-tuning that claims industry-leading safety performance on new benchmarks while retaining general capabilities.
SafeLens presents a fast-and-slow video guardrail framework that filters the SafeWatch dataset to 2.4% and adds Chain-of-Thought traces to achieve state-of-the-art moderation performance at reduced inference cost.
WARD is a guard model trained on 177K web samples and adversarially hardened via attacker-guard co-evolution to achieve high recall on prompt injections with low false positives and no added latency.
AgentTrust introduces a runtime interception system for AI agent tool use that achieves 95-97% verdict accuracy on 930 safety scenarios including obfuscated shell payloads.
Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
citing papers explorer
-
Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.
-
Peering Behind the Shield: Guardrail Identification in Large Language Models
AP-Test identifies deployed guardrails in LLMs via adversarial prompt testing and a match score metric, reporting perfect accuracy on four open-source guardrails.
-
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.