Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations

2024 · arXiv 2411.10414

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

MIRAGE: Protecting against Malicious Image Editing via False Moderation

cs.CR · 2026-06-24 · unverdicted · novelty 7.0

MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors with a success rate above 88%.

RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

cs.CR · 2026-06-08 · unverdicted · novelty 6.0

HPAA uses typographic manipulations to create text that humans flag as harmful at rates of 86% or more, while LLM moderation systems detect it at rates below 1%, with only three queries.

SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

SentGuard achieves 90.5% detection of unsafe cases within two sentences at 7.41% false positive rate by operating at sentence boundaries during LLM streaming generation.

Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition

cs.CR · 2026-06-01 · unverdicted · novelty 6.0

DSR decomposes harmful intents into benign textual and visual primitives that MLLMs fuse into harmful outputs, achieving high attack success with low input toxicity.

DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

cs.CR · 2025-12-12 · unverdicted · novelty 6.0

RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.

Peering Behind the Shield: Guardrail Identification in Large Language Models

cs.CR · 2025-02-03 · unverdicted · novelty 6.0

AP-Test identifies deployed guardrails in LLMs via adversarial prompt testing and a match score metric, reporting perfect accuracy on four open-source guardrails.

Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models

cs.CR · 2026-06-26 · unverdicted · novelty 5.0

Jailbreak attacks suppress Adversarially Compromised Heads in early layers but leave Safety-Aligned Heads active in mid-layers, producing robust harmful features usable for competitive training-free detection.

Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety

cs.CV · 2026-06-23 · unverdicted · novelty 5.0

Yuvion VL is a multimodal LLM family using adversarial-aware data construction, three-stage training, and contrastive fine-tuning that claims industry-leading safety performance on new benchmarks while retaining general capabilities.

SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening

cs.CV · 2026-05-17 · unverdicted · novelty 5.0

SafeLens presents a fast-and-slow video guardrail framework that filters the SafeWatch dataset to 2.4% and adds Chain-of-Thought traces to achieve state-of-the-art moderation performance at reduced inference cost.

WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections

cs.CR · 2026-05-14 · unverdicted · novelty 5.0

WARD is a guard model trained on 177K web samples and adversarially hardened via attacker-guard co-evolution to achieve high recall on prompt injections with low false positives and no added latency.

AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

cs.AI · 2026-05-06 · unverdicted · novelty 5.0

AgentTrust introduces a runtime interception system for AI agent tool use that achieves 95-97% verdict accuracy on 930 safety scenarios including obfuscated shell payloads.

Jailbreaking Large Language Models with Morality Attacks

cs.CL · 2026-04-18 · unverdicted · novelty 5.0

Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

cs.RO · 2025-03-05 · unverdicted · novelty 5.0

SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.

Human-Guided Harm Recovery for Computer Use Agents

cs.AI · 2026-04-20

citing papers explorer

Showing 3 of 3 citing papers after filters.

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

cs.CR · 2025-12-12 · unverdicted · none · ref 5

RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.

Peering Behind the Shield: Guardrail Identification in Large Language Models

cs.CR · 2025-02-03 · unverdicted · none · ref 12

AP-Test identifies deployed guardrails in LLMs via adversarial prompt testing and a match score metric, reporting perfect accuracy on four open-source guardrails.

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

cs.RO · 2025-03-05 · unverdicted · none · ref 17

SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
