hub

Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations

Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, Mahesh Pasupuleti · 2024 · arXiv 2411.10414

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.

DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

cs.CR · 2025-12-12 · unverdicted · novelty 6.0

RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.

Peering Behind the Shield: Guardrail Identification in Large Language Models

cs.CR · 2025-02-03 · unverdicted · novelty 6.0

AP-Test identifies deployed guardrails in LLMs via adversarial prompt testing and a match score metric, reporting perfect accuracy on four open-source guardrails.

SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening

cs.CV · 2026-05-17 · unverdicted · novelty 5.0

SafeLens presents a fast-and-slow video guardrail framework that filters the SafeWatch dataset to 2.4% and adds Chain-of-Thought traces to achieve state-of-the-art moderation performance at reduced inference cost.

AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

cs.AI · 2026-05-06 · unverdicted · novelty 5.0

AgentTrust introduces a runtime interception system for AI agent tool use that achieves 95-97% verdict accuracy on 930 safety scenarios including obfuscated shell payloads.

Jailbreaking Large Language Models with Morality Attacks

cs.CL · 2026-04-18 · unverdicted · novelty 5.0

Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

cs.RO · 2025-03-05 · unverdicted · novelty 5.0

SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.

Human-Guided Harm Recovery for Computer Use Agents

cs.AI · 2026-04-20

citing papers explorer

Showing 10 of 10 citing papers.

RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation cs.CV · 2026-03-28 · unverdicted · none · ref 52
RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.
DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs cs.CR · 2026-05-18 · unverdicted · none · ref 42
DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 14
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring cs.CR · 2025-12-12 · unverdicted · none · ref 5
RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.
Peering Behind the Shield: Guardrail Identification in Large Language Models cs.CR · 2025-02-03 · unverdicted · none · ref 12
AP-Test identifies deployed guardrails in LLMs via adversarial prompt testing and a match score metric, reporting perfect accuracy on four open-source guardrails.
SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening cs.CV · 2026-05-17 · unverdicted · none · ref 20
SafeLens presents a fast-and-slow video guardrail framework that filters the SafeWatch dataset to 2.4% and adds Chain-of-Thought traces to achieve state-of-the-art moderation performance at reduced inference cost.
AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use cs.AI · 2026-05-06 · unverdicted · none · ref 6
AgentTrust introduces a runtime interception system for AI agent tool use that achieves 95-97% verdict accuracy on 930 safety scenarios including obfuscated shell payloads.
Jailbreaking Large Language Models with Morality Attacks cs.CL · 2026-04-18 · unverdicted · none · ref 1
Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning cs.RO · 2025-03-05 · unverdicted · none · ref 17
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
Human-Guided Harm Recovery for Computer Use Agents cs.AI · 2026-04-20 · unreviewed · ref 27

Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer