Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs

Ben Nassi; Eugene Bagdasaryan; Tsung-Yin Hsieh; Vitaly Shmatikov

arxiv: 2307.10490 · v4 · pith:UUIURUHMnew · submitted 2023-07-19 · 💻 cs.CR · cs.AI · cs.CL · cs.LG

keywords instruction · attacker · audio · image · images · indirect · injection · llms

We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker's instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVa and PandaGPT.
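
The mechanism described in the abstract, a bounded perturbation optimized so that the model's output moves toward attacker-chosen text, can be illustrated on a toy model. The sketch below is not the authors' implementation and does not involve LLaVa or PandaGPT. It assumes a hypothetical linear softmax "model" `W` over five tokens and a 64-value input `x`, and it uses iterative signed-gradient steps under an L-infinity bound `eps`, a standard adversarial-example technique chosen here only for illustration. All names (`perturb_toward_target`, `target_loss`, `eps`, `alpha`) are assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_of(W, x):
    """Toy stand-in for a model: one linear layer, one logit per token."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def target_loss(W, x, target):
    """Cross-entropy of the toy model's output against the chosen token."""
    return -math.log(softmax(logits_of(W, x))[target] + 1e-12)


def perturb_toward_target(W, x, target, eps=0.1, alpha=0.01, steps=100):
    """Search the L-infinity ball of radius eps around x for a lower target loss."""
    x_adv = list(x)
    best_x, best_loss = list(x), target_loss(W, x, target)
    for _ in range(steps):
        p = softmax(logits_of(W, x_adv))
        # Gradient of the cross-entropy w.r.t. the input is W^T (p - onehot).
        residual = [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(p)]
        grad = [sum(residual[k] * W[k][i] for k in range(len(W)))
                for i in range(len(x))]
        for i, g in enumerate(grad):
            step = alpha if g > 0 else -alpha if g < 0 else 0.0
            v = x_adv[i] - step
            v = min(max(v, x[i] - eps), x[i] + eps)  # stay within eps of x
            x_adv[i] = min(max(v, 0.0), 1.0)         # stay a valid input value
        loss = target_loss(W, x_adv, target)
        if loss < best_loss:
            best_x, best_loss = list(x_adv), loss
    return best_x


# Hypothetical demo data: random weights and a random "benign" input in [0, 1].
random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(5)]
x = [random.random() for _ in range(64)]
x_adv = perturb_toward_target(W, x, target=3)
print(target_loss(W, x, 3), target_loss(W, x_adv, 3))
```

Because the search keeps the best candidate it has seen, the target loss of `x_adv` never ends above its starting value. Whether the target token actually becomes the most likely output depends on `eps` and on the random weights.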

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Cross-Modal Backdoors in Multimodal Large Language Models
cs.CR 2026-05 unverdicted novelty 8.0

Poisoning a single connector in MLLMs establishes a reusable latent backdoor pathway that transfers across modalities with over 95% attack success rate under bounded perturbations.

From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems
cs.CR 2026-04 unverdicted novelty 8.0

A unified threat model for LLM-enabled robots reveals three cross-boundary attack chains from user input to unsafe physical actuation due to missing validations and unmediated crossings.

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents
cs.CR 2026-04 unverdicted novelty 8.0

NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.

HLL: Can Agents Cross Humanity's Last Line of Verification?
cs.AI 2026-06 unverdicted novelty 7.0

HLL is a new benchmark that evaluates eight frontier multimodal agents on closed-loop interactive CAPTCHA solving, showing sharp performance drops under realism stressors and trace validation.

Hallucination as Exploit: Evidence-Carrying Multimodal Agents
cs.AI 2026-05 unverdicted novelty 7.0

Evidence-carrying multimodal agents decompose tool calls into predicates verified by constrained DOM/OCR/AX checkers to block hallucination-enabled unsafe actions.

Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents
cs.CR 2026-04 unverdicted novelty 7.0

Desktop GUI agents face TOCTOU attacks from UI state changes during the ~6.5s observation-to-action gap, with a three-layer pre-execution verification defense achieving 100% interception on two attack types but failin...

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
cs.CR 2026-04 unverdicted novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

The Self-Correction Illusion: LLMs Correct Others but Not Themselves
cs.AI 2026-06 conditional novelty 6.0

Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-templat...

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models
cs.CL 2026-06 unverdicted novelty 6.0

Adversarial images transfer across languages in MLLMs while apparent safety in weaker languages stems from comprehension and visual-grounding failures rather than genuine alignment.

The Surface You Test Is Not the Surface That Breaks
cs.CR 2026-05 unverdicted novelty 6.0

Prompt injection vulnerability in tool-augmented LLMs is a model-surface interaction rather than a fixed channel property; the same payload inverts success rates across models, and adaptive attack rate exceeds single-...

Hallucination as Exploit: Evidence-Carrying Multimodal Agents
cs.AI 2026-05 unverdicted novelty 6.0

Evidence-carrying multimodal agents decompose tool calls into predicates, obtain certificates from DOM/OCR/AX verifiers, and use a deterministic gate to authorize actions only when certificates support them, achieving...

VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
cs.CR 2026-05 conditional novelty 6.0

Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.

Semantic Denial of Service in LLM-controlled robots
cs.CR 2026-04 unverdicted novelty 6.0

Injecting brief safety-plausible phrases into robot audio triggers LLM safety halts, enabling semantic denial-of-service attacks where prompt defenses trade attack suppression for impaired genuine hazard detection.

MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks
cs.CR 2026-04 unverdicted novelty 6.0

MCP Pitfall Lab operationalizes six pitfall classes across tool-metadata poisoning, puppet servers, and multimodal chains, showing that recommended hardening removes all Tier-1 static findings and that agent narrative...

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
cs.CR 2025-12 unverdicted novelty 6.0

RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.

RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion
cs.CV 2025-03 unverdicted novelty 6.0

RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA w...

Whispers in the Machine: Confidentiality in Agentic Systems
cs.CR 2024-02 unverdicted novelty 6.0

Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.

Laundering AI Authority with Adversarial Examples
cs.CR 2026-05 unverdicted novelty 5.0

Adversarial examples enable AI authority laundering by causing production VLMs to give authoritative but wrong responses on subtly perturbed images, with success rates of 22-100% using decade-old attack methods.

Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
cs.AI 2025-10 unverdicted novelty 4.0

A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.

AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions
cs.AI 2024-08 unverdicted novelty 4.0

The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
cs.CR 2025-02 unverdicted novelty 2.0

A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.