Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection
Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3
The pith
ImageProtector adds a nearly invisible perturbation to images that forces MLLMs to refuse analysis and output denial messages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ImageProtector proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs, inducing them to consistently generate refusal responses such as 'I'm sorry, I can't help with that request' when an adversary attempts analysis.
What carries the argument
The visual prompt injection perturbation, a small additive change to pixel values that triggers the MLLM's built-in refusal mechanisms during image processing.
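A minimal sketch of what "a small additive change to pixel values" can look like, assuming an L-infinity budget `epsilon` on the per-pixel change and pixel values scaled to [0, 1]. The name `apply_perturbation`, the value 8/255, and the hand-written `delta` are illustrative assumptions. The source states neither the perturbation magnitude nor how ImageProtector optimizes the perturbation, so this shows only the final "add and clip" step.

```python
from typing import List

Image = List[List[float]]  # grayscale image, pixel values in [0.0, 1.0]


def apply_perturbation(image: Image, delta: Image, epsilon: float) -> Image:
    """Add delta to image, bounding each change to +/- epsilon and each pixel to [0, 1]."""
    protected = []
    for image_row, delta_row in zip(image, delta):
        row = []
        for pixel, change in zip(image_row, delta_row):
            # Enforce the (assumed) L-infinity budget on the change.
            bounded = max(-epsilon, min(epsilon, change))
            # Keep the result a valid pixel value.
            row.append(max(0.0, min(1.0, pixel + bounded)))
        protected.append(row)
    return protected


# Hypothetical inputs: a 2x2 image and a hand-written delta (not an optimized one).
image = [[0.0, 0.5], [0.9, 1.0]]
delta = [[-0.2, 0.01], [0.2, 0.02]]
protected = apply_perturbation(image, delta, epsilon=8 / 255)
```

In a real pipeline the perturbation would come from an optimization against one or more MLLMs; only the bounding and clipping shown here are generic.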
If this is right
- Users gain a practical tool to block large-scale automated extraction of private information from their shared images.
- MLLMs consistently output refusal messages on protected images across the six MLLMs and four datasets evaluated.
- Countermeasures such as Gaussian noise, DiffPure, or adversarial training reduce the protection's success rate but also lower model accuracy or increase computation cost.
- The method focuses on open-weight MLLMs, where an adversary can run the model locally without API restrictions.
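The Gaussian-noise countermeasure named in the list above can be sketched as follows, assuming the adversary adds zero-mean noise to every pixel before passing the image to the model. The function name `add_gaussian_noise`, the noise level `sigma`, and the seed handling are assumptions for illustration; the source does not give the noise level the paper uses.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image, pixel values in [0.0, 1.0]


def add_gaussian_noise(image: Image, sigma: float, seed: int = 0) -> Image:
    """Return a copy of image with zero-mean Gaussian noise added, clipped to [0, 1]."""
    rng = random.Random(seed)  # local generator so results are reproducible
    return [
        [max(0.0, min(1.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


# Hypothetical usage: larger sigma disturbs the perturbation more,
# but also degrades the content the model is supposed to analyze.
noisy = add_gaussian_noise([[0.0, 0.5], [1.0, 0.25]], sigma=0.05, seed=1)
```

This makes the trade-off in the list concrete: the same noise that may wash out the protective perturbation also corrupts the image content, which is one way model accuracy can drop.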
Where Pith is reading between the lines
- Widespread adoption of such protections could limit the data available for training or fine-tuning future MLLMs on real-world images.
- Attackers may respond by developing more aggressive image-cleaning pipelines that remove perturbations while preserving visual content.
- The same perturbation idea could be explored for protecting video frames or other visual media from automated analysis.
Load-bearing premise
The perturbation stays nearly invisible and continues to force refusals even after common image transformations and on future MLLM architectures.
What would settle it
Resize, compress, or lightly edit a protected image and show that an MLLM then analyzes its content correctly instead of refusing.
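The check described above could be organized roughly like this. Everything here is a hedged sketch: `is_refusal` is an assumed keyword heuristic, `quantize` is a crude stand-in for lossy compression (not JPEG), and `stub_model` is a toy replacement for a real MLLM call that exists only to make the harness runnable.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale image, pixel values in [0.0, 1.0]

# Assumed heuristic markers; a real evaluation would need a more careful detector.
REFUSAL_MARKERS = ("i'm sorry", "i can't help", "i cannot help")


def is_refusal(response: str) -> bool:
    """Return True if the response contains one of the assumed refusal phrases."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def quantize(image: Image, levels: int) -> Image:
    """Snap each pixel to one of `levels` evenly spaced values (toy lossy transform)."""
    return [[round(p * (levels - 1)) / (levels - 1) for p in row] for row in image]


def refusal_rate(
    model: Callable[[Image], str],
    images: List[Image],
    transform: Callable[[Image], Image],
) -> float:
    """Fraction of transformed images on which the model's response is a refusal."""
    responses = [model(transform(image)) for image in images]
    return sum(is_refusal(r) for r in responses) / len(responses)


def stub_model(image: Image) -> str:
    # Hypothetical stand-in for an MLLM: it "refuses" only while the top-left
    # pixel still carries the small offset that plays the role of the perturbation.
    if abs(image[0][0] - 0.52) < 1e-9:
        return "I'm sorry, I can't help with that request."
    return "The image shows a street scene."


protected_images = [[[0.52, 0.4], [0.3, 0.8]]]  # hypothetical protected image
rate_untouched = refusal_rate(stub_model, protected_images, lambda img: img)
rate_quantized = refusal_rate(stub_model, protected_images, lambda img: quantize(img, 4))
```

If the refusal rate on transformed images falls well below the rate on untouched ones for a real model, the load-bearing premise fails for that transformation.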
Original abstract
Multi-modal large language models (MLLMs) have emerged as powerful tools for analyzing Internet-scale image data, offering significant benefits but also raising critical safety and societal concerns. In particular, open-weight MLLMs may be misused to extract sensitive information from personal images at scale, such as identities, locations, or other private details. In this work, we propose ImageProtector, a user-side method that proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs. As a result, when an adversary analyzes a protected image with an MLLM, the MLLM is consistently induced to generate a refusal response such as "I'm sorry, I can't help with that request." We empirically demonstrate the effectiveness of ImageProtector across six MLLMs and four datasets. Additionally, we evaluate three potential countermeasures, Gaussian noise, DiffPure, and adversarial training, and show that while they partially mitigate the impact of ImageProtector, they simultaneously degrade model accuracy and/or efficiency. Our study focuses on the practically important setting of open-weight MLLMs and large-scale automated image analysis, and highlights both the promise and the limitations of perturbation-based privacy protection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ImageProtector, a user-side defense that adds a carefully crafted, nearly imperceptible perturbation to images before sharing. This perturbation functions as a visual prompt injection, causing MLLMs to refuse analysis requests (e.g., responding 'I'm sorry, I can't help with that request'). The method is empirically evaluated for effectiveness across six MLLMs and four datasets; three countermeasures (Gaussian noise, DiffPure, adversarial training) are also tested and shown to partially mitigate the attack at the cost of accuracy or efficiency.
Significance. If the perturbation remains effective under real-world conditions, the work offers a proactive, user-controlled privacy tool against large-scale misuse of open-weight MLLMs for extracting sensitive information from shared images. The multi-model, multi-dataset evaluation provides concrete evidence of consistent refusal induction and is a strength; however, the practical significance is limited because robustness to preprocessing and generalization to unseen models are not fully tested.
major comments (3)
- [Evaluation section] Results on six MLLMs and four datasets: the reported success in inducing refusals lacks error bars, multiple random seeds, or statistical significance tests, making it difficult to verify the 'consistent' claim or assess variability across runs.
- [Countermeasures section] Countermeasures and robustness discussion: while DiffPure and adversarial training are shown to partially reduce effectiveness, the experiments omit standard image-sharing transformations (JPEG compression, resizing, normalization) that occur in real pipelines; this is load-bearing because the central claim requires the perturbation to survive such preprocessing when images are uploaded or shared.
- [Discussion section] Generalization claims: the paper evaluates only the six chosen MLLMs and does not test transfer to unseen architectures or future models, which directly affects the assertion that the method protects against 'any' adversary using an open-weight MLLM for large-scale analysis.
minor comments (2)
- [Abstract and Method] The abstract and method description refer to 'nearly imperceptible' perturbations without reporting quantitative perceptual metrics (e.g., PSNR, SSIM, or LPIPS) or the exact perturbation magnitude used.
- [Method] Notation for the perturbation generation process could be clarified; it is unclear whether the optimization objective is fully specified or if it relies on access to the target MLLM during crafting.
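Of the perceptual metrics named in the first minor comment, PSNR is simple enough to sketch directly. The version below assumes grayscale images stored as nested lists with values in [0, 1]; the function name `psnr` and the `max_value` parameter are illustrative choices. SSIM and LPIPS need dedicated libraries and are not shown.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image, pixel values in [0.0, max_value]


def psnr(original: Image, protected: Image, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means the two images are closer."""
    squared_errors = [
        (a - b) ** 2
        for row_a, row_b in zip(original, protected)
        for a, b in zip(row_a, row_b)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

Reporting a value like this alongside the perturbation budget would let readers judge the "nearly imperceptible" claim quantitatively.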
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments and the opportunity to improve our manuscript. We address each major comment below, providing clarifications and indicating revisions where appropriate.
Point-by-point responses
- Referee: [Evaluation section] Results on six MLLMs and four datasets: the reported success in inducing refusals lacks error bars, multiple random seeds, or statistical significance tests, making it difficult to verify the 'consistent' claim or assess variability across runs.
  Authors: We agree that reporting variability is important for rigor. In the revised manuscript, we will include error bars based on multiple random seeds (e.g., 5 seeds) for the success rates and conduct statistical significance tests (e.g., paired t-tests) to support the consistency claims.
  Revision: yes
- Referee: [Countermeasures section] Countermeasures and robustness discussion: while DiffPure and adversarial training are shown to partially reduce effectiveness, the experiments omit standard image-sharing transformations (JPEG compression, resizing, normalization) that occur in real pipelines; this is load-bearing because the central claim requires the perturbation to survive such preprocessing when images are uploaded or shared.
  Authors: We acknowledge the importance of evaluating robustness to common preprocessing steps in image sharing pipelines. We will add new experiments in the revised version testing the protected images under JPEG compression at various quality levels, resizing to different resolutions, and normalization. This will provide a more complete assessment of real-world applicability.
  Revision: yes
- Referee: [Discussion section] Generalization claims: the paper evaluates only the six chosen MLLMs and does not test transfer to unseen architectures or future models, which directly affects the assertion that the method protects against 'any' adversary using an open-weight MLLM for large-scale analysis.
  Authors: We selected six diverse open-weight MLLMs spanning different families and sizes to demonstrate broad applicability. However, we agree that transfer to completely unseen architectures cannot be exhaustively tested. In the revision, we will expand the discussion section to explicitly state the limitations regarding future models and discuss potential transferability based on common visual encoder architectures. We will also tone down absolute claims about protecting against 'any' adversary.
  Revision: partial
- Exhaustive testing on all possible future MLLM architectures is not feasible within the scope of this work.
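The seed-based reporting promised in the first response could be computed along these lines. This is a sketch under stated assumptions: the function names are illustrative, and every number in the usage lines is a made-up placeholder, not a result from the paper. Only the t statistic and its degrees of freedom are computed, because the Python standard library has no t-distribution CDF for the p-value.

```python
import math
import statistics
from typing import List, Tuple


def mean_and_std(rates: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed success rates (for error bars)."""
    return statistics.mean(rates), statistics.stdev(rates)


def paired_t_statistic(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed rates under two conditions."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 seeds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-seed refusal rates (5 seeds), purely illustrative.
condition_a = [0.90, 0.92, 0.88, 0.91, 0.89]  # e.g. protected images, no countermeasure
condition_b = [0.50, 0.55, 0.45, 0.52, 0.48]  # e.g. protected images, with a countermeasure
mean_a, std_a = mean_and_std(condition_a)
t_stat, dof = paired_t_statistic(condition_a, condition_b)
```

The p-value can then be read from a t table or computed with a statistics package using `t_stat` and `dof`.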
Circularity Check
No circularity: empirical method validated externally
The paper introduces ImageProtector as an empirical defense that adds perturbations to images to trigger refusal outputs in MLLMs. No equations, derivations, or parameter-fitting steps are described that would reduce the claimed outcome to a self-defined input or fitted quantity. Success is measured against independent MLLM behaviors on external datasets and models, with no load-bearing self-citations or ansatzes that close a loop back to the paper's own results. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- perturbation magnitude
axioms (1)
- domain assumption: MLLMs remain vulnerable to visual prompt injection attacks even when the prompt is embedded as pixel-level perturbations
invented entities (1)
- ImageProtector perturbation: no independent evidence