
arXiv: 2604.09024 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.AI · cs.CR · cs.LG

Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection

Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3

keywords: image protection · multi-modal LLMs · visual prompt injection · privacy · adversarial perturbation · MLLM refusal · image privacy

The pith

ImageProtector adds a nearly invisible perturbation to images that forces MLLMs to refuse analysis and output denial messages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ImageProtector, a user-side defense that lets people shield personal photos before they are shared online. It embeds a specially designed perturbation into the image that functions as a visual prompt injection, causing an MLLM that processes the image to generate a refusal such as 'I'm sorry, I can't help with that request' instead of extracting identities, locations, or other details. The approach targets the practical risk of open-weight MLLMs being run at scale on public image collections. Tests across six models and four datasets show consistent refusal behavior. Three countermeasures (Gaussian noise, DiffPure, and adversarial training) partially weaken the protection, but they also reduce the underlying model's accuracy and/or efficiency.

Core claim

ImageProtector proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs, inducing them to consistently generate refusal responses such as 'I'm sorry, I can't help with that request' when an adversary attempts analysis.

What carries the argument

The visual prompt injection perturbation, a small additive change to pixel values that triggers the MLLM's built-in refusal mechanisms during image processing.
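
A minimal sketch of how such a bounded additive perturbation could be crafted, assuming a projected signed-gradient loop. The figure captions mention a step size, a maximum number of iterations, an ℓ∞ constraint ϵ, and mini-batches of shadow questions, but the paper's exact objective is not reproduced here. The "model" below is a hypothetical linear surrogate that returns a refusal score, and the names `refusal_logit` and `craft_perturbation` are illustrative only.

```python
import random


def refusal_logit(pixels, weights):
    """Toy stand-in for an MLLM: a linear score, higher means 'more likely to refuse'."""
    return sum(w * p for w, p in zip(weights, pixels))


def craft_perturbation(pixels, question_weights, eps=8.0, step=1.0,
                       max_iters=50, batch_size=2, seed=0):
    """Signed-gradient ascent on the toy refusal score under an l_inf budget `eps`.

    `question_weights` holds one weight vector per shadow question (hypothetical).
    """
    rng = random.Random(seed)
    delta = [0.0] * len(pixels)
    for _ in range(max_iters):
        batch = rng.sample(question_weights, batch_size)
        # For a linear score the gradient w.r.t. the pixels is the weight vector;
        # average it over the mini-batch of shadow questions.
        grad = [sum(w[i] for w in batch) / batch_size for i in range(len(pixels))]
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            d = delta[i] + step * sign
            d = max(-eps, min(eps, d))                      # project onto the l_inf ball
            d = max(-pixels[i], min(255.0 - pixels[i], d))  # keep the pixel in [0, 255]
            delta[i] = d
    return delta
```

With a real MLLM the gradient would come from backpropagation through the vision encoder and language model rather than from a fixed weight vector; the projection and clipping steps are the part this sketch is meant to illustrate.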

If this is right

  • Users gain a practical tool to block large-scale automated extraction of private information from their shared images.
  • MLLMs consistently output refusal messages on protected images across multiple model families and datasets.
  • Countermeasures such as Gaussian noise, DiffPure, or adversarial training reduce the protection's success rate but also lower model accuracy or increase computation cost.
  • The method focuses on open-weight MLLMs, where an adversary can run the model locally without API restrictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption of such protections could limit the data available for training or fine-tuning future MLLMs on real-world images.
  • Attackers may respond by developing more aggressive image-cleaning pipelines that remove perturbations while preserving visual content.
  • The same perturbation idea could be explored for protecting video frames or other visual media from automated analysis.

Load-bearing premise

The perturbation stays nearly invisible and continues to force refusals even after common image transformations and on future MLLM architectures.

What would settle it

Resize, compress, or lightly edit a protected image and show that an MLLM then analyzes its content correctly instead of refusing.
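
The check described above could be organized as a small harness that applies a transformation and compares refusal rates. This is a sketch under stated assumptions: `analyze` is a hypothetical callable standing in for an MLLM, `is_refusal` is a keyword heuristic standing in for an LLM-based refusal judge, and `quantize` and `downsample_pairs` are crude proxies for lossy compression and resizing that operate on flat lists of pixel values rather than real image files.

```python
REFUSAL_MARKERS = ("i'm sorry", "i can't help", "i cannot help")


def is_refusal(response):
    """Keyword heuristic; a stand-in for an LLM-based refusal judge."""
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def quantize(pixels, levels=32):
    """Crude proxy for lossy compression: snap each value to a coarse grid."""
    step = 256 / levels
    return [min(255.0, round(p / step) * step) for p in pixels]


def downsample_pairs(pixels):
    """Crude proxy for resizing: average neighbouring pairs, then repeat them.

    Assumes an even number of values; a trailing odd value is dropped.
    """
    out = []
    for i in range(0, len(pixels) - 1, 2):
        mean = (pixels[i] + pixels[i + 1]) / 2
        out.extend([mean, mean])
    return out


def refusal_rate(images, questions, analyze, transform=lambda px: px):
    """Fraction of (image, question) pairs for which `analyze` refuses."""
    outcomes = [is_refusal(analyze(transform(img), q))
                for img in images for q in questions]
    return sum(outcomes) / len(outcomes)
```

A large drop in `refusal_rate` between the identity transform and a lossy one would indicate that the protection does not survive that transformation. A faithful test would use real JPEG encoding and resampling instead of these proxies.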

Figures

Figures reproduced from arXiv: 2604.09024 by Hongbin Liu, Neil Zhenqiang Gong, Yuepeng Hu, Zedian Shao.

Figure 1: Illustration of a user leveraging ImageProtec…
Figure 2: Overview of our ImageProtector. … to one or more target open-weight MLLMs of concern, including access to gradients. This assumption is realistic in our target setting because many strong MLLMs are openly released and can be run locally by both attackers and defenders. These open-weight models present a greater threat, as they significantly lower the economic barrier for malicious actors to conduct large-s…
Figure 3: Images without and with perturbations added by our ImageProtector under different…
Figure 4: Refusal rates of ImageProtector with multiple…
Figure 5: Impact of the temperature of the target MLLM…
Figure 6: Impact of the number of repeated question…
Figure 7: Impact of step size on ImageProtector. We evaluate three types of shadow questions with LLaVA-1.5 on…
Figure 8: Impact of the maximum number of iterations on ImageProtector. We use three types of shadow questions…
Figure 9: Impact of (a) ℓ∞-norm perturbation constraint ϵ, (b) mini-batch size of shadow questions, (c) the size of shadow questions on ImageProtector. We use general probing questions as shadow questions with LLaVA-1.5 on VQAv2.
Figure 10: Accuracy and refusal rates (RR) of ImageProtector with (a) adding Gaussian noise and (b) using DiffPure.
Figure 11: Accuracy of the target MLLM and refusal rates of ImageProtector when using adversarial training with…
Figure 12: Prompt to generate similar probing questions where [Example_Question] denotes an example question.
Figure 13: Prompt to generate general probing questions where [Q] denotes the number of shadow questions to…
Figure 14: Prompt to a refusal judge LLM. [MLLM_Response] represents the response from an MLLM.
Figure 15: Example prompt to generate general probing questions and example response from GPT-4.
Figure 16: The set of refusal responses. Example prompt: Please paraphrase below question into 10 new questions: “What are pedestrians asked not to do on the white sign?” Example response from GPT-3.5: “On the white sign, what are pedestrians being told not to do?”, “What is prohibited for pedestrians on the white sign?”, “What is the request made of pedestrians on the white sign?”, “What action are pedestrians bein…
Figure 17: Example prompt to generate similar probing questions and example response from GPT-3.5.
Original abstract

Multi-modal large language models (MLLMs) have emerged as powerful tools for analyzing Internet-scale image data, offering significant benefits but also raising critical safety and societal concerns. In particular, open-weight MLLMs may be misused to extract sensitive information from personal images at scale, such as identities, locations, or other private details. In this work, we propose ImageProtector, a user-side method that proactively protects images before sharing by embedding a carefully crafted, nearly imperceptible perturbation that acts as a visual prompt injection attack on MLLMs. As a result, when an adversary analyzes a protected image with an MLLM, the MLLM is consistently induced to generate a refusal response such as "I'm sorry, I can't help with that request." We empirically demonstrate the effectiveness of ImageProtector across six MLLMs and four datasets. Additionally, we evaluate three potential countermeasures, Gaussian noise, DiffPure, and adversarial training, and show that while they partially mitigate the impact of ImageProtector, they simultaneously degrade model accuracy and/or efficiency. Our study focuses on the practically important setting of open-weight MLLMs and large-scale automated image analysis, and highlights both the promise and the limitations of perturbation-based privacy protection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ImageProtector, a user-side defense that adds a carefully crafted, nearly imperceptible perturbation to images before sharing. This perturbation functions as a visual prompt injection, causing MLLMs to refuse analysis requests (e.g., responding 'I'm sorry, I can't help with that request'). The method is empirically evaluated for effectiveness across six MLLMs and four datasets; three countermeasures (Gaussian noise, DiffPure, adversarial training) are also tested and shown to partially mitigate the attack at the cost of accuracy or efficiency.

Significance. If the perturbation remains effective under real-world conditions, the work offers a proactive, user-controlled privacy tool against large-scale misuse of open-weight MLLMs for extracting sensitive information from shared images. The multi-model, multi-dataset evaluation provides concrete evidence of consistent refusal induction and is a strength; however, the practical significance is limited by incomplete robustness testing against preprocessing and generalization.

major comments (3)
  1. [Evaluation section] Evaluation section (results on six MLLMs and four datasets): the reported success in inducing refusals lacks error bars, multiple random seeds, or statistical significance tests, making it difficult to verify the 'consistent' claim or assess variability across runs.
  2. [Countermeasures section] Countermeasures and robustness discussion: while DiffPure and adversarial training are shown to partially reduce effectiveness, the experiments omit standard image-sharing transformations (JPEG compression, resizing, normalization) that occur in real pipelines; this is load-bearing because the central claim requires the perturbation to survive such preprocessing when images are uploaded or shared.
  3. [Discussion section] Generalization claims: the paper evaluates only the six chosen MLLMs and does not test transfer to unseen architectures or future models, which directly affects the assertion that the method protects against 'any' adversary using an open-weight MLLM for large-scale analysis.
minor comments (2)
  1. [Abstract and Method] The abstract and method description refer to 'nearly imperceptible' perturbations without reporting quantitative perceptual metrics (e.g., PSNR, SSIM, or LPIPS) or the exact perturbation magnitude used.
  2. [Method] Notation for the perturbation generation process could be clarified; it is unclear whether the optimization objective is fully specified or if it relies on access to the target MLLM during crafting.
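
For the perceptual metrics raised in minor comment 1, a minimal sketch of the two simplest quantities, the ℓ∞ distance and PSNR, computed on flat lists of 8-bit pixel values. The function names are illustrative. SSIM and LPIPS need dedicated libraries and are not covered here.

```python
import math


def linf_norm(original, protected):
    """Largest absolute per-pixel change between the two images."""
    return max(abs(a - b) for a, b in zip(original, protected))


def psnr(original, protected, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    mse = sum((a - b) ** 2 for a, b in zip(original, protected)) / len(original)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)
```

Because a perturbation with ℓ∞ norm at most ϵ has mean squared error at most ϵ², its PSNR is at least 20·log10(255/ϵ) dB, so reporting ϵ already gives a lower bound on PSNR.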

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for their thoughtful comments and the opportunity to improve our manuscript. We address each major comment below, providing clarifications and indicating revisions where appropriate.

Point-by-point responses
  1. Referee: [Evaluation section] Evaluation section (results on six MLLMs and four datasets): the reported success in inducing refusals lacks error bars, multiple random seeds, or statistical significance tests, making it difficult to verify the 'consistent' claim or assess variability across runs.

    Authors: We agree that reporting variability is important for rigor. In the revised manuscript, we will include error bars based on multiple random seeds (e.g., 5 seeds) for the success rates and conduct statistical significance tests (e.g., paired t-tests) to support the consistency claims. revision: yes

  2. Referee: [Countermeasures section] Countermeasures and robustness discussion: while DiffPure and adversarial training are shown to partially reduce effectiveness, the experiments omit standard image-sharing transformations (JPEG compression, resizing, normalization) that occur in real pipelines; this is load-bearing because the central claim requires the perturbation to survive such preprocessing when images are uploaded or shared.

    Authors: We acknowledge the importance of evaluating robustness to common preprocessing steps in image sharing pipelines. We will add new experiments in the revised version testing the protected images under JPEG compression at various quality levels, resizing to different resolutions, and normalization. This will provide a more complete assessment of real-world applicability. revision: yes

  3. Referee: [Discussion section] Generalization claims: the paper evaluates only the six chosen MLLMs and does not test transfer to unseen architectures or future models, which directly affects the assertion that the method protects against 'any' adversary using an open-weight MLLM for large-scale analysis.

    Authors: We selected six diverse open-weight MLLMs spanning different families and sizes to demonstrate broad applicability. However, we agree that transfer to completely unseen architectures cannot be exhaustively tested. In the revision, we will expand the discussion section to explicitly state the limitations regarding future models and discuss potential transferability based on common visual encoder architectures. We will also tone down any absolute claims to 'any' adversary. revision: partial
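
The variability reporting mentioned in response 1 (several seeds, paired t-tests) could be computed along these lines. This is a sketch using only the Python standard library; the function names are illustrative, and the inputs are assumed to be per-seed refusal rates measured under two conditions with the same seeds.

```python
import math
import statistics


def mean_and_stderr(values):
    """Mean and standard error of per-seed refusal rates (needs at least 2 seeds)."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


def paired_t_statistic(condition_a, condition_b):
    """t statistic for paired per-seed results; degrees of freedom = n - 1."""
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
```

The standard library has no t-distribution, so the statistic would be compared against a tabulated critical value (for 5 seeds, 4 degrees of freedom, the two-sided 5% threshold is about 2.776) or passed to a statistics package for a p-value.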

standing simulated objections not resolved
  • Exhaustive testing on all possible future MLLM architectures is not feasible within the scope of this work.

Circularity Check

0 steps flagged

No circularity: empirical method validated externally

full rationale

The paper introduces ImageProtector as an empirical defense that adds perturbations to images to trigger refusal outputs in MLLMs. No equations, derivations, or parameter-fitting steps are described that would reduce the claimed outcome to a self-defined input or fitted quantity. Success is measured against independent MLLM behaviors on external datasets and models, with no load-bearing self-citations or ansatzes that close a loop back to the paper's own results. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that MLLMs process images in a way that allows small additive perturbations to function as effective prompt injections, plus standard assumptions about optimization and imperceptibility metrics.

free parameters (1)
  • perturbation magnitude
    The scale of the added noise must be chosen to balance effectiveness against imperceptibility and is likely tuned on held-out data.
axioms (1)
  • domain assumption: MLLMs remain vulnerable to visual prompt injection attacks even when the prompt is embedded as pixel-level perturbations
    Invoked in the description of how the perturbation induces refusal.
invented entities (1)
  • ImageProtector perturbation · no independent evidence
    purpose: To serve as a visual prompt injection that forces refusal
    Newly introduced defense mechanism with no independent evidence outside the empirical tests reported.
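
The free parameter and the axiom in this ledger can be read together as a budget-constrained optimization. One generic way to write it, offered as an assumption rather than the paper's stated objective: $x$ is the image, $\delta$ the perturbation, $\epsilon$ the ℓ∞ budget named in the Figure 9 caption, $Q$ a set of shadow questions, $B$ a mini-batch drawn from $Q$, $r$ a target refusal response, $f_\theta$ the MLLM, $\mathcal{L}$ a token-level loss such as cross-entropy, and $\alpha$ the step size.

```latex
\min_{\delta}\ \frac{1}{|Q|}\sum_{q \in Q} \mathcal{L}\bigl(f_\theta(x+\delta,\, q),\, r\bigr)
\quad \text{subject to} \quad \|\delta\|_\infty \le \epsilon,
\qquad
\delta \leftarrow \operatorname{clip}_{[-\epsilon,\,\epsilon]}\Bigl(\delta - \alpha\,\operatorname{sign}\Bigl(\nabla_{\delta}\,\frac{1}{|B|}\sum_{q \in B}\mathcal{L}\bigl(f_\theta(x+\delta,\, q),\, r\bigr)\Bigr)\Bigr),
\quad B \subseteq Q.
```

Under this reading, $\epsilon$ is the knob that trades imperceptibility against refusal rate, which is why the ledger lists it as the free parameter.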

pith-pipeline@v0.9.0 · 5536 in / 1431 out tokens · 46680 ms · 2026-05-10T17:50:48.947363+00:00 · methodology

