arxiv: 2606.18710 · v1 · pith:WP3HSQ3Vnew · submitted 2026-06-17 · 💻 cs.CR

Image Prompt Reconstruction Attacks on Distributed MLLM Inference Frameworks

Pith reviewed 2026-06-26 20:42 UTC · model grok-4.3

classification 💻 cs.CR
keywords image reconstruction attacks · distributed MLLM · privacy leakage · intermediate embeddings · multimodal models · black-box attacks · MPAA · IEDA

The pith

Intermediate embeddings in distributed MLLM inference frameworks leak enough information to reconstruct input images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines privacy risks to image prompts in distributed multimodal LLM inference, where models run across multiple consumer devices. It demonstrates that by extracting image-specific embeddings from mixed representations, two new attacks can recover either fine pixel details or semantic content from the transmitted data. A sympathetic reader would care because distributed inference is intended to make large models accessible without powerful hardware, but this approach may expose users' private images to other participants in the network. The work evaluates the attacks on four major MLLM families and finds consistent success, highlighting a previously unexplored vulnerability.

Core claim

The central discovery is that an image embedding extraction algorithm can isolate visual information with 100% accuracy in nearly all layers of MLLMs, enabling two passive black-box attacks—MPAA for patch-wise pixel reconstruction and IEDA for diffusion-based semantic reconstruction—from intermediate embeddings shared among participants.

What carries the argument

The image embedding extraction algorithm carries the argument: it is the prerequisite for separating image information from intertwined text-image embeddings across MLLM layers.

Load-bearing premise

The intermediate embeddings transmitted in distributed MLLM frameworks contain sufficient recoverable information about the input image to allow reconstruction by a passive participant.
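
To make the premise concrete, here is a toy sketch of what a passive middle participant receives in a pipeline-style split. The layer function, the three-way split, and every name in it are hypothetical stand-ins for illustration; none of it is taken from the frameworks or models the paper evaluates.

```python
# Toy pipeline split: each participant runs a contiguous block of layers.
# All functions and values are hypothetical stand-ins, not a real framework.

def toy_layer(hidden, k):
    # Stand-in for a transformer layer: a simple per-element affine map.
    return [[0.5 * x + k for x in token] for token in hidden]

def run_pipeline(hidden, layer_ranges, observer):
    """Run all layers in order; record what the observing participant receives."""
    observed = None
    for idx, layers in enumerate(layer_ranges):
        if idx == observer:
            # A passive participant only copies its input; it changes nothing.
            observed = [list(token) for token in hidden]
        for k in layers:
            hidden = toy_layer(hidden, k)
    return hidden, observed

# Three participants; the second one (index 1) is the passive observer.
embeddings = [[1.0, 2.0], [3.0, 4.0]]  # 2 tokens, hidden width 2
final, seen = run_pipeline(
    embeddings, [range(0, 2), range(2, 4), range(4, 6)], observer=1
)
```

The observer never touches the computation, yet `seen` holds one vector per input token. The reconstruction attacks start from exactly this kind of per-token data.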

What would settle it

A test in which the image embedding extraction algorithm fails to achieve high accuracy or the reconstruction attacks produce outputs no better than random guesses on real distributed MLLM runs.
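
One hedged way to phrase that test in code is to compare the ℓ1 loss of a reconstruction against the average ℓ1 loss of uniformly random images. The sketch below uses made-up data and hypothetical helper names; the paper's actual evaluation protocol is not reproduced here.

```python
import random

def l1_loss(a, b):
    """Mean absolute difference between two equally sized pixel lists."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def random_baseline(truth, trials=200, seed=0):
    """Average l1 loss of uniformly random guesses against the ground truth."""
    rng = random.Random(seed)
    losses = [
        l1_loss(truth, [rng.random() for _ in truth]) for _ in range(trials)
    ]
    return sum(losses) / trials

# Hypothetical 8x8 grayscale image with pixel values in [0, 1].
truth = [i / 63 for i in range(64)]
# Hypothetical reconstruction that is slightly too bright everywhere.
recon = [min(1.0, p + 0.02) for p in truth]

beats_random = l1_loss(truth, recon) < random_baseline(truth)
```

If `beats_random` were false on real distributed runs, the attack would be adding nothing over guessing.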

Figures

Figures reproduced from arXiv: 2606.18710 by Hongyan Chang, Jianxin Wei, Meikang Qiu, Ting Yu, Xiaofeng Gao, Xinjian Luo, Xue Liu, Yuncheng Wu.

Figure 1: The overview of distributed MLLM inference frameworks.
Figure 2: A formatted prompt example for Gemma 3. [Extracted panel labels: (a) In Model Design; (b) In Practice]
Figure 3: The information flow from tokens to embeddings.
Figure 4: Embedding differences caused by masked patches.
Figure 7: The ℓ1 loss (↓) of MPAA ((a)–(b)) and the CSS score (↑) of MPAA ((e)–(h)) and IEDA ((i)–(l)) across different MLLMs and datasets. Additional results are provided in …
Figure 8: Example images reconstructed by different methods. In each block, the five images from left to right indicate …
Figure 9: Reconstructed examples from layer 17 of Gemma
Figure 10: Reconstructed examples from layer 14 of Qwen
Figure 11: ℓ1 losses for attack transferability by training (x-axis) and test (y-axis) datasets.
Figure 12: Text and image distributions in Gemma 3.
Figure 13: Example images of the ablation study. [Extracted panel labels: Source Image (Plane); Target Image (Car); Perturbed Image (Target: Car)]
Figure 15: Reconstructed examples under different output …
Figure 16: The SSIM score (↑) of MPAA ((a)–(b)) across different MLLMs and datasets. It is important to note that we omit ℓ1 loss and SSIM results for the semantic attack (IEDA), as these pixel-level metrics cannot accurately evaluate semantic reconstruction quality. [Extracted panel labels: Gemma 3, Phi 4 Multimodal, Qwen 2.5 VL, Llama 4 Scout; CIFAR10, CIFAR100; IEDA, MPAA, SDAR, PCAT]
Figure 17: Example images reconstructed by different methods. In each block, the five images from left to right …
Figure 18: Example images reconstructed via MPAA from different datasets. In each block, the six images from left to right indicate the ground truth image and results from the cutting layers {0, ⌈L/4⌉, ⌈2L/4⌉, ⌈3L/4⌉, L}.
Figure 19: Example images reconstructed via IEDA from different datasets. In each block, the six images from left to right indicate the ground truth image and results from the cutting layers {0, ⌈L/4⌉, ⌈2L/4⌉, ⌈3L/4⌉, L}.
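
The last two captions list five cutting layers for a model with L layers: 0, ⌈L/4⌉, ⌈2L/4⌉, ⌈3L/4⌉, and L. As a small worked example, the sketch below computes them with integer arithmetic. The function name and the sample value of L are illustrative assumptions, not values from the paper.

```python
def cutting_layers(num_layers):
    """Return [0, ceil(L/4), ceil(2L/4), ceil(3L/4), L] for a model with L layers."""
    # (i * L + 3) // 4 is an integer-only way to compute ceil(i * L / 4).
    return [(i * num_layers + 3) // 4 for i in range(5)]

# Hypothetical 30-layer model.
layers = cutting_layers(30)
```

For a depth that is not a multiple of 4, the ceiling rounds the intermediate cut points up, so the five layers are not evenly spaced.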
Abstract

Distributed large language model (LLM) inference frameworks connect isolated consumer-grade devices for large-scale model inference, substantially reducing hardware constraints. However, recent studies show that intermediate embeddings transmitted among participants can leak private prompts. As LLMs evolve into multimodal LLMs (MLLMs), this risk extends beyond text: image prompts contain rich visual and semantic information, making their intermediate embeddings highly privacy-sensitive. Yet, image-prompt leakage in distributed MLLM inference remains largely unexplored. In this paper, we investigate privacy risks to input images caused by intermediate embeddings in distributed MLLM frameworks. We first analyze the information flow from image pixels to intermediate representations. Since image and text embeddings are often intertwined across MLLM layers, we design an image embedding extraction algorithm as a prerequisite for reconstruction attacks, achieving 100% extraction accuracy across almost all MLLM layers in our experiments. Building on this, we develop two passive black-box image reconstruction attacks, MPAA and IEDA, reflecting realistic threats from normal participants with limited knowledge and capability. MPAA performs fine-grained pixel-level reconstruction via patch-wise information extraction and assembly, while IEDA performs coarse-grained semantic reconstruction through embedding-guided diffusion generation. We evaluate our attacks on four representative MLLM families: Gemma 3, Phi 4 Multimodal, Qwen 2.5 VL, and Llama 4 Scout. Results show consistently superior reconstruction performance in various settings. We further analyze the effects of MoE architecture, image preprocessing, model size, and text-image dependency on attack performance. To our knowledge, this is the first study of image reconstruction attacks on MLLMs.
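
The abstract describes MPAA as patch-wise information extraction followed by assembly. The sketch below illustrates only the assembly half, under two assumptions that are not confirmed by the source: each patch has already been decoded into a small pixel grid, and patches arrive in row-major order. The function name is hypothetical.

```python
def assemble_patches(patches, grid_rows, grid_cols):
    """Stitch row-major patches (each a list of pixel rows) into one image."""
    assert len(patches) == grid_rows * grid_cols
    patch_h = len(patches[0])
    image = []
    for r in range(grid_rows):
        for y in range(patch_h):
            row = []
            for c in range(grid_cols):
                # Append row y of the patch at grid position (r, c).
                row.extend(patches[r * grid_cols + c][y])
            image.append(row)
    return image

# Four hypothetical 2x2 patches, each filled with a single value.
patches = [[[v, v], [v, v]] for v in (1, 2, 3, 4)]
image = assemble_patches(patches, grid_rows=2, grid_cols=2)
```

Because each patch is placed independently, an error in one decoded patch stays local to that region of the assembled image.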

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript investigates privacy risks to image prompts in distributed MLLM inference frameworks arising from transmitted intermediate embeddings. It introduces an image embedding extraction algorithm claimed to achieve 100% accuracy across almost all layers of four MLLM families (Gemma 3, Phi 4 Multimodal, Qwen 2.5 VL, Llama 4 Scout), then builds two passive black-box attacks—MPAA for fine-grained pixel-level reconstruction via patch-wise extraction and IEDA for coarse-grained semantic reconstruction via embedding-guided diffusion—reporting superior performance over baselines while analyzing effects of MoE, preprocessing, model size, and text-image dependency.

Significance. If the extraction algorithm and attacks can be realized strictly within the stated passive black-box threat model (normal participants with limited knowledge and capability), the work would constitute the first systematic study of image-prompt reconstruction in distributed MLLMs and would usefully extend prior text-prompt leakage results to the multimodal setting, with direct implications for the security of consumer-grade distributed inference systems.

major comments (2)
  1. [Abstract] Abstract: the central prerequisite claim of an image embedding extraction algorithm achieving 100% accuracy 'across almost all MLLM layers' is load-bearing for both MPAA and IEDA, yet the abstract (and by extension the methods) provides no description of experimental controls, layer-selection criteria, or how image-specific embeddings are isolated from intertwined text embeddings using only the transmitted tensors available to a normal participant.
  2. [Threat Model / Extraction Algorithm] Threat model and extraction algorithm description: the assumption that a passive participant with 'limited knowledge and capability' can reliably separate image embeddings without per-family architecture analysis, forward-pass simulation, or model-specific knowledge is not shown to hold; if the extraction procedure requires such knowledge, the 100% accuracy figure and the downstream reconstruction claims cannot be realized under the stated attacker constraints.
minor comments (1)
  1. The abstract states that effects of MoE architecture, image preprocessing, model size, and text-image dependency were analyzed, but these results are not cross-referenced to specific sections, tables, or figures, reducing readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for highlighting the importance of clearly documenting the extraction algorithm's assumptions. We address each major comment below with clarifications drawn from the manuscript and indicate where revisions will strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central prerequisite claim of an image embedding extraction algorithm achieving 100% accuracy 'across almost all MLLM layers' is load-bearing for both MPAA and IEDA, yet the abstract (and by extension the methods) provides no description of experimental controls, layer-selection criteria, or how image-specific embeddings are isolated from intertwined text embeddings using only the transmitted tensors available to a normal participant.

    Authors: We agree the abstract is concise and omits these operational details. Section 3.2 of the manuscript specifies that image embeddings are isolated by matching the known patch-sequence length (determined from input resolution) against the shape of transmitted tensors and by selecting layers prior to full cross-modal fusion, using only positional and dimensional cues present in the tensors themselves. Layer-selection criteria were determined by measuring reconstruction fidelity across layers on held-out images from each model family; controls consisted of verifying that text-only sequences produce no matching patches. We will revise the abstract to include a one-sentence summary of the isolation procedure and the empirical layer range.

    Revision: yes

  2. Referee: [Threat Model / Extraction Algorithm] Threat model and extraction algorithm description: the assumption that a passive participant with 'limited knowledge and capability' can reliably separate image embeddings without per-family architecture analysis, forward-pass simulation, or model-specific knowledge is not shown to hold; if the extraction procedure requires such knowledge, the 100% accuracy figure and the downstream reconstruction claims cannot be realized under the stated attacker constraints.

    Authors: The extraction procedure relies only on publicly observable properties of the inference protocol: the fixed number of image patches for a given resolution and the ordering of embeddings in the transmitted sequence. No weight access, forward-pass simulation, or family-specific tuning is performed; the same rule set is applied uniformly to all four evaluated families. The 100% accuracy is measured by exact recovery of the image-patch subset from the transmitted tensors under these constraints. We will add an explicit paragraph in Section 2.2 enumerating the minimal protocol-level information assumed to be available to any participant.

    Revision: partial
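
The two responses describe an isolation rule (a known patch count plus the ordering of embeddings) and an accuracy measure (exact recovery of the image-patch subset). The sketch below is one possible reading of both, not the authors' algorithm. It assumes non-overlapping square patches, a contiguous image span, and a prefix length known from the prompt template. All names and numbers are hypothetical, and real MLLMs may pool or merge patches so that token counts differ.

```python
def num_image_tokens(height, width, patch_size):
    """Patch count for an image split into non-overlapping square patches."""
    return (height // patch_size) * (width // patch_size)

def extract_image_span(sequence, prefix_len, n_image):
    """Slice the contiguous run of image-token embeddings out of a sequence."""
    if prefix_len + n_image > len(sequence):
        raise ValueError("sequence too short to contain the image span")
    return sequence[prefix_len:prefix_len + n_image]

def extraction_accuracy(predicted, truth):
    """Fraction of samples whose predicted index set equals the true one."""
    hits = sum(1 for p, t in zip(predicted, truth) if set(p) == set(t))
    return hits / len(truth)

# Hypothetical 224x224 input with 14x14 patches: 16 * 16 image tokens.
n_image = num_image_tokens(224, 224, 14)
# Token positions stand in for embeddings: 4 prefix, n_image image, 6 suffix.
sequence = list(range(4 + n_image + 6))
span = extract_image_span(sequence, prefix_len=4, n_image=n_image)

# Two samples: the first is recovered exactly, the second is off by one.
accuracy = extraction_accuracy([span, [0, 1]], [list(range(4, 260)), [1, 2]])
```

Under this reading, a single misplaced index counts the whole sample as a miss, which is what makes a reported 100% a strict claim.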

Circularity Check

0 steps flagged

Empirical attack evaluation with no derivation chain or self-referential reductions

full rationale

The paper presents an empirical security analysis: it describes designing an image embedding extraction algorithm (achieving reported 100% accuracy in experiments across MLLM layers) and two passive black-box reconstruction attacks (MPAA, IEDA), then evaluates them on four model families. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Claims rest on experimental results rather than any chain that reduces to its own inputs by construction. This matches the default expectation of no significant circularity for an empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation present; work is empirical attack demonstration. No free parameters, axioms, or invented entities described in abstract.

pith-pipeline@v0.9.1-grok · 5848 in / 1065 out tokens · 29898 ms · 2026-06-26T20:42:35.575752+00:00 · methodology


Reference graph

Works this paper leans on

68 extracted references · 6 linked inside Pith

  1. [1]

    Petals: Collaborative inference and fine-tuning of large models,

    A. Borzunov, D. Baranchuk, T. Dettmers, M. Riabinin, Y. Belkada, A. Chumachenko, P. Samygin, and C. Raffel, “Petals: Collaborative inference and fine-tuning of large models,” in Proc. ACL, 2023, pp. 558–568.

  2. [2]

    Distributed inference and fine-tuning of large language models over the internet,

    A. Borzunov, M. Ryabinin, A. Chumachenko, D. Baranchuk, T. Dettmers, Y. Belkada, P. Samygin, and C. A. Raffel, “Distributed inference and fine-tuning of large language models over the internet,” NeurIPS, 2024

  3. [3]

    Cake: a rust framework for distributed inference of large models based on candle,

    Evilsocket, “Cake: a rust framework for distributed inference of large models based on candle,” https://github.com/evilsocket/cake, 2025

  4. [4]

    Edgeshard: Efficient llm inference via collaborative edge computing,

    M. Zhang, J. Cao, X. Shen, and Z. Cui, “Edgeshard: Efficient llm inference via collaborative edge computing,” arXiv:2405.14371, 2024

  5. [5]

    Hexgen: Generative inference of large language model over heterogeneous environment,

    Y. Jiang, R. Yan, X. Yao, Y. Zhou, B. Chen, and B. Yuan, “Hexgen: Generative inference of large language model over heterogeneous environment,” in ICML 2024, 2024

  6. [6]

    Lingualinked: A distributed large language model inference system for mobile devices,

    J. Zhao, Y. Song, S. Liu, I. G. Harris, and S. A. Jyothi, “Lingualinked: A distributed large language model inference system for mobile devices,” CoRR, vol. abs/2312.00388, 2023

  7. [7]

    Poster: Pipellm: Pipeline LLM inference on heterogeneous devices with sequence slicing,

    R. Ma, J. Wang, Q. Qi, X. Yang, H. Sun, Z. Zhuang, and J. Liao, “Poster: Pipellm: Pipeline LLM inference on heterogeneous devices with sequence slicing,” in Proc. ACM SIGCOMM 2023. ACM, 2023, pp. 1126–1128

  8. [8]

    The ai acceleration cloud,

    T. AI, “The ai acceleration cloud,” https://www.together.ai/, 2025

  9. [9]

    Find compute. train models. co-own intelligence,

    P. Intellect, “Find compute. train models. co-own intelligence,” https://www.primeintellect.ai/, 2025, online; accessed 22-August-2025

  10. [10]

    Ai infrastructure that developers love,

    Modal, “Ai infrastructure that developers love,” https://modal.com/, 2025

  11. [11]

    Prompt inference attack on distributed large language model inference frameworks,

    X. Luo, T. Yu, and X. Xiao, “Prompt inference attack on distributed large language model inference frameworks,” arXiv:2503.09291, 2025

  12. [12]

    Prompt Inversion Attack against Collaborative Inference of Large Language Models,

    W. Qu, Y. Zhou, Y. Wu, T. Xiao, B. Yuan, Y. Li, and J. Zhang, “Prompt Inversion Attack against Collaborative Inference of Large Language Models,” in SP 2025, 2025, pp. 1602–1619

  13. [13]

    Distributed learning of deep neural network over multiple agents,

    O. Gupta and R. Raskar, “Distributed learning of deep neural network over multiple agents,” J. Netw. Comput. Appl., vol. 116, pp. 1–8, 2018

  14. [14]

    Split learning for health: Distributed deep learning without sharing raw patient data,

    P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, “Split learning for health: Distributed deep learning without sharing raw patient data,” arXiv preprint arXiv:1812.00564, 2018

  15. [15]

    PCAT: Functionality and data stealing from split learning by Pseudo-Client attack,

    X. Gao and L. Zhang, “PCAT: Functionality and data stealing from split learning by Pseudo-Client attack,” in Proc. USENIX Security, 2023, pp. 5271–5288

  16. [16]

    Unleashing the tiger: Inference attacks on split learning,

    D. Pasquini, G. Ateniese, and M. Bernaschi, “Unleashing the tiger: Inference attacks on split learning,” in ACM CCS, 2021, pp. 2113–2129

  17. [17]

    Passive inference attacks on split learning via adversarial regularization,

    X. Zhu, X. Luo, Y. Wu, Y. Jiang, X. Xiao, and B. C. Ooi, “Passive inference attacks on split learning via adversarial regularization,” arXiv preprint arXiv:2310.10483, 2023

  18. [18]

    Unsplit: Data-oblivious model inversion, model stealing, and label inference attacks against split learning,

    E. Erdoğan, A. Küpçü, and A. E. Çiçek, “Unsplit: Data-oblivious model inversion, model stealing, and label inference attacks against split learning,” in Proc. WPES, 2022, pp. 115–124

  19. [19]

    Gemma 3 technical report,

    G. Team, A. Kamath, J. Ferret et al., “Gemma 3 technical report,” arXiv preprint arXiv:2503.19786, 2025

  20. [20]

    Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,

    A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V. Chaudhary, C. Chen et al., “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,” arXiv preprint arXiv:2503.01743, 2025

  21. [21]

    Qwen2.5-vl technical report,

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,” arXiv preprint arXiv:2502.13923, 2025

  22. [22]

    The llama 4 herd: The beginning of a new era of natively multimodal ai innovation,

    Meta, “The llama 4 herd: The beginning of a new era of natively multimodal ai innovation,” https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025

  23. [23]

    A fusion-denoising attack on instahide with data augmentation,

    X. Luo, X. Xiao, Y. Wu, J. Liu, and B. C. Ooi, “A fusion-denoising attack on instahide with data augmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1899–1907

  24. [24]

    Learning multiple layers of features from tiny images,

    A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009

  25. [25]

    An analysis of single-layer networks in unsupervised feature learning,

    A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in AISTATS, 2011, pp. 215–223

  26. [26]

    Deep learning face attributes in the wild,

    Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in ICCV, 2015, pp. 3730–3738

  27. [27]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,

    P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proc. ACL, 2018, pp. 2556–2565

  28. [28]

    Microsoft coco: Common objects in context,

    T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft coco: Common objects in context,” 2015

  29. [29]

    Imagenette: A smaller subset of 10 easily classified classes from imagenet

    J. Howard, “Imagenette: A smaller subset of 10 easily classified classes from imagenet.” [Online]. Available: https://github.com/fastai/imagenette

  30. [30]

    midjourney-prompts,

    S. AI, “midjourney-prompts,” https://huggingface.co/datasets/succinctly/midjourney-prompts, 2025, online; accessed 21-October-2025

  31. [31]

    Understanding deep learning requires rethinking generalization,

    C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in ICLR, 2017

  32. [32]

    Coding theorems for a discrete source with a fidelity criterion,

    C. E. Shannon et al., “Coding theorems for a discrete source with a fidelity criterion,” IRE Nat. Conv. Rec, vol. 4, no. 142-163, p. 1, 1959

  33. [33]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020

  34. [34]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE/CVF CVPR, 2022, pp. 10 684–10 695

  35. [35]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,

    C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in Proc. AAAI, vol. 38, no. 5, 2024

  36. [36]

    Adding conditional control to text-to-image diffusion models,

    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proc. ICCV, 2023, pp. 3836–3847

  37. [37]

    Rethinking the inception architecture for computer vision,

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE CVPR, 2016, pp. 2818–2826

  38. [38]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

  39. [39]

    Improved training of wasserstein gans,

    I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” Proc. NeurIPS, vol. 30, 2017

  40. [40]

    Classifier-free diffusion guidance,

    J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022

  41. [41]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE CVPR. IEEE, 2009, pp. 248–255

  42. [42]

    Image quality assessment: from error visibility to structural similarity,

    Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004

  43. [43]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, 2016, pp. 770–778

  44. [44]

    Mllms know where to look: Training-free perception of small visual details with multimodal llms,

    J. Zhang, M. Khayatkhoei, P. Chhikara, and F. Ilievski, “Mllms know where to look: Training-free perception of small visual details with multimodal llms,” arXiv preprint arXiv:2502.17422, 2025

  45. [45]

    Efficient generation of targeted and transferable adversarial examples for vision-language models via diffusion models,

    Q. Guo, S. Pang, X. Jia, Y. Liu, and Q. Guo, “Efficient generation of targeted and transferable adversarial examples for vision-language models via diffusion models,” IEEE TIFS, 2024

  46. [46]

    MP-SPDZ: A versatile framework for multi-party computation,

    M. Keller, “MP-SPDZ: A versatile framework for multi-party computation,” in ACM CCS 2020. ACM, 2020, pp. 1575–1590

  47. [47]

    Trusted execution environments: properties, applications, and challenges,

    P. Jauernig, A.-R. Sadeghi, and E. Stapf, “Trusted execution environments: properties, applications, and challenges,” IEEE S&P, 2020

  48. [48]

    Bumblebee: Secure two-party inference framework for large transformers,

    W.-j. Lu, Z. Huang, Z. Gu, J. Li, J. Liu, C. Hong, K. Ren, T. Wei, and W. Chen, “Bumblebee: Secure two-party inference framework for large transformers,” Cryptology ePrint Archive, 2023

  49. [49]

    Information leakage in embedding models,

    C. Song and A. Raghunathan, “Information leakage in embedding models,” in ACM CCS 2020. ACM, 2020, pp. 377–390

  50. [50]

    Context-aware membership inference attacks against pre-trained large language models,

    H. Chang, A. Shahin Shamsabadi, K. Katevas, H. Haddadi, and R. Shokri, “Context-aware membership inference attacks against pre-trained large language models,” in EMNLP 2025. Association for Computational Linguistics, Nov. 2025, pp. 7299–7321

  51. [51]

    Thieves on sesame street! model extraction of bert-based apis,

    K. Krishna, G. S. Tomar, A. P. Parikh, N. Papernot, and M. Iyyer, “Thieves on sesame street! model extraction of bert-based apis,” in ICLR 2020, 2020

  52. [52]

    Grey-box extraction of natural language models,

    S. Zanella-Beguelin, S. Tople, A. Paverd, and B. Köpf, “Grey-box extraction of natural language models,” in ICML. PMLR, 2021, pp. 12 278–12 286

  53. [53]

    Effective prompt extraction from language models,

    Y. Zhang, N. Carlini, and D. Ippolito, “Effective prompt extraction from language models,” in First Conference on Language Modeling, 2024

  54. [54]

    Sentence embedding leaks more information than you expect: Generative embedding inversion attack to recover the whole sentence,

    H. Li, M. Xu, and Y. Song, “Sentence embedding leaks more information than you expect: Generative embedding inversion attack to recover the whole sentence,” in ACL 2023, 2023, pp. 14 022–14 040

  55. [55]

    Text embeddings reveal (almost) as much as text,

    J. X. Morris, V. Kuleshov, V. Shmatikov, and A. M. Rush, “Text embeddings reveal (almost) as much as text,” in Proc. EMNLP 2023, 2023, pp. 12 448–12 460

  56. [56]

    Extracting prompts by inverting LLM outputs,

    C. Zhang, J. X. Morris, and V. Shmatikov, “Extracting prompts by inverting LLM outputs,” in Proc. EMNLP 2024, 2024, pp. 14 753–14 777

  57. [57]

    Transferable adversarial attacks on black-box vision-language models,

    K. Hu, W. Yu, L. Zhang, A. Robey, A. Zou, C. Xu, H. Hu, and M. Fredrikson, “Transferable adversarial attacks on black-box vision-language models,” arXiv preprint arXiv:2505.01050, 2025

  58. [58]

    Leakyclip: Extracting training data from clip,

    Y. Chen, S. Wang, X. Wang, and X. Ma, “Leakyclip: Extracting training data from clip,” arXiv preprint arXiv:2508.00756, 2025

  59. [59]

    Drag: Data reconstruction attack using guided diffusion,

    W.-K. Lei, J.-C. Chen, and S.-T. Chen, “Drag: Data reconstruction attack using guided diffusion,” arXiv preprint arXiv:2509.11724, 2025

  60. [60]

    Laion-5b: An open large-scale dataset for training next generation image-text models,

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Proc. NeurIPS, vol. 35, pp. 25 278–25 294, 2022.

  61. [61]

    The convolution inequality for entropy powers,

    N. Blachman, “The convolution inequality for entropy powers,” IEEE Transactions on Information theory, vol. 11, no. 2, pp. 267–271, 2003

  62. [62]

    Vector quantization and signal compression

    A. Gersho and R. M. Gray, Vector quantization and signal compression. Springer Science & Business Media, 2012, vol. 159

  63. [63]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” 2023

  64. [64]

    Gpt api pricing,

    OpenAI, “Gpt api pricing,” https://openai.com/api/pricing/, 2025, online; accessed 14-November-2025

  65. [65]

    Gemini api pricing,

    Google, “Gemini api pricing,” https://ai.google.dev/gemini-api/docs/pricing, 2025, online; accessed 14-November-2025

  66. [66]

    Calibrating noise for group privacy in subsampled mechanisms,

    Y. Jiang, X. Luo, Y. Yang, and X. Xiao, “Calibrating noise for group privacy in subsampled mechanisms,” arXiv preprint arXiv:2408.09943, 2024

  67. [67]

    Feature inference attack on model predictions in vertical federated learning,

    X. Luo, Y. Wu, X. Xiao, and B. C. Ooi, “Feature inference attack on model predictions in vertical federated learning,” in 2021 IEEE 37th international conference on data engineering (ICDE). IEEE, 2021, pp. 181–192.

  68. [68]

    In our setup, three participants are distributed across the two local GPUs, with the second participant designated as the attacker

    and host it on a server with an AMD EPYC 9654 processor (192 cores), two NVIDIA A100-SXM4 40GB GPUs, and 750 GB of RAM, under Ubuntu 22.04. In our setup, three participants are distributed across the two local GPUs, with the second participant designated as the attacker. Note that the number of participants does not affect the attack mechanism; the primar...