Lost in Modality: Evaluating the Effectiveness of Text-Based Membership Inference Attacks on Large Multimodal Models
Pith reviewed 2026-05-22 11:38 UTC · model grok-4.3
The pith
Visual inputs regularize and mask membership signals for text-based attacks on multimodal models in out-of-distribution settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments under vision-and-text and text-only conditions across the DeepSeek-VL and InternVL model families demonstrate that logit-based membership inference attacks perform comparably in in-distribution settings with a slight vision-and-text advantage, while in out-of-distribution settings visual inputs act as regularizers that effectively mask membership signals.
What carries the argument
Visual inputs serving as regularizers that obscure membership signals in out-of-distribution multimodal inference attacks.
Load-bearing premise
The model families and in-distribution versus out-of-distribution splits chosen represent the typical behavior of large multimodal models in general.
What would settle it
An experiment showing that visual inputs do not reduce membership inference success rates on out-of-distribution data for additional multimodal models would challenge the central observation.
Original abstract
Large Multimodal Language Models (MLLMs) are emerging as one of the foundational tools in an expanding range of applications. Consequently, understanding training-data leakage in these systems is increasingly critical. Log-probability-based membership inference attacks (MIAs) have become a widely adopted approach for assessing data exposure in large language models (LLMs), yet their effect in MLLMs remains unclear. We present the first comprehensive evaluation of extending these text-based MIA methods to multimodal settings. Our experiments under vision-and-text (V+T) and text-only (T-only) conditions across the DeepSeek-VL and InternVL model families show that in in-distribution settings, logit-based MIAs perform comparably across configurations, with a slight V+T advantage. Conversely, in out-of-distribution settings, visual inputs act as regularizers, effectively masking membership signals.
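To make "log-probability-based MIA" concrete, here is a minimal sketch of two scores commonly used in this literature: the average token log-likelihood (a loss-style attack) and a Min-K% style score that averages only the least likely tokens. The abstract does not say which scores the paper evaluates, so the function names, the `k` default, and the input format (a plain list of per-token log-probabilities) are assumptions made for illustration.

```python
def mean_log_likelihood(token_logprobs):
    """Average per-token log-probability of a text under the model.

    A higher value (lower loss) is read as weak evidence that the text
    was seen during training.
    """
    return sum(token_logprobs) / len(token_logprobs)


def min_k_percent(token_logprobs, k=0.2):
    """Average of the lowest k fraction of token log-probabilities.

    Min-K% style score: it looks only at the least likely tokens, on the
    idea that unseen text tends to contain a few very surprising tokens.
    `k` is an illustrative default, not a value taken from the paper.
    """
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n
```

Either score is then thresholded, or ranked across candidate texts, to decide membership. In a V+T versus T-only comparison, the same score would be computed twice per sample, once with and once without the image in the model's context.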
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates the effectiveness of text-based membership inference attacks (MIAs) on large multimodal language models (MLLMs). It extends log-probability-based MIA methods to multimodal settings and compares vision-and-text (V+T) versus text-only (T-only) conditions across the DeepSeek-VL and InternVL model families. The central claims are that MIAs perform comparably in in-distribution settings (with a slight V+T advantage) while visual inputs act as regularizers that mask membership signals in out-of-distribution settings.
Significance. If the OOD regularization effect holds beyond the tested models and splits, the work would offer valuable empirical evidence on how multimodal conditioning can mitigate text-based data leakage in MLLMs. This has direct implications for privacy assessment and design of multimodal systems, extending existing MIA literature from LLMs to MLLMs in a practical, attack-oriented manner.
major comments (1)
- [Experiments section] The claim that visual inputs regularize and mask membership signals specifically in OOD regimes rests on results from only two model families (DeepSeek-VL and InternVL) and particular in/out-of-distribution partitions. Without cross-family validation or ablations examining the interaction between the visual encoder and text token probabilities, it remains unclear whether the observed drop in MIA performance is a general property of multimodal conditioning or tied to shared characteristics of these specific models and splits.
minor comments (1)
- [Abstract] The abstract states its findings clearly but omits specific metrics, statistical details, dataset descriptions, and error analysis; including them would strengthen the reader's ability to evaluate the reported effects.
Simulated Author's Rebuttal
We thank the referee for their valuable feedback on our manuscript. We address the major comment below and indicate the changes we will make in the revised version.
Referee: The claim that visual inputs regularize and mask membership signals specifically in OOD regimes rests on results from only two model families (DeepSeek-VL and InternVL) and particular in/out-of-distribution partitions. Without cross-family validation or ablations examining the interaction between the visual encoder and text token probabilities, it remains unclear whether the observed drop in MIA performance is a general property of multimodal conditioning or tied to shared characteristics of these specific models and splits.
Authors: We appreciate this point regarding the generalizability of our results. DeepSeek-VL and InternVL were selected as they represent two prominent and architecturally diverse open-source MLLM families, with different vision encoders and training data. The OOD masking effect was observed consistently in both, which we believe provides meaningful support for the claim. We agree that validation on additional models would be ideal to confirm it is a general property of multimodal conditioning. In the revised manuscript, we will add a paragraph in the discussion section acknowledging this limitation and outlining plans for future cross-family experiments. Regarding specific ablations on the visual encoder's interaction with text probabilities, our V+T vs. T-only setup directly measures the impact of adding visual inputs on the text logit distributions used for the MIA. More granular ablations, such as varying the visual encoder while keeping the LLM fixed, are beyond the scope of the current work due to the significant engineering and compute requirements, but we will note this as an important direction for future research.
Revision: partial
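The V+T versus T-only comparison discussed in this exchange comes down to scoring the same member and non-member texts under each condition and comparing how well the scores separate the two groups. A minimal sketch of that comparison using AUC follows. The helper names and the toy numbers are hypothetical, and the page does not state which metric the paper reports.

```python
def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win. A value of 0.5 means the attack is at
    chance level; a value near 1.0 means strong separation.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def compare_conditions(scores):
    """Map each condition name to its AUC.

    `scores` is {condition: (member_scores, nonmember_scores)}.
    """
    return {name: auc(members, nonmembers)
            for name, (members, nonmembers) in scores.items()}


# Placeholder numbers for illustration only; they are NOT results from the paper.
toy = {
    "T-only": ([-1.0, -1.2, -0.8], [-1.1, -1.5, -1.4]),
    "V+T": ([-1.0, -1.3, -0.9], [-1.1, -1.2, -1.4]),
}
print(compare_conditions(toy))
```

Under this framing, the masking effect described in the paper would appear as the V+T AUC moving toward 0.5 on out-of-distribution data while the T-only AUC stays higher. Whether such a gap is meaningful depends on sample size and variance, which is the kind of statistical detail the referee asks for.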
Circularity Check
No significant circularity: purely empirical evaluation
This paper is an empirical evaluation study that reports experimental results from running text-based membership inference attacks on two specific MLLM families (DeepSeek-VL and InternVL) under V+T and T-only conditions, in both in-distribution and out-of-distribution regimes. No mathematical derivations, equations, fitted parameters presented as predictions, or self-referential constructs appear in the provided abstract or described content. The central observation that visual inputs act as regularizers masking membership signals in OOD settings is presented as a direct experimental finding rather than a reduction to prior inputs or self-citations. Any self-citations that may exist do not load-bear the claims, which rest on the reported experimental observations on named models and splits. The analysis is therefore self-contained with no circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Log-probability-based membership inference methods developed for text-only LLMs remain valid when applied to multimodal models.
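One practical detail behind this assumption is that, in a vision-and-text run, the model's sequence also contains image positions, so a text-based attack has to read log-probabilities at text positions only. The sketch below assumes the usual causal-LM convention that the logits at position t score the token at position t+1. The function names and the `is_text` mask are hypothetical, and real MLLMs expand image placeholders in model-specific ways, so aligning the mask is left to the caller.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def text_token_logprobs(logits, token_ids, is_text):
    """Log-probabilities of the text tokens only.

    logits[t] is assumed to score the token at position t + 1.
    Positions flagged False in `is_text` (for example image
    placeholders) are skipped, so the attack score is computed on the
    same text tokens whether or not an image is in the context.
    """
    out = []
    for t in range(len(token_ids) - 1):
        target = t + 1
        if not is_text[target]:
            continue  # not a text token: excluded from the MIA score
        out.append(log_softmax(logits[t])[token_ids[target]])
    return out
```

With this separation, any difference between the V+T and T-only scores comes from how the image context shifts the text-token distributions, not from scoring a different set of tokens.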
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "in out-of-distribution settings, visual inputs act as regularizers, effectively masking membership signals"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.