arxiv: 2512.03121 · v2 · pith:EMCGPOXEnew · submitted 2025-12-02 · 💻 cs.CR · cs.AI

Lost in Modality: Evaluating the Effectiveness of Text-Based Membership Inference Attacks on Large Multimodal Models

Pith reviewed 2026-05-22 11:38 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords membership inference attacks · multimodal language models · data privacy · out-of-distribution · visual regularization · logit-based attacks · training data leakage

The pith

Visual inputs regularize and mask membership signals for text-based attacks on multimodal models in out-of-distribution settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the extension of text-based membership inference attacks to large multimodal models that handle both text and images. Experiments compare attacks using only text against those using vision and text together, across in-distribution and out-of-distribution data for two model families. Results indicate comparable attack performance in in-distribution cases with a minor benefit from visuals, but visuals reduce the attack's ability to detect membership in out-of-distribution cases by acting as a regularizer. This matters because multimodal models are increasingly deployed in applications where training data privacy is a concern, and understanding these modality interactions can inform better security practices.

Core claim

Experiments under vision-and-text and text-only conditions across the DeepSeek-VL and InternVL model families demonstrate that logit-based membership inference attacks perform comparably in in-distribution settings with a slight vision-and-text advantage, while in out-of-distribution settings visual inputs act as regularizers that effectively mask membership signals.

What carries the argument

Visual inputs serving as regularizers that obscure membership signals in out-of-distribution multimodal inference attacks.

Load-bearing premise

The model families and in-distribution versus out-of-distribution splits chosen represent the typical behavior of large multimodal models in general.

What would settle it

An experiment showing that visual inputs do not reduce membership inference success rates on out-of-distribution data for additional multimodal models would challenge the central observation.

Figures

Figures reproduced from arXiv: 2512.03121 by Feifei Sun, Le Minh Nguyen, Ziyi Tong.

Figure 1: Architecture schematic of the vision–language fusion pipeline in modern …
Figure 2: DeepSeek-VL MIA performance in OOD (left) and In-Distribution (right)

Large Multimodal Language Models (MLLMs) are emerging as one of the foundational tools in an expanding range of applications. Consequently, understanding training-data leakage in these systems is increasingly critical. Log-probability-based membership inference attacks (MIAs) have become a widely adopted approach for assessing data exposure in large language models (LLMs), yet their effect in MLLMs remains unclear. We present the first comprehensive evaluation of extending these text-based MIA methods to multimodal settings. Our experiments under vision-and-text (V+T) and text-only (T-only) conditions across the DeepSeek-VL and InternVL model families show that in in-distribution settings, logit-based MIAs perform comparably across configurations, with a slight V+T advantage. Conversely, in out-of-distribution settings, visual inputs act as regularizers, effectively masking membership signals.
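
As a rough illustration of what a log-probability-based membership score looks like, the sketch below computes two common variants from a list of per-token log-probabilities: the sequence mean, and a Min-K%-style score in the spirit of reference [11]. This page does not state which scores the paper actually uses, so the function names, the 20% default, and the toy numbers are assumptions for illustration only.

```python
def mean_log_prob(token_log_probs):
    """Average per-token log-probability; higher values suggest a more familiar text."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return sum(token_log_probs) / len(token_log_probs)


def min_k_log_prob(token_log_probs, k_percent=20):
    """Min-K%-style score: average of the k% lowest token log-probabilities.

    k_percent is a hypothetical default, not a value taken from the paper.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Ceiling of len * k_percent / 100 in integer arithmetic, at least one token.
    n = max(1, (len(token_log_probs) * k_percent + 99) // 100)
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n


# Toy log-probabilities (made up for illustration, not model output).
toy = [-0.1, -0.2, -3.0, -0.5, -4.0]
sequence_score = mean_log_prob(toy)
tail_score = min_k_log_prob(toy, k_percent=40)
```

In a V+T condition the same functions would be applied to text-token log-probabilities obtained with the image in context, and in a T-only condition to those obtained without it; a sample is flagged as a likely member when its score exceeds a chosen threshold.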

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript evaluates the effectiveness of text-based membership inference attacks (MIAs) on large multimodal language models (MLLMs). It extends log-probability-based MIA methods to multimodal settings and compares vision-and-text (V+T) versus text-only (T-only) conditions across the DeepSeek-VL and InternVL model families. The central claims are that MIAs perform comparably in in-distribution settings (with a slight V+T advantage) while visual inputs act as regularizers that mask membership signals in out-of-distribution settings.

Significance. If the OOD regularization effect holds beyond the tested models and splits, the work would offer valuable empirical evidence on how multimodal conditioning can mitigate text-based data leakage in MLLMs. This has direct implications for privacy assessment and design of multimodal systems, extending existing MIA literature from LLMs to MLLMs in a practical, attack-oriented manner.

major comments (1)
  1. [Experiments section] The claim that visual inputs regularize and mask membership signals specifically in OOD regimes rests on results from only two model families (DeepSeek-VL and InternVL) and particular in/out-of-distribution partitions. Without cross-family validation or ablations examining the interaction between the visual encoder and text token probabilities, it remains unclear whether the observed drop in MIA performance is a general property of multimodal conditioning or tied to shared characteristics of these specific models and splits.
minor comments (1)
  1. [Abstract] The summary states clear findings but omits specific metrics, statistical details, dataset descriptions, and error analysis, which would strengthen the reader's ability to evaluate the reported effects.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their valuable feedback on our manuscript. We address the major comment below and indicate the changes we will make in the revised version.

  1. Referee: The claim that visual inputs regularize and mask membership signals specifically in OOD regimes rests on results from only two model families (DeepSeek-VL and InternVL) and particular in/out-of-distribution partitions. Without cross-family validation or ablations examining the interaction between the visual encoder and text token probabilities, it remains unclear whether the observed drop in MIA performance is a general property of multimodal conditioning or tied to shared characteristics of these specific models and splits.

    Authors: We appreciate this point regarding the generalizability of our results. DeepSeek-VL and InternVL were selected as they represent two prominent and architecturally diverse open-source MLLM families, with different vision encoders and training data. The OOD masking effect was observed consistently in both, which we believe provides meaningful support for the claim. We agree that validation on additional models would be ideal to confirm it is a general property of multimodal conditioning. In the revised manuscript, we will add a paragraph in the discussion section acknowledging this limitation and outlining plans for future cross-family experiments. Regarding specific ablations on the visual encoder's interaction with text probabilities, our V+T vs. T-only setup directly measures the impact of adding visual inputs on the text logit distributions used for the MIA. More granular ablations, such as varying the visual encoder while keeping the LLM fixed, are beyond the scope of the current work due to the significant engineering and compute requirements, but we will note this as an important direction for future research.

    revision: partial
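
The V+T versus T-only comparison described in the response can be summarized by scoring members and non-members under each condition and comparing attack quality. The sketch below uses AUROC, a common MIA metric; this page does not say which metrics the paper reports, so the metric choice, the function names, and the toy scores are assumptions rather than the authors' protocol.

```python
def auroc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win; 0.5 means the attack is no better than chance.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def modality_gap(vt_members, vt_nonmembers, t_members, t_nonmembers):
    """AUROC under V+T minus AUROC under T-only (hypothetical helper).

    A negative gap means adding the image made members harder to detect.
    """
    return auroc(vt_members, vt_nonmembers) - auroc(t_members, t_nonmembers)


# Toy membership scores (made up for illustration, not results from the paper).
gap = modality_gap(
    vt_members=[1.0, 2.0], vt_nonmembers=[1.0, 2.0],  # indistinguishable
    t_members=[0.9, 0.8], t_nonmembers=[0.1, 0.2],    # perfectly separated
)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; a rank-based implementation would be preferable for large evaluation sets.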

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation

This paper is an empirical evaluation study that reports experimental results from running text-based membership inference attacks on two specific MLLM families (DeepSeek-VL and InternVL) under V+T and T-only conditions, in both in-distribution and out-of-distribution regimes. No mathematical derivations, equations, fitted parameters presented as predictions, or self-referential constructs appear in the provided abstract or described content. The central observation that visual inputs act as regularizers masking membership signals in OOD settings is presented as a direct experimental finding rather than a reduction to prior inputs or self-citations. Any self-citations that may exist do not load-bear the claims, which rest on the reported experimental observations on named models and splits. The analysis is therefore self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study applies existing logit-based MIA techniques to new model types without introducing new parameters, axioms beyond standard ML assumptions, or invented entities.

axioms (1)
  • domain assumption: Log-probability-based membership inference methods developed for text-only LLMs remain valid when applied to multimodal models.
    The evaluation extends prior LLM MIA methods to MLLMs under the assumption that the core attack logic transfers.

pith-pipeline@v0.9.0 · 5677 in / 1250 out tokens · 68351 ms · 2026-05-22T11:38:04.420958+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024

  2. [2]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024

  3. [3]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024

  4. [4]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024

  5. [5]

    Pretraining data exposure in large language models: A survey of membership inference, data contamination, and security implications

    Ziyi Tong, Feifei Sun, and Le Minh Nguyen. Pretraining data exposure in large language models: A survey of membership inference, data contamination, and security implications. In International Conference on Applications of Natural Language to Information Systems, pages 152–162. Springer, 2025

  6. [6]

    Recall: Membership inference via relative conditional log-likelihoods

    Roy Xie, Junlin Wang, Ruomin Huang, Minxing Zhang, Rong Ge, Jian Pei, Neil Zhenqiang Gong, and Bhuwan Dhingra. Recall: Membership inference via relative conditional log-likelihoods. arXiv preprint arXiv:2406.15968, 2024

  7. [7]

    Membership inference attacks against large vision-language models

    Zhan Li, Yongtao Wu, Yihang Chen, Francesco Tonin, Elias Abad Rocamora, and Volkan Cevher. Membership inference attacks against large vision-language models. Advances in Neural Information Processing Systems, 37:98645–98674, 2024

  8. [8]

    Privacy risk in machine learning: Analyzing the connection to overfitting

    Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st computer security foundations symposium (CSF), pages 268–282. IEEE, 2018

  9. [9]

    Membership inference attacks from first principles

    Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. In 2022 IEEE symposium on security and privacy (SP), pages 1897–1914. IEEE, 2022

  10. [10]

    Extracting training data from large language models

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX security symposium (USENIX Security 21), pages 2633–2650, 2021

  11. [11]

    Detecting Pretraining Data from Large Language Models

    Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023

  12. [12]

    Min-k%++: Improved baseline for detecting pre-training data from large language models

    Jingyang Zhang, Jingwei Sun, Eric Yeats, Yang Ouyang, Martin Kuo, Jianyi Zhang, Hao Frank Yang, and Hai Li. Min-k%++: Improved baseline for detecting pre-training data from large language models. arXiv preprint arXiv:2404.02936, 2024

  13. [13]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022

  14. [14]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In European conference on computer vision, pages 235–251. Springer, 2016