pith. machine review for the scientific record.

arxiv: 2603.14222 · v2 · submitted 2026-03-15 · 💻 cs.CR · cs.AI

Recognition: no theorem link

Membership Inference for Contrastive Pre-training Models with Text-only PII Queries

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 12:05 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords membership inference · contrastive pre-training · CLIP · CLAP · privacy auditing · text-only queries · multimodal memorization · personally identifiable information

The pith

Text-only queries can detect if contrastive models like CLIP memorized private data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that membership inference attacks on multimodal contrastive pre-training models do not require access to images, audio, or other paired biometric inputs. Instead, text queries alone guide a latent inversion process inside the model to produce two signals: how closely the inversion aligns with the text, and how consistent those alignments are across random restarts. These signals are compared against a simple baseline of random gibberish text to flag likely training-set members. This matters because it removes the need to feed sensitive data into the model during an audit, lowering both computational cost and privacy risk while still achieving strong detection across CLIP and CLAP variants.

Core claim

Multimodal memorization within these foundational encoders can be accurately inferred using exclusively the text modality. The Unimodal Membership Inference Detector performs text-guided cross-modal latent inversion, extracts complementary similarity and variability statistics, constructs a lightweight non-member reference from synthetic gibberish, and decides membership via an ensemble of unsupervised anomaly detectors.

What carries the argument

Unimodal Membership Inference Detector (UMID), which uses text-guided cross-modal latent inversion to extract similarity (alignment to the queried text) and variability (consistency across randomized inversions) signals for comparison against a synthetic-gibberish reference.
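A minimal sketch of that inversion loop, assuming a differentiable PyTorch stand-in `image_encoder` for the target model's non-text tower (a hypothetical interface; the abstract does not specify the optimizer, the step budget, or whether inversion runs in input or latent space, and the restart/step counts below are illustrative):

```python
import torch
import torch.nn.functional as F

def umid_signals(text_emb, image_encoder, input_shape,
                 n_restarts=8, steps=200, lr=0.05, device="cpu"):
    """Text-guided cross-modal latent inversion (sketch).

    For each random restart, optimize a synthetic non-text input so that the
    target encoder's embedding aligns with the queried text embedding, then
    summarize the restarts by their mean alignment (similarity signal) and
    spread across restarts (variability signal).
    """
    text_emb = F.normalize(text_emb.detach().to(device), dim=-1)
    finals = []
    for _ in range(n_restarts):
        x = torch.randn(1, *input_shape, device=device, requires_grad=True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            z = F.normalize(image_encoder(x), dim=-1)
            loss = -(z * text_emb).sum()  # maximize cosine similarity
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            z = F.normalize(image_encoder(x), dim=-1)
            finals.append(float((z * text_emb).sum()))
    finals = torch.tensor(finals)
    return finals.mean().item(), finals.var(unbiased=False).item()
```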

If this is right

  • Audits become feasible at sub-second cost per query without exposing biometric data.
  • Shadow-model training is avoided, removing the main computational barrier for large backbones.
  • The same framework applies to both vision-language and audio-language contrastive models.
  • Auditing complies with constraints that prohibit feeding private inputs to the target model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • One modality can leak detectable traces of cross-modal training-set membership.
  • Third parties could run audits on deployed models without ever handling the original private data.
  • The inversion technique might generalize to other cross-modal architectures that share a joint latent space.

Load-bearing premise

Inverting text into the cross-modal latent space produces signals that reliably distinguish training members from non-members even without any paired biometric input.

What would settle it

If similarity and variability statistics from the inversion process are statistically identical for known training-set texts and known non-member texts, the detection decisions would collapse to chance level.
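A sketch of that decisive check, assuming the experimenter holds ground-truth member and non-member text sets (labels used only for evaluation, never by UMID itself): if the two distributions of inversion statistics are statistically identical, the two-sample test finds no gap and the AUC collapses toward 0.5.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

def separation_check(member_stats, nonmember_stats):
    """Kolmogorov-Smirnov two-sample test plus the AUC that thresholding the
    statistics would achieve; identical distributions give a large p-value
    and AUC near 0.5 (chance-level detection)."""
    ks = ks_2samp(member_stats, nonmember_stats)
    labels = np.r_[np.ones(len(member_stats)), np.zeros(len(nonmember_stats))]
    scores = np.r_[member_stats, nonmember_stats]
    return ks.pvalue, roc_auc_score(labels, scores)
```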

Figures

Figures reproduced from arXiv: 2603.14222 by Haoxuan Ma, Hongyi Zhang, Jian Zhao, Ruoxi Cheng, Tianle Zhang, Xuelong Li, Yiyan Huang, Yizhong Ding.

Figure 1. Overview of the UMID auditing framework and the resulting distributional gap. The UMID method enables text-only membership inference (a) and …

Figure 2. Pipeline of UMID. We employ an optimizer guided by the target model to align non-text embeddings with PII text embeddings, maximizing their cosine similarity. By analyzing similarity and variability features of these optimized samples relative to the synthetic gibberish baseline, an anomaly detection system identifies abnormal patterns to infer the membership of the input text.

Figure 3. Detection accuracy for the CLIP model (ResNet-50) under various parameters.

Figure 4. Detection accuracy for the CLAP model (LibriSpeech) under various parameters.

Figure 5. Empirical validation of geometric separation. (a) Convergence of similarity …
Original abstract

Contrastive pretraining models such as CLIP and CLAP serve as the ubiquitous perceptual backbones for modern multimodal large models, yet their reliance on web-scale data raises growing concerns about memorizing Personally Identifiable Information (PII). Auditing such models via membership inference is challenging in practice: shadow-model MIAs are computationally prohibitive for large multimodal backbones, and existing multimodal auditing methods typically require querying the target with paired biometric inputs, thereby directly exposing sensitive biometric information to the target model. To bypass this critical limitation, we demonstrate a highly desirable capability for privacy auditing: multimodal memorization within these foundational encoders can be accurately inferred using exclusively the text modality. We propose Unimodal Membership Inference Detector (UMID), a text-only auditing framework that performs text-guided cross-modal latent inversion and extracts two complementary signals, similarity (alignment to the queried text) and variability (consistency across randomized inversions). UMID compares these statistics to a lightweight non-member reference constructed from synthetic gibberish and makes decisions via an ensemble of unsupervised anomaly detectors. Comprehensive experiments across diverse CLIP and CLAP architectures demonstrate that UMID significantly improves the effectiveness and efficiency over prior MIAs, delivering strong detection performance with sub-second auditing cost using solely text queries, completely circumventing the need for biometric inputs and complying with strict privacy constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes UMID, a text-only membership inference framework for contrastive pre-training models such as CLIP and CLAP. It performs text-guided cross-modal latent inversion on PII text queries to extract similarity (alignment) and variability (consistency across randomizations) statistics, then flags anomalies relative to a lightweight synthetic-gibberish non-member reference via an ensemble of unsupervised anomaly detectors. The central claim is that this yields strong detection of multimodal memorization using exclusively text queries, with sub-second cost and without exposing biometric inputs.

Significance. If the core claim holds under proper controls, the work offers a practical, privacy-compliant auditing tool for web-scale multimodal encoders that avoids the computational cost of shadow models and the exposure risks of paired biometric queries. This could meaningfully advance membership-inference methodology in the multimodal setting.

major comments (2)
  1. [Method (text-guided inversion and reference construction)] The decision procedure depends on the synthetic-gibberish reference producing a distribution that is reliably separable from both members and real non-member natural text. No evidence is provided that real unseen natural-language descriptions (e.g., public-figure captions never seen in pre-training) yield similarity/variability statistics closer to the gibberish reference than to training-set members; if they do not, the anomaly detector will systematically misclassify non-members.
  2. [Abstract and Experiments] Abstract and experimental claims of 'strong detection performance' and 'significant improvement over prior MIAs' are stated without any reported AUC, TPR@FPR, or baseline numbers, nor any ablation on the anomaly-ensemble hyperparameters. Because the central claim is empirical, the absence of these quantitative results in the provided abstract leaves the effectiveness unverified.
minor comments (2)
  1. [Method] Notation for the two extracted signals (similarity and variability) should be defined with explicit equations rather than descriptive phrases to allow precise reproduction.
  2. [Decision procedure] The paper should clarify whether the unsupervised anomaly detectors are applied per-query or across a batch, and report any sensitivity to the choice of detector family (see the sketch after this list).
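To make the reference construction and decision procedure concrete, here is a per-query sketch using scikit-learn detector families consistent with the anomaly-detection works the paper cites (isolation forest, one-class SVM, local outlier factor); the actual detector choices, features, and voting rule are assumptions here, not confirmed by the abstract:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

def fit_reference(gibberish_feats):
    """Fit unsupervised detectors on the (similarity, variability) statistics
    extracted from synthetic gibberish queries, shape (n_queries, 2)."""
    return [
        IsolationForest(random_state=0).fit(gibberish_feats),
        OneClassSVM(nu=0.1).fit(gibberish_feats),
        LocalOutlierFactor(novelty=True).fit(gibberish_feats),
    ]

def flag_member(query_feat, detectors, min_votes=2):
    """Per-query decision: a text is flagged as a likely training member when
    a majority of detectors call its statistics anomalous relative to the
    gibberish baseline (scikit-learn's predict returns -1 for outliers)."""
    x = np.asarray(query_feat, dtype=float).reshape(1, -1)
    votes = sum(int(d.predict(x)[0] == -1) for d in detectors)
    return votes >= min_votes
```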

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and describe the planned revisions.

Point-by-point responses
  1. Referee: [Method (text-guided inversion and reference construction)] The decision procedure depends on the synthetic-gibberish reference producing a distribution that is reliably separable from both members and real non-member natural text. No evidence is provided that real unseen natural-language descriptions (e.g., public-figure captions never seen in pre-training) yield similarity/variability statistics closer to the gibberish reference than to training-set members; if they do not, the anomaly detector will systematically misclassify non-members.

    Authors: We agree this is an important validation point for the reference construction. Our current experiments demonstrate effective separation using the gibberish reference against the evaluated non-member sets, but we acknowledge the value of explicitly comparing real unseen natural-language texts (e.g., public captions). In the revision we will add a new analysis subsection with quantitative comparisons of similarity and variability statistics for such real non-member texts versus both members and the gibberish reference, confirming the anomaly detector's behavior. revision: yes

  2. Referee: [Abstract and Experiments] Abstract and experimental claims of 'strong detection performance' and 'significant improvement over prior MIAs' are stated without any reported AUC, TPR@FPR, or baseline numbers, nor any ablation on the anomaly-ensemble hyperparameters. Because the central claim is empirical, the absence of these quantitative results in the provided abstract leaves the effectiveness unverified.

    Authors: We agree that the abstract should contain the key quantitative metrics to support the empirical claims. The full manuscript already includes AUC, TPR@FPR, baseline comparisons, and hyperparameter ablations in the experiments section. We will revise the abstract to report these specific results and ensure the ablation study is more prominently referenced. revision: yes
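For context on what such reporting involves: once per-query scores and ground-truth membership labels exist, the requested metrics take a few lines to compute. A minimal sketch; the 1% FPR target echoes common MIA reporting practice (see reference [58] in the graph below) and is not a number from this paper.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def mia_metrics(scores, labels, fpr_target=0.01):
    """AUC plus TPR at a fixed low FPR, the low-false-positive regime
    emphasized in the membership-inference literature."""
    fpr, tpr, _ = roc_curve(labels, scores)
    return auc(fpr, tpr), float(np.interp(fpr_target, fpr, tpr))
```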

Circularity Check

0 steps flagged

No circularity: unsupervised anomaly detection on extracted statistics is independent of labeled membership data

full rationale

The paper's core procedure (text-guided cross-modal inversion to obtain similarity/variability statistics, followed by comparison to a fixed synthetic-gibberish reference via unsupervised anomaly detectors) does not fit any parameters to member/non-member labels and then rename those fits as predictions. No self-citation chain is invoked to justify uniqueness or to smuggle in an ansatz. The method is fully specified by the described extraction and detection steps without reducing to its own inputs by construction. The distributional concern raised by the skeptic (gibberish vs. real non-member text) is a question of empirical validity, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method implicitly assumes the existence of a stable cross-modal latent space that preserves membership information.

pith-pipeline@v0.9.0 · 5553 in / 1046 out tokens · 42942 ms · 2026-05-15T12:05:40.616680+00:00 · methodology


Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 2 internal anchors

  1. [1]

    Multimodal contrastive training for visual representation learning,

    X. Yuan, Z. Lin, J. Kuen, J. Zhang, Y. Wang, M. Maire, A. Kale, and B. Faieta, “Multimodal contrastive training for visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6995–7004

  2. [2]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763

  3. [3]

    Clap learning audio concepts from natural language supervision,

    B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  4. [4]

    Ecoalign: An economically rational framework for efficient lvlm alignment,

    R. Cheng, H. Ma, T. Ma, and H. Zhang, “Ecoalign: An economically rational framework for efficient lvlm alignment,” arXiv preprint arXiv:2511.11301, 2025

  5. [5]

    The pii problem: Privacy and a new concept of personally identifiable information,

    P. M. Schwartz and D. J. Solove, “The pii problem: Privacy and a new concept of personally identifiable information,” NYU Law Review, vol. 86, p. 1814, 2011

  6. [6]

    When better features mean greater risks: The performance-privacy trade-off in contrastive learning,

    R. Sun, H. Hu, W. Luo, Z. Zhang, Y. Zhang, H. Yuan, and L. Y. Zhang, “When better features mean greater risks: The performance-privacy trade-off in contrastive learning,” in Proceedings of the 20th ACM Asia Conference on Computer and Communications Security, 2025, pp. 488–500

  7. [7]

    Defending pre-trained language models as few-shot learners against backdoor attacks,

    Z. Xi, T. Du, C. Li, R. Pang, S. Ji, J. Chen, F. Ma, and T. Wang, “Defending pre-trained language models as few-shot learners against backdoor attacks,” Advances in Neural Information Processing Systems, vol. 36, 2024

  8. [8]

    Defenses to membership inference attacks: A survey,

    L. Hu, A. Yan, H. Yan, J. Li, T. Huang, Y. Zhang, C. Dong, and C. Yang, “Defenses to membership inference attacks: A survey,” ACM Computing Surveys, vol. 56, no. 4, pp. 1–34, 2023

  9. [9]

    Selfprompt: Autonomously evaluating llm robustness via domain-constrained knowledge guidelines and refined adversarial prompts,

    A. Pei, Z. Yang, S. Zhu, R. Cheng, and J. Jia, “Selfprompt: Autonomously evaluating llm robustness via domain-constrained knowledge guidelines and refined adversarial prompts,” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 6840–6854

  10. [10]

    Strata-sword: A hierarchical safety evaluation towards llms based on reasoning complexity of jailbreak instructions,

    S. Zhao, R. Duan, J. Liu, X. Jia, F. Wang, C. Wei, R. Cheng, Y. Xie, C. Liu, Q. Guo et al., “Strata-sword: A hierarchical safety evaluation towards llms based on reasoning complexity of jailbreak instructions,” arXiv preprint arXiv:2509.01444, 2025

  11. [11]

    Privacy-enhanced federated learning against attribute inference attack for speech emotion recognition,

    H. Zhao, H. Chen, Y. Xiao, and Z. Zhang, “Privacy-enhanced federated learning against attribute inference attack for speech emotion recognition,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

  12. [12]

    Tuni: A textual unimodal detector for identity inference in clip models,

    S. Li, R. Cheng, and X. Jia, “Tuni: A textual unimodal detector for identity inference in clip models,” in Proceedings of the Sixth Workshop on Privacy in Natural Language Processing, 2025, pp. 1–13

  13. [13]

    Membership inference attacks against machine learning models,

    R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 3–18

  14. [14]

    Students parrot their teachers: Membership inference on model distillation,

    M. Jagielski, M. Nasr, K. Lee, C. A. Choquette-Choo, N. Carlini, and F. Tramer, “Students parrot their teachers: Membership inference on model distillation,” Advances in Neural Information Processing Systems, vol. 36, 2024

  15. [15]

    Practical membership inference attacks against large-scale multi-modal models: A pilot study,

    M. Ko, M. Jin, C. Wang et al., “Practical membership inference attacks against large-scale multi-modal models: A pilot study,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4871–4881

  16. [16]

    Membership inference attack using self influence functions,

    G. Cohen and R. Giryes, “Membership inference attack using self influence functions,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 4892–4901

  17. [17]

    Membership inference attacks with token-level deduplication on korean language models,

    M. G. Oh, L. H. Park, J. Kim, J. Park, and T. Kwon, “Membership inference attacks with token-level deduplication on korean language models,” IEEE Access, vol. 11, pp. 10,207–10,217, 2023

  18. [18]

    M4i: Multi-modal models membership inference,

    P. Hu, Z. Wang, R. Sun, H. Wang, and M. Xue, “M4i: Multi-modal models membership inference,” Advances in Neural Information Processing Systems, vol. 35, pp. 1867–1882, 2022

  19. [19]

    Multimodal unlearnable examples: Protecting data against multimodal contrastive learning,

    X. Liu, X. Jia, Y. Xun, S. Liang, and X. Cao, “Multimodal unlearnable examples: Protecting data against multimodal contrastive learning,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 8024–8033

  20. [20]

    A closer look at the explainability of contrastive language-image pre-training,

    Y. Li, H. Wang, Y. Duan, J. Zhang, and X. Li, “A closer look at the explainability of contrastive language-image pre-training,” Pattern Recognition, vol. 162, p. 111409, 2025

  21. [21]

    Supervised contrastive learning,

    P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 18,661–18,673, 2020

  22. [22]

    Collap: Contrastive long-form language-audio pretraining with musical temporal structure augmentation,

    J. Wu, W. Li, Z. Novack, A. Namburi, C. Chen, and J. McAuley, “Collap: Contrastive long-form language-audio pretraining with musical temporal structure augmentation,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  23. [23]

    Construction safety inspection with contrastive language-image pre-training (clip) image captioning and attention,

    W.-L. Tsai, P.-L. Le, W.-F. Ho, N.-W. Chi, J. J. Lin, S. Tang, and S.-H. Hsieh, “Construction safety inspection with contrastive language-image pre-training (clip) image captioning and attention,” Automation in Construction, vol. 169, p. 105863, 2025

  24. [24]

    Supervised contrastive pre-training models for mammography screening,

    Z. Cao, Z. Deng, Z. Yang, J. Ma, and L. Ma, “Supervised contrastive pre-training models for mammography screening,” Journal of Big Data, vol. 12, no. 1, p. 24, 2025

  25. [25]

    Contrastive pretraining improves deep learning classification of endocardial electrograms in a preclinical model,

    B. Hunt, E. Kwan, J. Bergquist, J. Brundage, B. Orkild, J. Dong, E. Paccione, K. Yazaki, R. S. MacLeod, D. J. Dosdall et al., “Contrastive pretraining improves deep learning classification of endocardial electrograms in a preclinical model,” Heart Rhythm O2, vol. 6, no. 4, pp. 473–480, 2025

  26. [26]

    Audiotime: A temporally-aligned audio-text benchmark dataset,

    Z. Xie, X. Xu, Z. Wu, and M. Wu, “Audiotime: A temporally-aligned audio-text benchmark dataset,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  27. [27]

    Mixed differential privacy in computer vision,

    A. Golatkar, A. Achille, Y.-X. Wang, A. Roth, M. Kearns, and S. Soatto, “Mixed differential privacy in computer vision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8376–8386

  28. [28]

    Advancing object detection in transportation with multimodal large language models (mllms): A comprehensive review and empirical testing,

    H. I. Ashqar, A. Jaber, T. I. Alhadidi, and M. Elhenawy, “Advancing object detection in transportation with multimodal large language models (mllms): A comprehensive review and empirical testing,” Computation, vol. 13, no. 6, p. 133, 2025

  29. [29]

    Pbi-attack: Prior-guided bimodal interactive black-box jailbreak attack for toxicity maximization,

    R. Cheng, Y. Ding, S. Cao, R. Duan, X. Jia, S. Yuan, S. Qin, Z. Wang, and X. Jia, “Pbi-attack: Prior-guided bimodal interactive black-box jailbreak attack for toxicity maximization,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 609–628

  30. [30]

    Steering the Verifiability of Multimodal AI Hallucinations

    J. Pang, R. Cheng, Z. Ye, X. Ma, Z. Wu, X. Huang, and Y.-G. Jiang, “Steering the verifiability of multimodal ai hallucinations,” arXiv preprint arXiv:2604.06714, 2026

  31. [31]

    Pixclip: Achieving fine-grained visual language understanding via any-granularity pixel-text alignment learning,

    Y. Xiao, Y. Chen, H. Ma, J. Hong, C. Li, L. Wu, H. Guo, and J. Wang, “Pixclip: Achieving fine-grained visual language understanding via any-granularity pixel-text alignment learning,” arXiv preprint arXiv:2511.04601, 2025

  32. [32]

    Protecting privacy in multimodal large language models with mllmu-bench,

    Z. Liu, G. Dou, M. Jia, Z. Tan, Q. Zeng, Y. Yuan, and M. Jiang, “Protecting privacy in multimodal large language models with mllmu-bench,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 4105–4135

  33. [33]

    Privacy-preserving personalized federated prompt learning for multimodal large language models,

    L. Tran, W. Sun, S. Patterson, and A. Milanova, “Privacy-preserving personalized federated prompt learning for multimodal large language models,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=Equ277PBN0

  34. [34]

    Propile: Probing privacy leakage in large language models,

    S. Kim, S. Yun, H. Lee et al., “Propile: Probing privacy leakage in large language models,” in Advances in Neural Information Processing Systems, vol. 36, 2024

  35. [35]

    I never willingly consented to this! investigate pii leakage via sso logins,

    T.-H. Pham, Q.-H. Vo, H. Dao, and K. Fukuda, “I never willingly consented to this! investigate pii leakage via sso logins,” IEEE Transactions on Privacy, 2025

  36. [36]

    Membership inference attacks as privacy tools: Reliability, disparity and ensemble,

    Z. Wang, C. Zhang, Y. Chen, N. Baracaldo, S. R. Kadhe, and L. Yu, “Membership inference attacks as privacy tools: Reliability, disparity and ensemble,” in Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, 2025, pp. 1724–1738

  37. [37]

    Reinforcement learning from multi-role debates as feedback for bias mitigation in llms,

    R. Cheng, H. Ma, S. Cao, J. Li, A. Pei, Z. Wang, P. Ji, H. Wang, and J. Huo, “Reinforcement learning from multi-role debates as feedback for bias mitigation in llms,” arXiv preprint arXiv:2404.10160, 2024

  38. [38]

    Oyster-i: Beyond refusal–constructive safety alignment for responsible language models,

    R. Duan, J. Liu, X. Jia, S. Zhao, R. Cheng, F. Wang, C. Wei, Y. Xie, C. Liu, D. Li et al., “Oyster-i: Beyond refusal–constructive safety alignment for responsible language models,” arXiv preprint arXiv:2509.01909, 2025

  39. [39]

    Agr: Age group fairness reward for bias mitigation in llms,

    S. Cao, R. Cheng, and Z. Wang, “Agr: Age group fairness reward for bias mitigation in llms,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  40. [40]

    Inverse reinforcement learning with dynamic reward scaling for llm alignment,

    R. Cheng, H. Ma, W. Wang, R. Duan, J. Liu, X. Jia, S. Qin, X. Cao, Y. Liu, and X. Jia, “Inverse reinforcement learning with dynamic reward scaling for llm alignment,” arXiv preprint arXiv:2503.18991, 2025

  41. [41]

    Use the spear as a shield: An adversarial example based privacy-preserving technique against membership inference attacks,

    M. Xue, C. Yuan, C. He, Y. Wu, Z. Wu, Y. Zhang, Z. Liu, and W. Liu, “Use the spear as a shield: An adversarial example based privacy-preserving technique against membership inference attacks,” IEEE Transactions on Emerging Topics in Computing, vol. 11, no. 1, pp. 153–169, 2023

  42. [42]

    Does clip know my face?

    D. Hintersdorf, L. Struppek, M. Brack, F. Friedrich, P. Schramowski, and K. Kersting, “Does clip know my face?” Journal of Artificial Intelligence Research, vol. 80, pp. 1033–1062, 2024

  43. [43]

    Variance-based membership inference attacks against large-scale image captioning models,

    D. Samira, E. Habler, Y. Elovici, and A. Shabtai, “Variance-based membership inference attacks against large-scale image captioning models,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 9210–9219

  44. [44]

    Range membership inference attacks,

    J. Tao and R. Shokri, “Range membership inference attacks,” in 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 2025, pp. 346–361

  45. [45]

    Deepface: Closing the gap to human-level performance in face verification,

    Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708

  46. [46]

    The megaface benchmark: 1 million faces for recognition at scale,

    I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, “The megaface benchmark: 1 million faces for recognition at scale,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4873–4882

  47. [47]

    Laion-5b: An open large-scale dataset for training next generation image-text models,

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in Neural Information Processing Systems, vol. 35, pp. 25,278–25,294, 2022

  48. [48]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,

    S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3558–3568

  49. [49]

    Librispeech: An asr corpus based on public domain audio books,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

  50. [51]

    Common Voice: A massively-multilingual speech corpus

    [Online]. Available: http://arxiv.org/abs/1912.06670

  51. [52]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

  52. [53]

    Detecting affect states using vgg16, resnet50 and se-resnet50 networks,

    D. Theckedath and R. Sedamkar, “Detecting affect states using vgg16, resnet50 and se-resnet50 networks,” SN Computer Science, vol. 1, no. 2, p. 79, 2020

  53. [54]

    Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,

    K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 646–650

  54. [55]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10,012–10,022

  55. [56]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019

  56. [57]

    Lightface: A hybrid deep face recognition framework,

    S. I. Serengil and A. Ozpinar, “Lightface: A hybrid deep face recognition framework,” in 2020 Innovations in Intelligent Systems and Applications Conference (ASYU). IEEE, 2020, pp. 23–27. [Online]. Available: https://doi.org/10.1109/ASYU50717.2020.9259802

  57. [58]

    Membership inference attacks from first principles,

    N. Carlini, S. Chien, M. Nasr, S. Song, A. Terzis, and F. Tramer, “Membership inference attacks from first principles,” in 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 1897–1914

  58. [59]

    The audio auditor: User-level membership inference in internet of things voice services,

    Y. Miao, M. Xue, C. Chen, L. Pan, J. Zhang, B. Z. H. Zhao, D. Kaafar, and Y. Xiang, “The audio auditor: User-level membership inference in internet of things voice services,” Proceedings on Privacy Enhancing Technologies, vol. 1, pp. 209–228, 2021

  59. [60]

    Exploring features for membership inference in asr model auditing,

    F. Teixeira, K. Pizzi, R. Olivier, A. Abad, B. Raj, and I. Trancoso, “Exploring features for membership inference in asr model auditing,” Computer Speech & Language, p. 101812, 2025

  60. [61]

    Slmia-sr: Speaker-level membership inference attacks against speaker recognition systems,

    G. Chen, Y. Zhang, and F. Song, “Slmia-sr: Speaker-level membership inference attacks against speaker recognition systems,” in Proceedings of the 31st Annual Network and Distributed System Security (NDSS) Symposium, 2024

  61. [62]

    Outlier detection using isolation forest and local outlier factor,

    Z. Cheng, C. Zou, and J. Dong, “Outlier detection using isolation forest and local outlier factor,” in Proceedings of the Conference on Research in Adaptive and Convergent Systems, 2019, pp. 161–168

  62. [63]

    Isolation forest,

    F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422

  63. [64]

    Improving one-class svm for anomaly detection,

    K.-L. Li, H.-K. Huang, S.-F. Tian, and W. Xu, “Improving one-class svm for anomaly detection,” in Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 03EX693), vol. 5. IEEE, 2003, pp. 3077–3081

  64. [65]

    One-class classification: taxonomy of study and review of techniques,

    S. S. Khan and M. G. Madden, “One-class classification: taxonomy of study and review of techniques,” The Knowledge Engineering Review, vol. 29, no. 3, pp. 345–374, 2014

  65. [66]

    Autoencoder-based network anomaly detection,

    Z. Chen, C. K. Yeo, B. S. Lee, and C. T. Lau, “Autoencoder-based network anomaly detection,” in 2018 Wireless Telecommunications Symposium (WTS). IEEE, 2018, pp. 1–5

Internal anchors [67]–[70] (appendix fragments)

    Table V and Table VI present examples of randomly generated gibberish and covert gibberish that mimics authentic names, respectively.

    Geometric separation of member and non-member statistics:

    Member: $S_\infty(t_{\mathrm{in}}) \ge \gamma_{\mathrm{in}} - 2\delta^\star$ and $D^2_\infty(t_{\mathrm{in}}) \le 2\delta^\star + 3\rho_d\delta^\star \approx 0$.

    Non-member: $|S_\infty(t_{\mathrm{out}})| \le O(d^{-1/2}) \approx 0$ and $D^2_\infty(t_{\mathrm{out}}) \ge 1 - \tfrac{1}{M} - \rho_d \approx 1$.

    Proof (member). $S_\infty = \sum_y p_y v_{\mathrm{in}}^\top \mu_y = p_{y^\star} v_{\mathrm{in}}^\top \mu_{y^\star} + \sum_{y \ne y^\star} p_y v_{\mathrm{in}}^\top \mu_y$. Using $p_{y^\star} \ge 1 - \delta^\star$, $v_{\mathrm{in}}^\top \mu_{y^\star} \ge \gamma_{\mathrm{in}}$, and the trivial bound $|v^\top \mu| \le 1$, we get $S_\infty \ge (1 - \delta^\star)\gamma_{\mathrm{in}} - \delta^\star \approx \gamma_{\mathrm{in}}$. For dispersion, $D^2_\infty = 1 - \lVert p_{y^\star}\mu_{y^\star} + \sum_{y \ne y^\star} p_y \mu_y \rVert^2$; the cross-terms are bounded by $\rho_d$, and the dominant term is $1 - p_{y^\star}^2 \approx 1 - (1 - \delta^\star)^2 \approx 2\delta^\star$.

    Proof (non-member). $S_\infty = v_{\mathrm{out}}^\top m(t_{\mathrm{out}})$. Since $v_{\mathrm{out}}$ is isotropic and independent of $m(t_{\mathrm{out}})$, $S_\infty$ concentrates around $0$ at rate $d^{-1/2}$ (Assumption A.2). For dispersion, $\lVert m(t_{\mathrm{out}}) \rVert_2^2 = \lVert \sum_y p_y \mu_y \rVert_2^2 = \sum_y p_y^2 + \sum_{y \ne z} p_y p_z \mu_y^\top \mu_z$; with $p_y \approx 1/M$, $\sum_y p_y^2 \approx 1/M$, and the cross-terms are bounded by $\rho_d$ …

    The decision thresholds are the midpoints $s_{\mathrm{thr}} = \tfrac{1}{2}\big(S_\infty(t_{\mathrm{in}}) + S_\infty(t_{\mathrm{out}})\big)$ and $d^2_{\mathrm{thr}} = \tfrac{1}{2}\big(D^2_\infty(t_{\mathrm{in}}) + D^2_\infty(t_{\mathrm{out}})\big)$. (2) Concentration of empirical statistics: the empirical statistics must concentrate around their population means within a radius smaller than $\Gamma/2$. First, consider the optimization localization; let $E_{\mathrm{opt}}(t)$ be the event th…
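A toy numerical check of this separation, with every value ($d$, $M$, $\delta^\star$, $\gamma_{\mathrm{in}}$) chosen for illustration rather than taken from the paper: member queries should land near $(\gamma_{\mathrm{in}}, 2\delta^\star)$ in the $(S_\infty, D^2_\infty)$ plane and non-members near $(0, 1 - 1/M)$, leaving a wide margin for the midpoint thresholds.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, delta, gamma = 512, 50, 0.05, 0.8  # illustrative values, not from the paper

# Near-orthogonal unit prototypes mu_y: random directions in high dimension.
mu = rng.standard_normal((M, d))
mu /= np.linalg.norm(mu, axis=1, keepdims=True)

def stats(p, v):
    """S_inf = v . m and D^2_inf = 1 - ||m||^2 for the mixture m = sum_y p_y mu_y."""
    m = p @ mu
    return float(v @ m), float(1.0 - m @ m)

# Member query: mixture mass concentrated on one prototype (p_{y*} = 1 - delta),
# with the inversion direction v_in aligned to that prototype at level gamma.
p_in = np.full(M, delta / (M - 1))
p_in[0] = 1.0 - delta
noise = rng.standard_normal(d)
noise -= (noise @ mu[0]) * mu[0]  # orthogonalize the noise against mu_{y*}
v_in = gamma * mu[0] + np.sqrt(1 - gamma**2) * noise / np.linalg.norm(noise)

# Non-member query: uniform mixture, isotropic (independent) inversion direction.
p_out = np.full(M, 1.0 / M)
v_out = rng.standard_normal(d)
v_out /= np.linalg.norm(v_out)

print("member     (S, D^2):", stats(p_in, v_in))    # ~ (gamma, 2*delta)
print("non-member (S, D^2):", stats(p_out, v_out))  # ~ (0, 1 - 1/M)
```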