Boundary-targeted Membership Inference Attacks on Safety Classifiers
Pith reviewed 2026-05-22 08:03 UTC · model grok-4.3
The pith
Targeting low-confidence examples lets adversaries recover 19 percent of distress conversations from safety classifier training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Identifying low-confidence boundary examples amplifies the membership signal in safety classifiers because these points reflect localized memorization rather than broad generalization. On a classifier fine-tuned to detect users who may need emotional support, the attack recovers 19 percent of the conversations flagged as indicating user distress at a 5 percent false-positive rate, which is 3.5 times higher than attacks that use only state-of-the-art membership inference methods.
What carries the argument
The boundary-targeted selection strategy, which isolates low-confidence examples to strengthen the membership inference signal.
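The selection step can be pictured with a minimal sketch. It assumes a binary classifier whose confidence is max(p, 1 - p), and it keeps the fraction of examples closest to the decision boundary. The function name `select_boundary_examples` and the `fraction` parameter are hypothetical; the summary does not describe the paper's actual selection rule.

```python
import numpy as np

def select_boundary_examples(probs, fraction=0.1):
    """Return indices of the `fraction` of examples with the lowest confidence.

    probs: predicted probability of the positive class, shape (n,).
    This is an illustrative rule, not the paper's published procedure.
    """
    probs = np.asarray(probs, dtype=float)
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Binary confidence: 0.5 means fully uncertain, 1.0 means fully certain.
    confidence = np.maximum(probs, 1.0 - probs)
    k = max(1, int(round(fraction * len(probs))))
    # Stable sort so that ties keep their original order.
    order = np.argsort(confidence, kind="stable")
    return order[:k]

probs = np.array([0.98, 0.52, 0.10, 0.45, 0.70, 0.03])
boundary_idx = select_boundary_examples(probs, fraction=0.5)
```

A membership inference attack would then be run only on the examples in `boundary_idx`, rather than on the full candidate set.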
If this is right
- Content-based filtering leaves the boundary examples unprotected against membership inference.
- Existing noise-injection defenses can lower the attack success rate on these low-confidence points.
- Standard membership inference methods alone recover far fewer training examples than the boundary-targeted variant.
- The attack works on classifiers fine-tuned specifically for emotional-support detection.
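The summary does not say which noise strategy the paper evaluates. One widely used option is DP-SGD-style training, which clips each per-example gradient and adds Gaussian noise before averaging. The sketch below assumes that technique purely for illustration; `dp_noisy_mean_gradient`, `clip_norm`, and `noise_multiplier` are hypothetical names.

```python
import numpy as np

def dp_noisy_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to L2 norm <= clip_norm, add Gaussian noise, average."""
    grads = np.asarray(per_example_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale factor is 1 for small gradients and clip_norm / norm for large ones.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Noise standard deviation is proportional to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(grads)

grads = np.array([[3.0, 4.0], [0.3, 0.4]])
noiseless = dp_noisy_mean_gradient(grads, clip_norm=1.0, noise_multiplier=0.0,
                                   rng=np.random.default_rng(0))
noisy = dp_noisy_mean_gradient(grads, clip_norm=1.0, noise_multiplier=1.0,
                               rng=np.random.default_rng(0))
```

Clipping bounds how much any single conversation can move the model, which is the property that would plausibly matter most for the low-confidence examples the attack targets.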
Where Pith is reading between the lines
- Similar boundary selection may increase attack success on other fine-tuned models that process sensitive user text.
- Model developers could add a pre-deployment check that measures how much low-confidence examples drive memorization.
- The same targeting approach might apply to generative models that produce responses conditioned on distress signals.
Load-bearing premise
Low-confidence examples indicate localized memorization of training data rather than a broad failure of the model to generalize.
What would settle it
Train an identical safety classifier on data that deliberately excludes the identified low-confidence boundary examples and measure whether the attack's true-positive rate at 5 percent false positives falls close to random guessing.
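The operating-point metric in this test, true-positive rate at a fixed false-positive rate, can be computed as in the sketch below. It assumes that higher attack scores mean "more likely a member"; the helper name `tpr_at_fpr` and the toy scores are illustrative only.

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.05):
    """Fraction of members flagged when at most `target_fpr` of non-members are flagged."""
    members = np.asarray(member_scores, dtype=float)
    nonmembers = np.sort(np.asarray(nonmember_scores, dtype=float))[::-1]  # descending
    # Number of non-members that may be flagged without exceeding the target FPR.
    allowed = int(np.floor(target_fpr * len(nonmembers) + 1e-9))
    # Flag only scores strictly above the (allowed + 1)-th largest non-member score.
    threshold = nonmembers[allowed] if allowed < len(nonmembers) else -np.inf
    return float(np.mean(members > threshold))

# Toy scores, not taken from the paper.
nonmember_scores = np.arange(20)
member_scores = np.array([25, 22, 19, 10, 5, 19, 1, 2, 12, 17])
attack_tpr = tpr_at_fpr(member_scores, nonmember_scores)
# When member scores are distributed like non-member scores, TPR equals the FPR.
chance_tpr = tpr_at_fpr(nonmember_scores, nonmember_scores)
```

"Close to random guessing" in the test above corresponds to a true-positive rate near the false-positive rate itself, here 5 percent.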
Original abstract
Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, raising important, yet poorly understood, privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer membership of examples used to train models. In this work, we hypothesize that identifying the examples on which the classifier is least confident is informative for an adversary to infer membership. This reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low-confidence examples that amplify the signal of an example's membership within a training set. Our experimental results show that an adversary can recover 19% of the conversations a safety classifier flagged as indicating user distress, at a 5% false-positive rate, on a classifier fine-tuned for detecting a user who may require emotional support. This is 3.5 times more than attacking using state-of-the-art MIA methods alone. Finally, we characterize the boundary-lying examples and show that content-based filtering is ineffective for protection, and existing noise strategies can effectively mitigate susceptibility of these examples.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a boundary-targeted membership inference attack on safety classifiers for generative AI systems. It hypothesizes that low-confidence examples reflect localized memorization of ambiguous training data (e.g., distress conversations) rather than a broad failure to generalize. The authors introduce a selection strategy to identify such boundary examples and demonstrate that this yields a 19% true positive rate at a 5% false positive rate on a fine-tuned emotional support classifier, a 3.5x improvement over standard state-of-the-art MIA baselines. The work further characterizes these examples and evaluates simple mitigation approaches such as content filtering and noise addition.
Significance. If the empirical gains are shown to arise from a genuine membership signal rather than distribution effects, the result would be significant for privacy analysis of safety classifiers trained on sensitive mental-health data. The concrete 3.5x improvement and the focus on low-confidence regions provide a practical advance over generic MIA methods. The paper's empirical framing with specific operating-point metrics is a strength; however, the central claim rests on an untested premise about memorization versus generalization that requires explicit controls to be fully convincing.
major comments (2)
- [§4 and §5] §4 (Experimental Setup) and §5 (Results): The reported 19% TPR at 5% FPR and 3.5x improvement are presented without accompanying details on dataset construction, the sampling procedure for non-member hold-outs relative to the training distribution, model architecture, or statistical significance testing. This omission makes it impossible to evaluate whether the boundary-targeted gain is robust or an artifact of distribution mismatch between members and non-members.
- [§3.1] §3.1 (Hypothesis and Method): The boundary-targeted strategy is motivated by the claim that low confidence signals localized memorization of ambiguous distress examples. No ablation or control experiment is described that isolates this from the alternative that low confidence arises from inherent ambiguity or out-of-distribution inputs independent of membership. Without such a control, the performance improvement cannot be confidently attributed to the proposed mechanism.
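A control of the kind requested in the second comment could be organized as below: evaluate a fixed attack decision on the lowest-confidence subset, on the highest-confidence subset, and on a random subset of the same size. This is only a sketch of one possible design; `subset_rates`, `confidence_ablation`, and the toy arrays are hypothetical and do not come from the paper.

```python
import numpy as np

def subset_rates(flagged, is_member, idx):
    """TPR and FPR of a fixed attack decision, restricted to the examples in idx."""
    f, m = flagged[idx], is_member[idx]
    tpr = float(f[m].mean()) if m.any() else float("nan")
    fpr = float(f[~m].mean()) if (~m).any() else float("nan")
    return tpr, fpr

def confidence_ablation(confidence, flagged, is_member, k, seed=0):
    """Compare attack success on low-confidence, high-confidence, and random subsets."""
    order = np.argsort(confidence, kind="stable")
    rng = np.random.default_rng(seed)
    subsets = {
        "low_confidence": order[:k],
        "high_confidence": order[-k:],
        "random": rng.choice(len(confidence), size=k, replace=False),
    }
    return {name: subset_rates(flagged, is_member, idx) for name, idx in subsets.items()}

# Toy data: classifier confidence, true membership, and the attack's decision.
confidence = np.array([0.51, 0.55, 0.60, 0.62, 0.90, 0.93, 0.97, 0.99])
is_member = np.array([True, True, False, False, True, True, False, False])
flagged = np.array([True, True, False, False, True, False, False, False])
rates = confidence_ablation(confidence, flagged, is_member, k=4)
```

If low confidence arose only from inherent ambiguity, members and non-members in the low-confidence subset would be flagged at similar rates; a gap between the subset's TPR and FPR is what would point to a membership signal.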
minor comments (2)
- The abstract and introduction would benefit from a brief statement of the exact datasets and classifier architecture used, even at high level, to allow readers to assess generalizability.
- Figure captions and axis labels should explicitly state the operating point (5% FPR) and the baseline methods being compared to avoid ambiguity in interpreting the 3.5x claim.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and have revised the manuscript to improve experimental transparency and strengthen the link between our hypothesis and the observed results.
- Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): The reported 19% TPR at 5% FPR and 3.5x improvement are presented without accompanying details on dataset construction, the sampling procedure for non-member hold-outs relative to the training distribution, model architecture, or statistical significance testing. This omission makes it impossible to evaluate whether the boundary-targeted gain is robust or an artifact of distribution mismatch between members and non-members.
  Authors: We agree that the original submission lacked sufficient detail on these points. In the revised manuscript we have expanded §4 to describe the full dataset construction process, including the sources of the distress-labeled conversations and the procedure used to sample non-member hold-outs from the same underlying distribution as the training data. We now specify the model architecture (a fine-tuned transformer-based classifier) and include statistical significance testing via bootstrap resampling to confirm that the 3.5× improvement is robust. These additions directly address the concern about possible distribution mismatch.
  Revision: yes
- Referee: [§3.1] §3.1 (Hypothesis and Method): The boundary-targeted strategy is motivated by the claim that low confidence signals localized memorization of ambiguous distress examples. No ablation or control experiment is described that isolates this from the alternative that low confidence arises from inherent ambiguity or out-of-distribution inputs independent of membership. Without such a control, the performance improvement cannot be confidently attributed to the proposed mechanism.
  Authors: We acknowledge that an explicit control isolating memorization from inherent ambiguity or OOD effects would strengthen the causal claim. In the revision we have added an ablation in §3.1 and §5 that compares membership-inference success on low-confidence boundary examples versus both high-confidence examples and randomly sampled examples drawn from the same distribution. We also include a characterization of the boundary examples showing that their ambiguity is tied to training-set-specific phrasing rather than generic distress language. While these additions provide supporting evidence for the memorization hypothesis, we note that a perfect isolation of memorization from all other sources of low confidence remains difficult without white-box access to the training dynamics.
  Revision: partial
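The bootstrap test mentioned in the first response could look roughly like the following paired percentile bootstrap on the ratio of true-positive rates. The sketch makes a simplifying assumption: each attack's threshold is treated as fixed at the 5% FPR operating point, so only members are resampled. `bootstrap_ratio_ci` and the synthetic hit vectors are hypothetical and are not the paper's data.

```python
import numpy as np

def bootstrap_ratio_ci(attack_hits, baseline_hits, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for TPR(attack) / TPR(baseline) over resampled members.

    Each array holds one 0/1 entry per member: 1 if that member was recovered.
    """
    attack_hits = np.asarray(attack_hits, dtype=float)
    baseline_hits = np.asarray(baseline_hits, dtype=float)
    if attack_hits.shape != baseline_hits.shape:
        raise ValueError("hits must be paired, one entry per member")
    rng = np.random.default_rng(seed)
    n = len(attack_hits)
    ratios = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample members with replacement
        base_tpr = baseline_hits[idx].mean()
        if base_tpr > 0:  # the ratio is undefined when the baseline recovers nothing
            ratios.append(attack_hits[idx].mean() / base_tpr)
    low, high = np.percentile(ratios, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(low), float(high)

# Synthetic outcomes for 1000 members, chosen only to illustrate the mechanics.
attack_hits = np.zeros(1000, dtype=bool)
attack_hits[:200] = True
baseline_hits = np.zeros(1000, dtype=bool)
baseline_hits[:50] = True
point_ratio = attack_hits.mean() / baseline_hits.mean()
low, high = bootstrap_ratio_ci(attack_hits, baseline_hits)
```

An interval whose lower end stays well above 1 would support the claim that the improvement is not a sampling artifact; a fuller analysis would also resample non-members so that threshold uncertainty is reflected.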
Circularity Check
No circularity: empirical MIA evaluation is self-contained
full rationale
The paper introduces a boundary-targeted selection strategy for membership inference on safety classifiers and reports empirical results (19% TPR at 5% FPR, 3.5x over SOTA baselines). The hypothesis that low-confidence examples reflect localized memorization is presented as motivation and tested via experiments rather than used to define or force any result by construction. No equations, parameter fits renamed as predictions, or load-bearing self-citations appear in the derivation chain; the performance claims rest on direct comparison against external baselines on held-out data. This is a standard empirical evaluation with no reduction of outputs to inputs by definition.
Table 5: Sequence lengths for classifier fine-tuning.
Hyperparameter: Sequence length
- Single-turn (BeaverTails): 1024
- Multi-turn (XGuard): 8192
- Multi-session (Emotional Support): 16394
- Pooled: 16394
K Compute
All experiments were conducted on a single compute node equipped with 4× NVIDIA H100 NVL GPUs (94 GB VRAM each), an AMD EPYC 9454 48-Core CPU, and 7...