Boundary-targeted Membership Inference Attacks on Safety Classifiers

Adam Perer; Alexander Goldberg; Anthony Hughes; Nikolaos Aletras; Niloofar Mireshghallah; Prince Jha

arxiv: 2605.22373 · v1 · pith:7G47O2GGnew · submitted 2026-05-21 · 💻 cs.LG · cs.CL

Boundary-targeted Membership Inference Attacks on Safety Classifiers

Anthony Hughes , Alexander Goldberg , Prince Jha , Adam Perer , Nikolaos Aletras , Niloofar Mireshghallah This is my paper

Pith reviewed 2026-05-22 08:03 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords membership inferencesafety classifiersprivacy attacksuser distress detectionmachine learning securityadversarial robustness

0 comments

The pith

Targeting low-confidence examples lets adversaries recover 19 percent of distress conversations from safety classifier training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that safety classifiers leak training membership through examples where the model has the least confidence. It introduces a boundary-targeted selection method to pick those examples and feed them into membership inference attacks. A sympathetic reader would care because these classifiers are trained on private conversations about self-harm and mental health, so better attacks expose real privacy risks in AI safety tools. Experiments demonstrate the new strategy recovers 3.5 times more examples than prior methods at a 5 percent false-positive rate.

Core claim

Identifying low-confidence boundary examples amplifies the membership signal in safety classifiers because these points reflect localized memorization rather than broad generalization. On a classifier fine-tuned to detect users who may need emotional support, the attack recovers 19 percent of the conversations flagged as indicating user distress at a 5 percent false-positive rate, which is 3.5 times higher than attacks that use only state-of-the-art membership inference methods.

What carries the argument

The boundary-targeted selection strategy, which isolates low-confidence examples to strengthen the membership inference signal.

If this is right

Content-based filtering leaves the boundary examples unprotected against membership inference.
Existing noise-injection defenses can lower the attack success rate on these low-confidence points.
Standard membership inference methods alone recover far fewer training examples than the boundary-targeted variant.
The attack works on classifiers fine-tuned specifically for emotional-support detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar boundary selection may increase attack success on other fine-tuned models that process sensitive user text.
Model developers could add a pre-deployment check that measures how much low-confidence examples drive memorization.
The same targeting approach might apply to generative models that produce responses conditioned on distress signals.

Load-bearing premise

Low-confidence examples indicate localized memorization of training data rather than a broad failure of the model to generalize.

What would settle it

Train an identical safety classifier on data that deliberately excludes the identified low-confidence boundary examples and measure whether the attack's true-positive rate at 5 percent false positives falls close to random guessing.

Figures

Figures reproduced from arXiv: 2605.22373 by Adam Perer, Alexander Goldberg, Anthony Hughes, Nikolaos Aletras, Niloofar Mireshghallah, Prince Jha.

**Figure 2.** Figure 2: MIA performance on LiRA and boundary-targeted LiRA across model scales and training [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: (Left) Boundary-targeted LiRA MI-AUC as a function of the classifier’s true-label confidence PS(yi | xi), binned into deciles. Each line corresponds to a model. Error bars span the min and max across the two classifiers. (Right) Boundary-targeted LiRA MI-AUC across the harm categories assigned to BeaverTails. Bars show the mean MI-AUC averaged over all training regimes for each model. (Both) A dashed grey … view at source ↗

**Figure 4.** Figure 4: (Left) t-SNE projection of the fine-tuned classifier’s hidden-state representations (Llama3.2-1-8B-IT under full fine-tuning on single-turn data). Red triangles denote boundary members (training set), blue circles denote boundary non-members, and grey points denote randomly sampled non-boundary examples. (Right) Privacy–utility trade-off under Laplace output perturbation. Each curve traces a single model … view at source ↗

read the original abstract

Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, raising important, yet poorly understood, privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer membership of examples used to train models. In this work, we hypothesize that identifying the examples on which the classifier is least confident are informative for an adversary to infer membership. This reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low confidence examples that amplify the signal of an examples membership within a training set. Our experimental results show that an adversary can recover 19\% of the conversations a safety classifier flagged as indicating user distress, at a 5\% false-positive rate, on a classifier fine-tuned for detecting a user who may require emotional support. This is $3.5$ times more than attacking using state-of-the-art MIA methods alone. Finally, we characterize the boundary laying examples and show that content-based filtering is ineffective for protection, and existing noise strategies can effectively mitigate susceptibility of these examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a boundary-targeted membership inference attack on safety classifiers for generative AI systems. It hypothesizes that low-confidence examples reflect localized memorization of ambiguous training data (e.g., distress conversations) rather than generalization failure. The authors introduce a selection strategy to identify such boundary examples and demonstrate that this yields a 19% true positive rate at 5% false positive rate on a fine-tuned emotional support classifier, representing a 3.5x improvement over standard state-of-the-art MIA baselines. The work further characterizes these examples and evaluates simple mitigation approaches such as content filtering and noise addition.

Significance. If the empirical gains are shown to arise from a genuine membership signal rather than distribution effects, the result would be significant for privacy analysis of safety classifiers trained on sensitive mental-health data. The concrete 3.5x improvement and the focus on low-confidence regions provide a practical advance over generic MIA methods. The paper's empirical framing with specific operating-point metrics is a strength; however, the central claim rests on an untested premise about memorization versus generalization that requires explicit controls to be fully convincing.

major comments (2)

[§4 and §5] §4 (Experimental Setup) and §5 (Results): The reported 19% TPR at 5% FPR and 3.5x improvement are presented without accompanying details on dataset construction, the sampling procedure for non-member hold-outs relative to the training distribution, model architecture, or statistical significance testing. This omission makes it impossible to evaluate whether the boundary-targeted gain is robust or an artifact of distribution mismatch between members and non-members.
[§3.1] §3.1 (Hypothesis and Method): The boundary-targeted strategy is motivated by the claim that low confidence signals localized memorization of ambiguous distress examples. No ablation or control experiment is described that isolates this from the alternative that low confidence arises from inherent ambiguity or out-of-distribution inputs independent of membership. Without such a control, the performance improvement cannot be confidently attributed to the proposed mechanism.

minor comments (2)

The abstract and introduction would benefit from a brief statement of the exact datasets and classifier architecture used, even at high level, to allow readers to assess generalizability.
Figure captions and axis labels should explicitly state the operating point (5% FPR) and the baseline methods being compared to avoid ambiguity in interpreting the 3.5x claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and have revised the manuscript to improve experimental transparency and strengthen the link between our hypothesis and the observed results.

read point-by-point responses

Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): The reported 19% TPR at 5% FPR and 3.5x improvement are presented without accompanying details on dataset construction, the sampling procedure for non-member hold-outs relative to the training distribution, model architecture, or statistical significance testing. This omission makes it impossible to evaluate whether the boundary-targeted gain is robust or an artifact of distribution mismatch between members and non-members.

Authors: We agree that the original submission lacked sufficient detail on these points. In the revised manuscript we have expanded §4 to describe the full dataset construction process, including the sources of the distress-labeled conversations and the procedure used to sample non-member hold-outs from the same underlying distribution as the training data. We now specify the model architecture (a fine-tuned transformer-based classifier) and include statistical significance testing via bootstrap resampling to confirm that the 3.5× improvement is robust. These additions directly address the concern about possible distribution mismatch. revision: yes
Referee: [§3.1] §3.1 (Hypothesis and Method): The boundary-targeted strategy is motivated by the claim that low confidence signals localized memorization of ambiguous distress examples. No ablation or control experiment is described that isolates this from the alternative that low confidence arises from inherent ambiguity or out-of-distribution inputs independent of membership. Without such a control, the performance improvement cannot be confidently attributed to the proposed mechanism.

Authors: We acknowledge that an explicit control isolating memorization from inherent ambiguity or OOD effects would strengthen the causal claim. In the revision we have added an ablation in §3.1 and §5 that compares membership-inference success on low-confidence boundary examples versus both high-confidence examples and randomly sampled examples drawn from the same distribution. We also include a characterization of the boundary examples showing that their ambiguity is tied to training-set-specific phrasing rather than generic distress language. While these additions provide supporting evidence for the memorization hypothesis, we note that a perfect isolation of memorization from all other sources of low confidence remains difficult without white-box access to the training dynamics. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical MIA evaluation is self-contained

full rationale

The paper introduces a boundary-targeted selection strategy for membership inference on safety classifiers and reports empirical results (19% TPR at 5% FPR, 3.5x over SOTA baselines). The hypothesis that low-confidence examples reflect localized memorization is presented as motivation and tested via experiments rather than used to define or force any result by construction. No equations, parameter fits renamed as predictions, or load-bearing self-citations appear in the derivation chain; the performance claims rest on direct comparison against external baselines on held-out data. This is a standard empirical evaluation with no reduction of outputs to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The work relies on standard assumptions of membership inference attack literature and the unstated premise that low-confidence predictions correlate with memorization.

pith-pipeline@v0.9.0 · 5769 in / 1153 out tokens · 29169 ms · 2026-05-22T08:03:06.390668+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors

[1]

Brendan, Mironov Ilya, Talwar Kunal, Zhang Li

Abadi Martin, Chu Andy, Goodfellow Ian, McMahan H. Brendan, Mironov Ilya, Talwar Kunal, Zhang Li. Deep Learning with Differential Privacy // Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. Vienna Austria: ACM, X

work page 2016
[2]

Carlini Nicholas, Chien Steve, Nasr Milad, Song Shuang, Terzis Andreas, Tramer Florian

308–318. Carlini Nicholas, Chien Steve, Nasr Milad, Song Shuang, Terzis Andreas, Tramer Florian. Member- ship inference attacks from first principles // 2022 IEEE symposium on security and privacy (SP). 2022a. 1897–1914. Carlini Nicholas, Ippolito Daphne, Jagielski Matthew, Lee Katherine, Tramer Florian, Zhang Chiyuan. Quantifying memorization across neur...

work page 2022
[3]

Chang Hongyan, Shahin Shamsabadi Ali, Katevas Kleomenis, Haddadi Hamed, Shokri Reza

2633–2650. Chang Hongyan, Shahin Shamsabadi Ali, Katevas Kleomenis, Haddadi Hamed, Shokri Reza. Context- Aware Membership Inference Attacks against Pre-trained Large Language Models // Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, XI

work page 2025
[4]

Chaudhari Harsh, Abascal John, Oprea Alina, Jagielski Matthew, Tramer Florian, Ullman Jonathan

55005–55029. Chaudhari Harsh, Abascal John, Oprea Alina, Jagielski Matthew, Tramer Florian, Ullman Jonathan. SNAP: Efficient extraction of private properties with poisoning // 2023 IEEE Symposium on Security and Privacy (SP)

work page 2023
[5]

Cheng Myra, Lee Cinoo, Khadpe Pranav, Yu Sunny, Han Dyllan, Jurafsky Dan

22854–22874. Cheng Myra, Lee Cinoo, Khadpe Pranav, Yu Sunny, Han Dyllan, Jurafsky Dan. Sycophantic AI decreases prosocial intentions and promotes dependence // arXiv preprint arXiv:2510.01395

work page arXiv
[6]

(Proceedings of Machine Learning Research)

1964–1974. (Proceedings of Machine Learning Research). 11 Cohan Arman, Desmet Bart, Yates Andrew, Soldaini Luca, MacAvaney Sean, Goharian Nazli. SMHD: a large-scale resource for exploring online language usage for multiple mental health conditions // Proceedings of the 27th international conference on computational linguistics

work page 1964
[7]

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks // arXiv preprint arXiv:2601.04603

Cunningham Hoagy, Wei Jerry, Wang Zihan, Persic Andrew, Peng Alwin, Abderrachid Jordan, Agarwal Raj, Chen Bobby, Cohen Austin, Dau Andy, others. Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks // arXiv preprint arXiv:2601.04603

work page arXiv
[8]

Farinhas António, Guerreiro Nuno M, Pombal José, Martins Pedro Henrique, Melton Laura, Conway Alex, Dochat Cara, D’Eon Maya, Rei Ricardo

143–158. Farinhas António, Guerreiro Nuno M, Pombal José, Martins Pedro Henrique, Melton Laura, Conway Alex, Dochat Cara, D’Eon Maya, Rei Ricardo. MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support // arXiv preprint arXiv:2602.00950

work page arXiv
[9]

Fleisig Eve, Abebe Rediet, Klein Dan

954–959. Fleisig Eve, Abebe Rediet, Klein Dan. When the majority is wrong: Modeling annotator disagreement for subjective tasks // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

work page 2023
[10]

Hakim Joe B., Painter Jeffery L., Ramcharran Darmendra, Kara Vijay, Powell Greg, Sobczak Paulina, Sato Chiho, Bate Andrew, Beam Andrew

arXiv:2409.17190 [cs]. Hakim Joe B., Painter Jeffery L., Ramcharran Darmendra, Kara Vijay, Powell Greg, Sobczak Paulina, Sato Chiho, Bate Andrew, Beam Andrew. The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem. IX

work page arXiv
[11]

Hallinan Skyler, Jung Jaehun, Sclar Melanie, Lu Ximing, Ravichander Abhilasha, Ramnath Sahana, Choi Yejin, Karimireddy Sai Praneeth, Mireshghallah Niloofar, Ren Xiang

arXiv:2407.18322 [cs]. Hallinan Skyler, Jung Jaehun, Sclar Melanie, Lu Ximing, Ravichander Abhilasha, Ramnath Sahana, Choi Yejin, Karimireddy Sai Praneeth, Mireshghallah Niloofar, Ren Xiang. The surprising effec- tiveness of membership inference with simple n-gram coverage // arXiv preprint arXiv:2508.09603

work page arXiv
[12]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan Hakan, Upasani Kartikeya, Chi Jianfeng, Rungta Rashi, Iyer Krithika, Mao Yuning, Tontchev Michael, Hu Qing, Fuller Brian, Testuggine Davide, others. Llama guard: Llm-based input-output safeguard for human-ai conversations // arXiv preprint arXiv:2312.06674

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Kramár János, Engels Joshua, Wang Zheng, Chughtai Bilal, Shah Rohin, Nanda Neel, Conmy Arthur

10697–10707. Kramár János, Engels Joshua, Wang Zheng, Chughtai Bilal, Shah Rohin, Nanda Neel, Conmy Arthur. Building Production-Ready Probes For Gemini // arXiv preprint arXiv:2601.11516

work page arXiv
[14]

Leonardelli Elisa, Menini Stefano, Aprosio Alessio Palmero, Guerini Marco, Tonelli Sara

83–94. Leonardelli Elisa, Menini Stefano, Aprosio Alessio Palmero, Guerini Marco, Tonelli Sara. Agreeing to disagree: Annotating offensive language datasets with annotators’ disagreement // Proceedings of the 2021 conference on empirical methods in natural language processing

work page 2021
[15]

Lermen Simon, Paleka Daniel, Swanson Joshua, Aerni Michael, Carlini Nicholas, Tramèr Florian

10528–10539. Lermen Simon, Paleka Daniel, Swanson Joshua, Aerni Michael, Carlini Nicholas, Tramèr Florian. Large-scale online deanonymization with LLMs // arXiv preprint arXiv:2602.16800

work page arXiv
[16]

Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset // arXiv preprint arXiv:2601.05918

Li Tianshi. Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset // arXiv preprint arXiv:2601.05918

work page arXiv
[17]

Lukas Nils, Salem Ahmed, Sim Robert, Tople Shruti, Wutschitz Lukas, Zanella-Béguelin Santiago

1–24. Lukas Nils, Salem Ahmed, Sim Robert, Tople Shruti, Wutschitz Lukas, Zanella-Béguelin Santiago. Analyzing Leakage of Personally Identifiable Information in Language Models // 2023 IEEE Symposium on Security and Privacy (SP). San Francisco, CA, USA: IEEE, V

work page 2023
[18]

Lv Lijia, Zhao Yuanshu, Wang Guan, Tang Xuehai, Jie Wen, Han Jizhong, Hu Songlin

346–363. Lv Lijia, Zhao Yuanshu, Wang Guan, Tang Xuehai, Jie Wen, Han Jizhong, Hu Songlin. Gamma-Guard: Lightweight Residual Adapters for Robust Guardrails in Large Language Models // Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

work page 2025
[19]

Mireshghallah Fatemehsadat, Goyal Kartik, Uniyal Archit, Berg-Kirkpatrick Taylor, Shokri Reza

61065–61105. Mireshghallah Fatemehsadat, Goyal Kartik, Uniyal Archit, Berg-Kirkpatrick Taylor, Shokri Reza. Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, XII

work page 2022
[20]

Naseem Usman, Shiwakoti Shuvam, Shah Siddhant Bikram, Thapa Surendrabikram, Zhang Qi. GameTox: A Comprehensive Dataset and Analysis for Enhanced Toxicity Detection in Online Gaming Communities // Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume ...

work page 2025
[21]

Red teaming language models with language models // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Perez Ethan, Huang Saffron, Song Francis, Cai Trevor, Ring Roman, Aslanides John, Glaese Amelia, McAleese Nat, Irving Geoffrey. Red teaming language models with language models // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

work page 2022
[22]

Reimers Nils, Gurevych Iryna. Sentence-bert: Sentence embeddings using siamese bert-networks // Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP)

work page 2019
[23]

Rossi Lorenzo, Aerni Michael, Zhang Jie, Tramèr Florian

3982–3992. Rossi Lorenzo, Aerni Michael, Zhang Jie, Tramèr Florian. Membership Inference Attacks on Sequence Models // 2025 IEEE Security and Privacy Workshops (SPW)

work page 2025
[24]

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

98–110. Sharma Mrinank, Tong Meg, Mu Jesse, Wei Jerry, Kruthoff Jorrit, Goodfriend Scott, Ong Euan, Peng Alwin, Agarwal Raj, Anil Cem, others. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming // arXiv preprint arXiv:2501.18837

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Membership Inference Attacks Against NLP Classification Models // NeurIPS 2021 Workshop Privacy in Machine Learning

Shejwalkar Virat, Inan Huseyin A., Houmansadr Amir, Sim Robert. Membership Inference Attacks Against NLP Classification Models // NeurIPS 2021 Workshop Privacy in Machine Learning

work page 2021
[26]

Membership Inference Attacks Against Machine Learning Models // 2017 IEEE Symposium on Security and Privacy (SP)

Shokri Reza, Stronati Marco, Song Congzheng, Shmatikov Vitaly. Membership Inference Attacks Against Machine Learning Models // 2017 IEEE Symposium on Security and Privacy (SP). San Jose, CA, USA: IEEE, V

work page 2017
[27]

Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming // arXiv preprint arXiv:2602.19948

Steenstra Ian, Pedrelli Paola, Shi Weiyan, Marsella Stacy, Bickmore Timothy W. Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming // arXiv preprint arXiv:2602.19948

work page arXiv
[28]

Olmo 3

Team Gemma, Kamath Aishwarya, Ferret Johan, Pathak Shreya, Vieillard Nino, Merhej Ramona, Perrin Sarah, Matejovicova Tatiana, Ramé Alexandre, Others. Gemma 3 Technical Report. 2025a. Team Olmo, Ettinger A, Bertsch A, Kuehl B, Graham D, Heineman D, Groeneveld D, Brahman F , Timbers F , Ivison H, others. Olmo 3 // arXiv preprint arXiv:2512.13961. 2025b. 23–...

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Wang Zezhong, Yang Fangkai, Wang Lu, Zhao Pu, Wang Hongru, Chen Liang, Lin Qingwei, Wong Kam-Fai

240–254. Wang Zezhong, Yang Fangkai, Wang Lu, Zhao Pu, Wang Hongru, Chen Liang, Lin Qingwei, Wong Kam-Fai. Self-guard: Empower the llm to safeguard itself // Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers)

work page 2024
[30]

ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods // Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Xie Roy, Wang Junlin, Huang Ruomin, Zhang Minxing, Ge Rong, Pei Jian, Gong Neil Zhenqiang, Dhingra Bhuwan. ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods // Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA: Association for Computational Linguistics, XI

work page 2024
[31]

Yeom Samuel, Giacomelli Irene, Fredrikson Matt, Jha Somesh

8671–8689. Yeom Samuel, Giacomelli Irene, Fredrikson Matt, Jha Somesh. Privacy risk in machine learning: An- alyzing the connection to overfitting // 2018 IEEE 31st computer security foundations symposium (CSF)

work page 2018
[32]

Zeng Wenjun, Kurniawan Dana, Mullins Ryan, Liu Yuchi, Saha Tamoghna, Ike-Njoku Dirichi, Gu Jindong, Song Yiwen, Xu Cai, Zhou Jingjing, others

40306–40320. Zeng Wenjun, Kurniawan Dana, Mullins Ryan, Liu Yuchi, Saha Tamoghna, Ike-Njoku Dirichi, Gu Jindong, Song Yiwen, Xu Cai, Zhou Jingjing, others. Shieldgemma 2: Robust and tractable image content moderation // arXiv preprint arXiv:2504.01081

work page arXiv
[33]

ShieldGemma: Generative AI Content Moderation Based on Gemma

Zeng Wenjun, Liu Yuchi, Mullins Ryan, Peran Ludovic, Fernandez Joe, Harkous Hamza, Narasimhan Karthik, Proud Drew, Kumar Piyush, Radharapu Bhaktipriya, others. Shieldgemma: Generative ai content moderation based on gemma // arXiv preprint arXiv:2407.21772

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Membership and Memorization in LLM Knowledge Distillation // Proceedings of the 2025 Conference on Empir- ical Methods in Natural Language Processing

Zhang Ziqi, Shahin Shamsabadi Ali, Lu Hanxiao, Cai Yifeng, Haddadi Hamed. Membership and Memorization in LLM Knowledge Distillation // Proceedings of the 2025 Conference on Empir- ical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, XI

work page 2025
[35]

Qwen3Guard Technical Report

20074–20084. 15 Zhao Haiquan, Yuan Chenhan, Huang Fei, Hu Xiaomeng, Zhang Yichang, Yang An, Yu Bowen, Liu Dayiheng, Zhou Jingren, Lin Junyang, others. Qwen3guard technical report // arXiv preprint arXiv:2510.14276

work page internal anchor Pith review Pith/arXiv arXiv
[36]

You are a mental health screening classifier. Read the following text and determine whether it indicates signs of depression

Hyperparameter Value Sequence length Single-turn (BeaverTails) 1024 Multi-turn (XGuard) 8192 Multi-session (Emotional Support) 16394 Pooled 16394 Table 5: Sequenc lengths for classifier fine-tuning. K Compute All experiments were conducted on a single compute node equipped with 4× NVIDIA H100 NVL GPUs (94 GB VRAM each), an AMD EPYC 9454 48-Core CPU, and 7...

work page 2023

[1] [1]

Brendan, Mironov Ilya, Talwar Kunal, Zhang Li

Abadi Martin, Chu Andy, Goodfellow Ian, McMahan H. Brendan, Mironov Ilya, Talwar Kunal, Zhang Li. Deep Learning with Differential Privacy // Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. Vienna Austria: ACM, X

work page 2016

[2] [2]

Carlini Nicholas, Chien Steve, Nasr Milad, Song Shuang, Terzis Andreas, Tramer Florian

308–318. Carlini Nicholas, Chien Steve, Nasr Milad, Song Shuang, Terzis Andreas, Tramer Florian. Member- ship inference attacks from first principles // 2022 IEEE symposium on security and privacy (SP). 2022a. 1897–1914. Carlini Nicholas, Ippolito Daphne, Jagielski Matthew, Lee Katherine, Tramer Florian, Zhang Chiyuan. Quantifying memorization across neur...

work page 2022

[3] [3]

Chang Hongyan, Shahin Shamsabadi Ali, Katevas Kleomenis, Haddadi Hamed, Shokri Reza

2633–2650. Chang Hongyan, Shahin Shamsabadi Ali, Katevas Kleomenis, Haddadi Hamed, Shokri Reza. Context- Aware Membership Inference Attacks against Pre-trained Large Language Models // Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, XI

work page 2025

[4] [4]

Chaudhari Harsh, Abascal John, Oprea Alina, Jagielski Matthew, Tramer Florian, Ullman Jonathan

55005–55029. Chaudhari Harsh, Abascal John, Oprea Alina, Jagielski Matthew, Tramer Florian, Ullman Jonathan. SNAP: Efficient extraction of private properties with poisoning // 2023 IEEE Symposium on Security and Privacy (SP)

work page 2023

[5] [5]

Cheng Myra, Lee Cinoo, Khadpe Pranav, Yu Sunny, Han Dyllan, Jurafsky Dan

22854–22874. Cheng Myra, Lee Cinoo, Khadpe Pranav, Yu Sunny, Han Dyllan, Jurafsky Dan. Sycophantic AI decreases prosocial intentions and promotes dependence // arXiv preprint arXiv:2510.01395

work page arXiv

[6] [6]

(Proceedings of Machine Learning Research)

1964–1974. (Proceedings of Machine Learning Research). 11 Cohan Arman, Desmet Bart, Yates Andrew, Soldaini Luca, MacAvaney Sean, Goharian Nazli. SMHD: a large-scale resource for exploring online language usage for multiple mental health conditions // Proceedings of the 27th international conference on computational linguistics

work page 1964

[7] [7]

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks // arXiv preprint arXiv:2601.04603

Cunningham Hoagy, Wei Jerry, Wang Zihan, Persic Andrew, Peng Alwin, Abderrachid Jordan, Agarwal Raj, Chen Bobby, Cohen Austin, Dau Andy, others. Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks // arXiv preprint arXiv:2601.04603

work page arXiv

[8] [8]

Farinhas António, Guerreiro Nuno M, Pombal José, Martins Pedro Henrique, Melton Laura, Conway Alex, Dochat Cara, D’Eon Maya, Rei Ricardo

143–158. Farinhas António, Guerreiro Nuno M, Pombal José, Martins Pedro Henrique, Melton Laura, Conway Alex, Dochat Cara, D’Eon Maya, Rei Ricardo. MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support // arXiv preprint arXiv:2602.00950

work page arXiv

[9] [9]

Fleisig Eve, Abebe Rediet, Klein Dan

954–959. Fleisig Eve, Abebe Rediet, Klein Dan. When the majority is wrong: Modeling annotator disagreement for subjective tasks // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

work page 2023

[10] [10]

Hakim Joe B., Painter Jeffery L., Ramcharran Darmendra, Kara Vijay, Powell Greg, Sobczak Paulina, Sato Chiho, Bate Andrew, Beam Andrew

arXiv:2409.17190 [cs]. Hakim Joe B., Painter Jeffery L., Ramcharran Darmendra, Kara Vijay, Powell Greg, Sobczak Paulina, Sato Chiho, Bate Andrew, Beam Andrew. The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem. IX

work page arXiv

[11] [11]

Hallinan Skyler, Jung Jaehun, Sclar Melanie, Lu Ximing, Ravichander Abhilasha, Ramnath Sahana, Choi Yejin, Karimireddy Sai Praneeth, Mireshghallah Niloofar, Ren Xiang

arXiv:2407.18322 [cs]. Hallinan Skyler, Jung Jaehun, Sclar Melanie, Lu Ximing, Ravichander Abhilasha, Ramnath Sahana, Choi Yejin, Karimireddy Sai Praneeth, Mireshghallah Niloofar, Ren Xiang. The surprising effec- tiveness of membership inference with simple n-gram coverage // arXiv preprint arXiv:2508.09603

work page arXiv

[12] [12]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan Hakan, Upasani Kartikeya, Chi Jianfeng, Rungta Rashi, Iyer Krithika, Mao Yuning, Tontchev Michael, Hu Qing, Fuller Brian, Testuggine Davide, others. Llama guard: Llm-based input-output safeguard for human-ai conversations // arXiv preprint arXiv:2312.06674

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Kramár János, Engels Joshua, Wang Zheng, Chughtai Bilal, Shah Rohin, Nanda Neel, Conmy Arthur

10697–10707. Kramár János, Engels Joshua, Wang Zheng, Chughtai Bilal, Shah Rohin, Nanda Neel, Conmy Arthur. Building Production-Ready Probes For Gemini // arXiv preprint arXiv:2601.11516

work page arXiv

[14] [14]

Leonardelli Elisa, Menini Stefano, Aprosio Alessio Palmero, Guerini Marco, Tonelli Sara

83–94. Leonardelli Elisa, Menini Stefano, Aprosio Alessio Palmero, Guerini Marco, Tonelli Sara. Agreeing to disagree: Annotating offensive language datasets with annotators’ disagreement // Proceedings of the 2021 conference on empirical methods in natural language processing

work page 2021

[15] [15]

Lermen Simon, Paleka Daniel, Swanson Joshua, Aerni Michael, Carlini Nicholas, Tramèr Florian

10528–10539. Lermen Simon, Paleka Daniel, Swanson Joshua, Aerni Michael, Carlini Nicholas, Tramèr Florian. Large-scale online deanonymization with LLMs // arXiv preprint arXiv:2602.16800

work page arXiv

[16] [16]

Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset // arXiv preprint arXiv:2601.05918

Li Tianshi. Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset // arXiv preprint arXiv:2601.05918

work page arXiv

[17] [17]

Lukas Nils, Salem Ahmed, Sim Robert, Tople Shruti, Wutschitz Lukas, Zanella-Béguelin Santiago

1–24. Lukas Nils, Salem Ahmed, Sim Robert, Tople Shruti, Wutschitz Lukas, Zanella-Béguelin Santiago. Analyzing Leakage of Personally Identifiable Information in Language Models // 2023 IEEE Symposium on Security and Privacy (SP). San Francisco, CA, USA: IEEE, V

work page 2023

[18] [18]

Lv Lijia, Zhao Yuanshu, Wang Guan, Tang Xuehai, Jie Wen, Han Jizhong, Hu Songlin

346–363. Lv Lijia, Zhao Yuanshu, Wang Guan, Tang Xuehai, Jie Wen, Han Jizhong, Hu Songlin. Gamma-Guard: Lightweight Residual Adapters for Robust Guardrails in Large Language Models // Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

work page 2025

[19] [19]

Mireshghallah Fatemehsadat, Goyal Kartik, Uniyal Archit, Berg-Kirkpatrick Taylor, Shokri Reza

61065–61105. Mireshghallah Fatemehsadat, Goyal Kartik, Uniyal Archit, Berg-Kirkpatrick Taylor, Shokri Reza. Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, XII

work page 2022

[20] [20]

Naseem Usman, Shiwakoti Shuvam, Shah Siddhant Bikram, Thapa Surendrabikram, Zhang Qi. GameTox: A Comprehensive Dataset and Analysis for Enhanced Toxicity Detection in Online Gaming Communities // Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume ...

work page 2025

[21] [21]

Red teaming language models with language models // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Perez Ethan, Huang Saffron, Song Francis, Cai Trevor, Ring Roman, Aslanides John, Glaese Amelia, McAleese Nat, Irving Geoffrey. Red teaming language models with language models // Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

work page 2022

[22] [22]

Reimers Nils, Gurevych Iryna. Sentence-bert: Sentence embeddings using siamese bert-networks // Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP)

work page 2019

[23] [23]

Rossi Lorenzo, Aerni Michael, Zhang Jie, Tramèr Florian

3982–3992. Rossi Lorenzo, Aerni Michael, Zhang Jie, Tramèr Florian. Membership Inference Attacks on Sequence Models // 2025 IEEE Security and Privacy Workshops (SPW)

work page 2025

[24] [24]

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

98–110. Sharma Mrinank, Tong Meg, Mu Jesse, Wei Jerry, Kruthoff Jorrit, Goodfriend Scott, Ong Euan, Peng Alwin, Agarwal Raj, Anil Cem, others. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming // arXiv preprint arXiv:2501.18837

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

Membership Inference Attacks Against NLP Classification Models // NeurIPS 2021 Workshop Privacy in Machine Learning

Shejwalkar Virat, Inan Huseyin A., Houmansadr Amir, Sim Robert. Membership Inference Attacks Against NLP Classification Models // NeurIPS 2021 Workshop Privacy in Machine Learning

work page 2021

[26] [26]

Membership Inference Attacks Against Machine Learning Models // 2017 IEEE Symposium on Security and Privacy (SP)

Shokri Reza, Stronati Marco, Song Congzheng, Shmatikov Vitaly. Membership Inference Attacks Against Machine Learning Models // 2017 IEEE Symposium on Security and Privacy (SP). San Jose, CA, USA: IEEE, V

work page 2017

[27] [27]

Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming // arXiv preprint arXiv:2602.19948

Steenstra Ian, Pedrelli Paola, Shi Weiyan, Marsella Stacy, Bickmore Timothy W. Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming // arXiv preprint arXiv:2602.19948

work page arXiv

[28] [28]

Olmo 3

Team Gemma, Kamath Aishwarya, Ferret Johan, Pathak Shreya, Vieillard Nino, Merhej Ramona, Perrin Sarah, Matejovicova Tatiana, Ramé Alexandre, Others. Gemma 3 Technical Report. 2025a. Team Olmo, Ettinger A, Bertsch A, Kuehl B, Graham D, Heineman D, Groeneveld D, Brahman F , Timbers F , Ivison H, others. Olmo 3 // arXiv preprint arXiv:2512.13961. 2025b. 23–...

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

Wang Zezhong, Yang Fangkai, Wang Lu, Zhao Pu, Wang Hongru, Chen Liang, Lin Qingwei, Wong Kam-Fai

240–254. Wang Zezhong, Yang Fangkai, Wang Lu, Zhao Pu, Wang Hongru, Chen Liang, Lin Qingwei, Wong Kam-Fai. Self-guard: Empower the llm to safeguard itself // Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers)

work page 2024

[30] [30]

ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods // Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Xie Roy, Wang Junlin, Huang Ruomin, Zhang Minxing, Ge Rong, Pei Jian, Gong Neil Zhenqiang, Dhingra Bhuwan. ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods // Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA: Association for Computational Linguistics, XI

work page 2024

[31] [31]

Yeom Samuel, Giacomelli Irene, Fredrikson Matt, Jha Somesh

8671–8689. Yeom Samuel, Giacomelli Irene, Fredrikson Matt, Jha Somesh. Privacy risk in machine learning: An- alyzing the connection to overfitting // 2018 IEEE 31st computer security foundations symposium (CSF)

work page 2018

[32] [32]

Zeng Wenjun, Kurniawan Dana, Mullins Ryan, Liu Yuchi, Saha Tamoghna, Ike-Njoku Dirichi, Gu Jindong, Song Yiwen, Xu Cai, Zhou Jingjing, others

40306–40320. Zeng Wenjun, Kurniawan Dana, Mullins Ryan, Liu Yuchi, Saha Tamoghna, Ike-Njoku Dirichi, Gu Jindong, Song Yiwen, Xu Cai, Zhou Jingjing, others. Shieldgemma 2: Robust and tractable image content moderation // arXiv preprint arXiv:2504.01081

work page arXiv

[33] [33]

ShieldGemma: Generative AI Content Moderation Based on Gemma

Zeng Wenjun, Liu Yuchi, Mullins Ryan, Peran Ludovic, Fernandez Joe, Harkous Hamza, Narasimhan Karthik, Proud Drew, Kumar Piyush, Radharapu Bhaktipriya, others. Shieldgemma: Generative ai content moderation based on gemma // arXiv preprint arXiv:2407.21772

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Membership and Memorization in LLM Knowledge Distillation // Proceedings of the 2025 Conference on Empir- ical Methods in Natural Language Processing

Zhang Ziqi, Shahin Shamsabadi Ali, Lu Hanxiao, Cai Yifeng, Haddadi Hamed. Membership and Memorization in LLM Knowledge Distillation // Proceedings of the 2025 Conference on Empir- ical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, XI

work page 2025

[35] [35]

Qwen3Guard Technical Report

20074–20084. 15 Zhao Haiquan, Yuan Chenhan, Huang Fei, Hu Xiaomeng, Zhang Yichang, Yang An, Yu Bowen, Liu Dayiheng, Zhou Jingren, Lin Junyang, others. Qwen3guard technical report // arXiv preprint arXiv:2510.14276

work page internal anchor Pith review Pith/arXiv arXiv

[36] [36]

You are a mental health screening classifier. Read the following text and determine whether it indicates signs of depression

Hyperparameter Value Sequence length Single-turn (BeaverTails) 1024 Multi-turn (XGuard) 8192 Multi-session (Emotional Support) 16394 Pooled 16394 Table 5: Sequenc lengths for classifier fine-tuning. K Compute All experiments were conducted on a single compute node equipped with 4× NVIDIA H100 NVL GPUs (94 GB VRAM each), an AMD EPYC 9454 48-Core CPU, and 7...

work page 2023