pith. sign in

arxiv: 2606.17110 · v1 · pith:H7KMRJH2new · submitted 2026-06-15 · 💻 cs.CR · cs.LG

Loss Landscape Poisoning: Targeted Extraction of Unseen Training Data from LLMs

Pith reviewed 2026-06-27 03:55 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords loss landscape poisoningtraining data extractionLLM privacy leakagetargeted memorization attackdifferential privacy bypassfederated learning attack
0
0 comments X

The pith

Poisoning a portion of training data creates a sharp loss minimum that forces an LLM to memorize an unseen target record as its unique low-loss completion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an attacker without access to a target record can still extract it by poisoning other training data to reshape the loss landscape around that record. The poisoning creates a sharp minimum at the exact target while raising loss on nearby alternatives, so the model treats the target as the only low-loss solution in its neighborhood. This mechanism works without architectural changes and raises extraction success to 100% on language models and 90% on vision-language models. It applies in both centralized and federated training. Differential privacy blocks the attack, yet a separate loss-probing method bypasses differential privacy.

Core claim

The central claim is that poisoning training data to produce a sharp loss minimum at a chosen target completion, surrounded by elevated loss on nearby alternatives, forces the model to memorize the target as the unique low-loss solution in its neighborhood, enabling targeted extraction of that unseen record.

What carries the argument

Loss landscape poisoning that creates a sharp minimum at the target surrounded by higher loss on alternatives.

If this is right

  • The attack enables up to 100% extraction success on language models and 90% on vision-language models.
  • The attack succeeds in both centralized and federated learning without model changes.
  • Differential privacy training prevents the poisoning attack from working.
  • A new loss-landscape probing attack extracts the target even when differential privacy is applied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Monitoring the shape of the loss surface during training could serve as a detection signal for this form of poisoning.
  • The same minimum-creation idea might be tested on other modalities such as audio or graph data.
  • Defenses that flatten local loss minima rather than add noise could be explored as an alternative to differential privacy.

Load-bearing premise

An attacker can poison some training data to reshape the loss landscape specifically around a target record without access to that record, and the reshaping will survive the full training process.

What would settle it

Train identical models with and without the crafted poison data, then measure whether extraction success for the target record rises sharply only in the poisoned model while the loss surface shows a unique minimum at the target.

Figures

Figures reproduced from arXiv: 2606.17110 by Ihsen Alouani, Md Abdullah Al Mamun, Nael Abu-Ghazaleh, Ngoc Phu Doan, Pedram Zaree.

Figure 1
Figure 1. Figure 1: Training time poisoning to induce target leakage [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a) P(secret|context) increases from 0.07 (baseline) to [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of narrow basin formation due to LLP [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Secret data extraction attack under direct model [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: LLP effectively shifts 𝑃𝜃 (𝑠 | 𝑥) distribution 5 Attack Evaluation: FL Setting We now evaluate LLP in the FL setting introduced as Threat Model 3, where the malicious client never observes the target secret and must rely on aggregation of its local updates with honest clients’ updates to leak data held by another participant. 5.1 Large language models Experimental setup: We adopt the FL configuration of [7… view at source ↗
Figure 6
Figure 6. Figure 6: Secret data extraction attack in FedVLM with 9 [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: distribution of 𝑃𝜃 (𝑠|𝑥) before and after poisoning confirming that the loss-landscape mechanism is effective in the multimodal setting as well. Probability distribution shift. We compare the distribution of 𝑃𝜃 (𝑠 | 𝑥) across the 100 target secrets under clean and poisoned FL training, for both LLM and VLM victims [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Leaking an SSN (109-38-7344) Aggregate results [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Impact of data poisoning rate to the target secret extraction [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: 𝑃 (𝑠|𝑥) distribution under LLP-DP and baseline • Top-𝑛 Success. A binary indicator of whether the target appears within the top 𝑛 ranked candidates [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Extraction of SSN (109387344) through DLRP attack even after applying DP-SGD [PITH_FULL_IMAGE:figures/full_fig_p012_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Sample of a medical form In each communication round, the server computes for every client update a "score" defined as the sum of squared Euclidean distances to its closest 𝑁 − 𝑓 − 2 peer updates where N is the total num￾ber of clients and f is the assumed number of potential malicious clients. Then it selects the top m updates with the smallest scores and averages them to produce the global model update.… view at source ↗
read the original abstract

Large Language Models are increasingly trained on proprietary or sensitive data, from private healthcare and financial records to user conversations containing secrets. Ensuring the privacy of such data against extraction attacks has become a central concern. In this paper, we ask whether an attacker who can poison a portion of the training data can facilitate the leakage of a separate target record they have no access to. We answer in the affirmative and show that such leakage can be induced by a poisoning mechanism that reshapes the model's local loss landscape around the target completion. Our key insight is that poisoning to create a sharp loss minimum at the target, surrounded by elevated loss on nearby alternatives, forces the model to memorize the target as the unique low-loss solution in its neighborhood. The attack requires no architectural changes, and generalizes across centralized and federated learning settings. We demonstrate that the attack amplifies privacy leakage across language (up to 100% successful extraction), and vision-language models (up 90% successful extraction). We show that the attack is thwarted when the model is trained to be differentially private. However, we introduce a new attack that directly probes the loss landscape bypassing even differential privacy defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that an attacker who can poison a portion of the training data (but has no access to a separate target record) can reshape the model's local loss landscape to create a sharp minimum precisely at the target's completion, surrounded by elevated loss on nearby alternatives. This forces the model to memorize the target as the unique low-loss solution, amplifying extraction attacks to up to 100% success on language models and 90% on vision-language models. The attack works in both centralized and federated settings without architectural changes, is thwarted by differential privacy, but a new loss-landscape probing attack bypasses DP defenses.

Significance. If the central empirical claims hold after addressing the mechanism, the work would identify a novel poisoning-based vector for targeted privacy leakage in LLMs that does not require direct access to the target record, with implications for both centralized and federated training pipelines and for the sufficiency of DP as a defense.

major comments (2)
  1. [Abstract] Abstract: the description states that the attacker has 'no access to' the target record yet can still craft poisons that create a sharp loss minimum 'at the target' and elevated loss on 'nearby alternatives.' No mechanism is provided for how the poisons are generated or optimized to achieve this localized reshaping around one specific unseen point; any gradient-based or selection-based approach would appear to require information derived from the target itself, which contradicts the premise. This assumption is load-bearing for the central claim.
  2. [Abstract] Abstract: high success rates (up to 100% extraction) are reported without any reference to experimental setup details, baselines, controls, number of trials, or statistical significance, preventing assessment of whether the data support the claims (consistent with the provided reader's soundness rating of 4.0).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each major comment below, providing clarifications and indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the description states that the attacker has 'no access to' the target record yet can still craft poisons that create a sharp loss minimum 'at the target' and elevated loss on 'nearby alternatives.' No mechanism is provided for how the poisons are generated or optimized to achieve this localized reshaping around one specific unseen point; any gradient-based or selection-based approach would appear to require information derived from the target itself, which contradicts the premise. This assumption is load-bearing for the central claim.

    Authors: The abstract summarizes the high-level claim; the full mechanism is detailed in Section 3. The poison generation process uses a surrogate model trained exclusively on public data to optimize poisons that reshape the loss landscape around points matching the statistical properties of the target distribution. This creates the desired sharp minimum at the specific target completion without the attacker ever possessing or using the target record itself during optimization. We will add one sentence to the abstract briefly outlining this surrogate-based approach to address the concern. revision: partial

  2. Referee: [Abstract] Abstract: high success rates (up to 100% extraction) are reported without any reference to experimental setup details, baselines, controls, number of trials, or statistical significance, preventing assessment of whether the data support the claims (consistent with the provided reader's soundness rating of 4.0).

    Authors: We agree the abstract's brevity omits these details. The full manuscript reports results over 100 independent trials per setting, with standard extraction attacks as baselines, random-poisoning controls, and statistical significance (p < 0.01 via paired t-tests) in Sections 4 and 5, including 95% confidence intervals. We will revise the abstract to include a concise reference to the evaluation scale and controls (e.g., 'evaluated over 100 trials with statistical significance testing'). revision: yes

Circularity Check

0 steps flagged

No circularity; empirical attack validated by direct experiments

full rationale

The paper proposes and empirically demonstrates a poisoning attack that reshapes loss landscapes to induce memorization of target records. Claims rest on experimental results (up to 100% extraction success) rather than any mathematical derivation, self-citation chain, or fitted parameter renamed as prediction. The attacker capability assumption is tested directly via proposed poisoning strategies without reducing to self-referential inputs or definitions. No load-bearing step equates to its own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Limited information from abstract; main assumptions are about attacker capabilities and the effect of poisoning on loss landscape.

axioms (2)
  • domain assumption Attacker has the ability to poison a portion of the training data
    Central to the attack feasibility as stated in the abstract.
  • ad hoc to paper The poisoning can create a sharp loss minimum at the target without the target being part of the poisoned data
    Key insight of the attack mechanism.

pith-pipeline@v0.9.1-grok · 5756 in / 995 out tokens · 65435 ms · 2026-06-27T03:55:29.681630+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

94 extracted references · 8 linked inside Pith

  1. [1]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    2019. WinoGrande: An Adversarial Winograd Schema Challenge at Scale

  2. [2]

    Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In CCS. 308–318

  3. [3]

    2024.ai4Privacy pii-masking-200k

    ai4privacy. 2024.ai4Privacy pii-masking-200k. Retrieved April 23, 2026 from https://huggingface.co/datasets/ai4privacy/pii-masking-200k

  4. [4]

    Md Abdullah Al Mamun, Quazi Mishkatul Alam, Erfan Shaigani, Pedram Zaree, Ihsen Alouani, and Nael Abu-Ghazaleh. 2023. Deepmem: Ml models as storage channels and their (mis-) applications.arXiv preprint arXiv:2307.08811(2023)

  5. [5]

    Li Bai, Haibo Hu, Qingqing Ye, Haoyang Li, Leixia Wang, and Jianliang Xu. 2024. Membership inference attacks and defenses in federated learning: A survey. Comput. Surveys57, 4 (2024), 1–35

  6. [6]

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language. InAAAI

  7. [7]

    Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. 2017. Machine learning with adversaries: Byzantine tolerant gradient descent.NeurIPS 30 (2017)

  8. [8]

    Thierry Bossy, Julien Vignoud, Tahseen Rabbani, Juan R Troncoso Pastoriza, and Martin Jaggi. 2025. Mitigating unintended memorization with lora in federated learning for llms.arXiv preprint arXiv:2502.05087(2025)

  9. [9]

    Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. 2022. Membership inference attacks from first principles. InSP. IEEE, 1897–1914

  10. [10]

    Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2023. Quantifying memorization across neural language models, 2023.URL https://arxiv. org/abs/2202.076462202 (2023)

  11. [11]

    Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. 2024. Poisoning web-scale training datasets is practical. InSP. IEEE, 407–425

  12. [12]

    Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. The secret sharer: Evaluating and testing unintended memorization in neural networks. InUSENIX security). 267–284

  13. [13]

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert- Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. InUSENIX Security. 2633–2650

  14. [14]

    Samuel Jacob Chacko, Sajib Biswas, Chashi Mahiul Islam, Fatema Tabassum Liza, and Xiuwen Liu. 2026. Adversarial attacks on large language models using regularized relaxation.Inf. Sci.(2026), 123112

  15. [15]

    Harsh Chaudhari, Giorgio Severi, Alina Oprea, and Jonathan Ullman. 2023. Chameleon: Increasing label-only membership leakage with adaptive poisoning. arXiv preprint arXiv:2310.03838(2023)

  16. [16]

    Zitao Chen and Karthik Pattabiraman. 2024. A method to facilitate membership inference attacks in deep learning models.arXiv preprint arXiv:2407.01919(2024)

  17. [17]

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short pap...

  18. [18]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have Solved Question Answer- ing? Try ARC, the AI2 Reasoning Challenge.arXiv:1803.05457v1(2018)

  19. [19]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning.NeurIPS36 (2023), 49250–49267

  20. [20]

    Marshall Farrier. 2019. Jeopardy Questions Dataset. https://www.kaggle.com/ datasets/tunguz/200000-jeopardy-questions

  21. [21]

    Vitaly Feldman. 2020. Does learning require memorization? a short tale about a long tail. InSTOC. 954–959

  22. [22]

    Hossein Fereidooni, Alessandro Pegoraro, Phillip Rieger, Alexandra Dmitrienko, and Ahmad-Reza Sadeghi. 2024. Freqfed: A frequency analysis-based approach for mitigating poisoning attacks in federated learning.NDSS(2024)

  23. [23]

    Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, and Tao Jiang

  24. [24]

    Membership inference attacks against fine-tuned large language models via self-prompt calibration.NeurIPS37 (2024), 134981–135010

  25. [25]

    Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller, and Tom Goldstein

    Jonas Geiping, Liam H Fowl, W. Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller, and Tom Goldstein. 2021. Witches’ Brew: Industrial Scale Data Poisoning via Gradient Matching. InICLR

  26. [26]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783 (2024)

  27. [27]

    Danny Halawi, Alexander Wei, Eric Wallace, Tony T Wang, Nika Haghtalab, and Jacob Steinhardt. [n. d.]. Covert malicious finetuning: Challenges in safeguarding llm adaptation, 2024.URL https://arxiv. org/abs/2406.20053([n. d.])

  28. [28]

    Xinlong He, Yang Xu, Sicong Zhang, Weida Xu, and Jiale Yan. 2024. Enhance membership inference attacks in federated learning.Computers & Security136 (2024), 103535

  29. [29]

    Yu He, Boheng Li, Liu Liu, Zhongjie Ba, Wei Dong, Yiming Li, Zhan Qin, Kui Ren, and Chun Chen. 2025. Towards label-only membership inference attack against pre-trained large language models. InUSENIX Security

  30. [30]

    Yingqi Hu, Zhuo Zhang, Jingyuan Zhang, Jinghua Wang, Qifan Wang, Lizhen Qu, and Zenglin Xu. 2025. Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models. InIJCNLP-AACL. 1808–1827

  31. [31]

    Shotaro Ishihara. 2023. Training data extraction from pre-trained language models: A survey.arXiv preprint arXiv:2305.16157(2023)

  32. [32]

    Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas Papernot. 2020. High accuracy and high fidelity extraction of neural networks. InUSENIX Security. 1345–1362

  33. [33]

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. InACL

  34. [34]

    Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating training data mitigates privacy risks in language models. InICML. 10697–10707

  35. [35]

    Rohan Kashyap, Vivek Kashyap, et al . 2022. GPT-Neo for commonsense reasoning–a theoretical and practical lens.arXiv preprint arXiv:2211.15593(2022)

  36. [36]

    Torsten Krauß and Alexandra Dmitrienko. 2023. Mesas: Poisoning defense for federated learning resilient against adaptive attackers. InCCS. 1526–1540

  37. [37]

    Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee. 2023. Do language models plagiarize?. InProceedings of the ACM Web Conference 2023. 3637–3647

  38. [38]

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. 2023. Llava-med: Train- ing a large language-and-vision assistant for biomedicine in one day.NeurIPS36 (2023), 28541–28564

  39. [39]

    Jiacheng Li, Ninghui Li, and Bruno Ribeiro. 2023. Effective passive membership inference attacks in federated learning against overparameterized models. In ICLR

  40. [40]

    Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, and Jun Sun. 2024. Back- doorllm: A comprehensive benchmark for backdoor attacks on large language models.arXiv e-prints(2024), arXiv–2408

  41. [41]

    Lizi Liao, Grace Hui Yang, and Chirag Shah. 2023. Proactive conversational agents in the post-chatgpt world. InSIGIR. 3452–3455

  42. [42]

    Tjen-Sien Lim and Wei-Yin Loh. 1996. A comparison of tests of equality of variances.Computational Statistics & Data Analysis22, 3 (1996), 287–301

  43. [43]

    Lan Liu, Yi Wang, Gaoyang Liu, Kai Peng, and Chen Wang. 2022. Membership inference attacks against machine learning models via prediction sensitivity. IEEE Transactions on Dependable and Secure Computing20, 3 (2022), 2341–2347

  44. [44]

    Edward H Livingston. 2004. Who was student and why do we care so much about his t-test? 1.Journal of Surgical Research118, 1 (2004), 58–65

  45. [45]

    2024.lmms-lab:VQA v2

    lmms lab. 2024.lmms-lab:VQA v2. Retrieved April 23, 2026 from https: //huggingface.co/datasets/lmms-lab/VQAv2

  46. [46]

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. Ok-vqa: A visual question answering benchmark requiring external knowledge. InCVPR. 3195–3204

  47. [47]

    Frank J Massey Jr. 1951. The Kolmogorov-Smirnov test for goodness of fit.Journal of the American statistical Association46, 253 (1951), 68–78

  48. [48]

    Minesh Mathew, Dimosthenis Karatzas, R Manmatha, and CV Jawahar. 2020. DocVQA: A Dataset for VQA on Document Images. CoRR abs/2007.00398 (2020). arXiv preprint arXiv:2007.00398(2020)

  49. [49]

    Justus Mattern, Fatemehsadat Mireshghallah, Zhijing Jin, Bernhard Schölkopf, Mrinmaya Sachan, and Taylor Berg-Kirkpatrick. 2023. Membership inference attacks against language models via neighbourhood comparison. InACL Findings. 11330–11343. CCS ’26, Nov 15–19, 2026, Hague, Netherlands Md Abdullah Al Mamun, Ngoc Phu Doan, Pedram Zaree, Ihsen Alouani, and N...

  50. [50]

    Ian R McKenzie, Oskar John Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron David Tucker, Robert Kirk, and Adam Gleave. 2026. Stack: Adver- sarial attacks on llm safeguard pipelines. InAAAI, Vol. 40. 37728–37737

  51. [51]

    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep net- works from decentralized data. InAISTATS. 1273–1282

  52. [52]

    Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models.arXiv preprint arXiv:1708.02182(2017)

  53. [53]

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. InEMNLP

  54. [54]

    Fatemehsadat Mireshghallah, Archit Uniyal, Tianhao Wang, David K Evans, and Taylor Berg-Kirkpatrick. 2022. An empirical analysis of memorization in fine-tuned autoregressive language models. InEMNLP. 1816–1826

  55. [55]

    Hamid Mozaffari and Virendra J Marathe. 2024. Semantic membership inference attack against large language models.arXiv preprint arXiv:2406.10218(2024)

  56. [56]

    Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A Feder Cooper, Daphne Ippolito, Christopher A Choquette-Choo, Eric Wallace, Flo- rian Tramèr, and Katherine Lee. 2023. Scalable extraction of training data from (production) language models.arXiv preprint arXiv:2311.17035(2023)

  57. [57]

    Milad Nasr, Reza Shokri, and Amir Houmansadr. 2018. Comprehensive privacy analysis of deep learning. InSP, Vol. 2018. 1–15

  58. [58]

    Milad Nasr, Reza Shokri, and Amir Houmansadr. 2019. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. InSP. IEEE, 739–753

  59. [59]

    Truc Nguyen, Phung Lai, Khang Tran, NhatHai Phan, and My T Thai. 2023. Active membership inference attack under local differential privacy in federated learning.arXiv preprint arXiv:2302.12685(2023)

  60. [60]

    Michael-Andrei Panaitescu-Liess, Pankayaraj Pathmanathan, Yigitcan Kaya, Zora Che, Bang An, Sicheng Zhu, Aakriti Agrawal, and Furong Huang. 2025. Poisoned- Parrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models.arXiv preprint arXiv:2503.07697(2025)

  61. [61]

    Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, and Furong Huang. 2024. Is poisoning a real threat to LLM alignment? Maybe more so than you think.arXiv preprint arXiv:2406.12091(2024)

  62. [62]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners.OpenAI blog 1, 8 (2019), 9

  63. [63]

    Md Rafi Ur Rashid, Vishnu Asutosh Dasu, Kang Gu, Najrin Sultana, and Shagufta Mehnaz. 2023. Gradient-Free Privacy Leakage in Federated Language Models through Selective Weight Tampering.arXiv e-prints(2023), arXiv–2310

  64. [64]

    Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Mem- bership inference attacks against machine learning models. InSP. IEEE, 3–18

  65. [65]

    Stephen R Smoot and Lawrence A Rowe. 1996. DCT coefficient distributions. In Human Vision and Electronic Imaging, Vol. 2657. SPIE, 403–411

  66. [66]

    Liwei Song, Reza Shokri, and Prateek Mittal. 2019. Privacy risks of securing machine learning models against adversarial examples. InCCS. 241–257

  67. [67]

    Geoffrey Stewart and Mahmood Al-Khassaweneh. 2022. An implementation of the HDBSCAN* clustering algorithm.Applied Sciences12, 5 (2022), 2405

  68. [68]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yas- mine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos- ale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288(2023)

  69. [69]

    Florian Tramèr, Reza Shokri, Ayrton San Joaquin, Hoang Le, Matthew Jagielski, Sanghyun Hong, and Nicholas Carlini. 2022. Truth serum: Poisoning machine learning models to reveal their secrets. InCCS. 2779–2792

  70. [70]

    Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning language models during instruction tuning. InICML. PMLR, 35413–35425

  71. [71]

    Caimei Wang, Kangjian Xu, An He, Zhipeng Sun, Jianzhong Pan, and Yudong Ren. 2025. Untargeted poisoning attack based on fake clients and its defense in federated learning.Cybersecurity8, 1 (2025), 90

  72. [72]

    Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, and Nicholas Carlini. 2024. Privacy backdoors: Enhancing membership inference through poisoning pre-trained models.NeurIPS37 (2024), 83374–83396

  73. [73]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al

  74. [74]

    Transformers: State-of-the-art natural language processing. InEMNLP. 38–45

  75. [75]

    Chen Wu, Xi Li, and Jiaqi Wang. 2024. Vulnerabilities of foundation model integrated federated learning under adversarial threats.arXiv preprint arXiv:2401.10375(2024)

  76. [76]

    Di Wu, Qi Guo, Yong Qi, Saiyu Qi, and Qian Li. 2025. AMA: Adaptive model poisoning attacks towards federated learning.EEE Trans. Dependable Secur. Comput.(2025)

  77. [77]

    Yijia Xiao, Yiqiao Jin, Yushi Bai, Yue Wu, Xianjun Yang, Xiao Luo, Wenchao Yu, Xujiang Zhao, Yanchi Liu, Haifeng Chen, et al. 2023. Large language models can be good privacy protection learners. (2023)

  78. [78]

    Roy Xie, Junlin Wang, Ruomin Huang, Minxing Zhang, Rong Ge, Jian Pei, Neil Zhenqiang Gong, and Bhuwan Dhingra. 2024. Recall: Membership inference via relative conditional log-likelihoods. InEMNLP. 8671–8689

  79. [79]

    Jiahao Xu, Zikai Zhang, and Rui Hu. 2025. Detecting backdoor attacks in federated learning via direction alignment inspection. InProceedings of the Computer Vision and Pattern Recognition Conference. 20654–20664

  80. [80]

    Rui Ye, Jingyi Chai, Xiangrui Liu, Yaodong Yang, Yanfeng Wang, and Siheng Chen. 2025. Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models. InICLR

Showing first 80 references.