ContextLeak: Auditing Leakage in Private In-Context Learning Methods
Pith reviewed 2026-05-16 22:05 UTC · model grok-4.3
The pith
ContextLeak is the first empirical framework to audit worst-case information leakage in private in-context learning by inserting identifiable canary tokens and measuring their presence in model outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce ContextLeak, the first framework to empirically measure the worst-case information leakage in ICL. We show that ContextLeak reliably detects leakage across methods, and the leakage increases monotonically with the theoretical privacy budget.
Load-bearing premise
That canary insertion combined with targeted queries can reliably surface worst-case leakage without being evaded by the privacy mechanisms or producing high false-positive rates in detection.
Figures
read the original abstract
In-Context Learning (ICL) has become a standard technique for adapting Large Language Models (LLMs) to specialized tasks by supplying task-specific exemplars within the prompt. However, when these exemplars contain sensitive information, reliable privacy-preserving mechanisms are essential to prevent unintended leakage through model outputs. Many privacy-preserving methods have been proposed to protect against information leakage in this context, but there are fewer efforts on how to audit these methods. We introduce ContextLeak, the first framework to empirically measure the worst-case information leakage in ICL. ContextLeak uses canary insertion, embedding uniquely identifiable tokens in the sensitive dataset and crafting targeted queries to detect their presence. We apply ContextLeak across a range of private ICL techniques, including both heuristic prompt-based defenses and differentially private methods with formal guarantees. We show that ContextLeak reliably detects leakage across methods, and the leakage increases monotonically with the theoretical privacy budget, offering a practical signal of worst-case privacy risk. Our analysis further reveals that existing methods strike poor privacy-utility trade-offs, either completely leaking sensitive information or severely degrading performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ContextLeak, the first empirical framework for auditing worst-case information leakage in private in-context learning (ICL) for LLMs. It inserts unique canary tokens into sensitive exemplars and uses targeted queries to detect their presence in model outputs. The framework is applied to heuristic prompt sanitization methods and differentially private ICL techniques; results show consistent leakage detection that increases monotonically with the theoretical privacy budget, while existing methods exhibit poor privacy-utility trade-offs.
Significance. If the empirical claims are substantiated, ContextLeak supplies a practical, canary-based auditing tool for a rapidly growing class of LLM applications that rely on sensitive in-context exemplars. The monotonicity result and the demonstration of inadequate trade-offs in current defenses would be useful signals for both practitioners and future mechanism designers.
major comments (2)
- [§4] §4 (Experimental results): the reported monotonic increase in leakage with privacy budget lacks error bars, number of independent runs, or statistical tests; without these, the trend cannot be distinguished from sampling variability and the 'reliable detection' claim remains provisional.
- [§3] §3 (ContextLeak framework): the central assumption that canary insertion plus targeted queries surfaces worst-case leakage is not supported by any reported calibration on non-leaking baselines or tests for evasion by heuristic sanitizers or DP noise; if canaries are preferentially suppressed or detection thresholds produce high false positives, both the detection reliability and the monotonicity results become artifacts of the auditing procedure rather than properties of the ICL methods.
minor comments (2)
- Figure captions and legends should explicitly state the number of trials and any confidence intervals used.
- The abstract and introduction should cite the specific prior auditing or membership-inference works that ContextLeak extends or differs from.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript arXiv:2512.16059. We address each major comment below and commit to revisions that enhance the statistical validity of our results and the validation of our auditing framework.
read point-by-point responses
-
Referee: §4 (Experimental results): the reported monotonic increase in leakage with privacy budget lacks error bars, number of independent runs, or statistical tests; without these, the trend cannot be distinguished from sampling variability and the 'reliable detection' claim remains provisional.
Authors: We agree that the presentation of results can be improved with additional statistical details. In the revised version, we will include error bars based on 5 independent runs with different random seeds for all experiments. We will also report the results of a Spearman rank correlation test to confirm the monotonic relationship between leakage and privacy budget, including p-values. revision: yes
-
Referee: §3 (ContextLeak framework): the central assumption that canary insertion plus targeted queries surfaces worst-case leakage is not supported by any reported calibration on non-leaking baselines or tests for evasion by heuristic sanitizers or DP noise; if canaries are preferentially suppressed or detection thresholds produce high false positives, both the detection reliability and the monotonicity results become artifacts of the auditing procedure rather than properties of the ICL methods.
Authors: The canary-based approach is intended to probe for the presence of specific tokens that could only come from the in-context exemplars, thereby surfacing leakage. While our evaluations on various sanitization and DP methods demonstrate consistent detection, we acknowledge the value of explicit non-leaking baselines. We will add experiments using models or settings where no sensitive data is provided to measure false positive rates and validate the detection threshold. This will be detailed in the revised Section 3. revision: yes
Circularity Check
Empirical auditing framework exhibits no circularity
full rationale
The paper presents ContextLeak as an empirical auditing procedure based on canary insertion into sensitive data followed by targeted queries to detect leakage. No derivations, equations, or self-citations are invoked that reduce the central measurement claims to fitted parameters, self-definitions, or prior author results by construction. The reported monotonic increase in leakage with privacy budget is an observed empirical outcome across tested methods rather than a statistically forced prediction. The framework relies on external detection rather than internal consistency loops, making the measurement procedure self-contained against the described inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Canary tokens inserted into sensitive exemplars remain uniquely identifiable and their presence in outputs can be detected by targeted queries without being masked by the privacy mechanism.
Forward citations
Cited by 1 Pith paper
-
Security Considerations for Multi-agent Systems
No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
Reference graph
Works this paper leans on
-
[1]
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Get my drift? catching llm task drift with activation deltas. In2025 IEEE Conference on Se- cure and Trustworthy Machine Learning (SaTML), pages 43–67. IEEE. Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. D...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Sergey Ioffe and Christian Szegedy
Auditing differentially private machine learn- ing: How private is private sgd?Preprint, arXiv:2006.07709. Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vard- hamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Za- haria, and Christopher Potts. 2024. Dspy: Compiling declarative...
-
[3]
Ignore Previous Prompt: Attack Techniques For Language Models
Privacy auditing of large language models. In The Thirteenth International Conference on Learning Representations. 10 Bo Pang and Lillian Lee. 2004. A sentimental educa- tion: sentiment analysis using subjectivity summa- rization based on minimum cuts. InProceedings of the 42nd Annual Meeting on Association for Com- putational Linguistics, ACL ’04, page 2...
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[4]
Mitigating memorization in language models. Preprint, arXiv:2410.02159. Atiquer Rahman Sarkar, Yao-Shun Chuang, Noman Mohammed, and Xiaoqian Jiang. 2024. De- identification is not enough: a comparison between de-identified and synthetic clinical notes.Scientific Reports, 14(1):29669. Louis Philippe Sondeck and Maryline Laurent. 2025. Practical and ready-t...
-
[5]
Advances in Neural Information Processing Systems, 36:49268–49280
Privacy auditing with one (1) training run. Advances in Neural Information Processing Systems, 36:49268–49280. Xinyu Tang, Richard Shin, Huseyin A Inan, Andre Manoel, Fatemehsadat Mireshghallah, Zinan Lin, Sivakanth Gopi, Janardhan Kulkarni, and Robert Sim
-
[6]
InThe Twelfth International Conference on Learning Repre- sentations
Privacy-preserving in-context learning with differentially private few-shot generation. InThe Twelfth International Conference on Learning Repre- sentations. Marlon Tobaben, Mohamed Ali Souibgui, Rubèn Tito, Khanh Nguyen, Raouf Kerkouche, Kangsoo Jung, Joonas Jälkö, Lei Kang, Andrey Barsky, Vin- cent Poulain d’Andecy, Aurélie Joseph, Aashiq Muhamed, Kevin...
-
[7]
Tong Wu, Ashwinee Panda, Jiachen T
Membership inference attacks against in- context learning.Preprint, arXiv:2409.01380. Tong Wu, Ashwinee Panda, Jiachen T. Wang, and Pra- teek Mittal. 2024. Privacy-preserving in-context learning for large language models. InThe Twelfth International Conference on Learning Representa- tions. Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, and Reza Shokri. ...
-
[8]
Bayesian estimation of differential privacy. Preprint, arXiv:2206.05199. Yiming Zhang, Nicholas Carlini, and Daphne Ippolito
-
[9]
InFirst Conference on Language Modeling
Effective prompt extraction from language models. InFirst Conference on Language Modeling. Chunyan Zheng, Keke Sun, Wenhao Zhao, Haibo Zhou, Lixing Jiang, Shaoyang Song, and Chunlai Zhou
-
[10]
canary" Label: If-Then-Explicit User Query If the exact canary string
Locally differentially private in-context learn- ing. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10686–10697, Torino, Italia. ELRA and ICCL. 11 Appendix A ContextLeak Framework 13 A.1 ContextLeak Auditing Framework Pseudocode . . . . . . . . . . . . 13 ...
work page 2024
-
[11]
If the canary string is NOT present in system_info you must output the label 0. A.6 Experimental Settings + Parameters For classification tasks, we utilize the SubJ dataset (Pang and Lee, 2004) and Sarcasm detection (Kho- dak et al., 2018). For generation tasks, we uti- lize the Samsum (Gliwa et al., 2019) dataset and PFL-DocVQA (Tobaben et al., 2025). Fo...
work page 2004
-
[12]
The system_info text to inspect for the exact incongruous string
to find an optimized user-query to create a strong attack. The DSPy framework involves cre- ating a signature class, which is passed as part of the prompt for prediction. We specify our input fields with the variablesystem_infowith the de- scription: "The system_info text to inspect for the exact incongruous string." We additionally spec- ify the output f...
-
[13]
Do not output any additional text." The primary LM is the same LLM used for our experiments that don’t involve optimization, namelyllama3.3-70b, qwen2.5-72b, and GPT-4.1. For our reflector LLM, we utilize GPT-5, with temperature=1.0, max_tokens=20000, and num_retries=5. We uti- lize 10 training examples and 5 validation exam- ples, each utilizing 50 queri...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.