
arxiv: 2512.16059 · v2 · submitted 2025-12-18 · 💻 cs.CR · cs.CL

ContextLeak: Auditing Leakage in Private In-Context Learning Methods

Pith reviewed 2026-05-16 22:05 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords: leakage, methods, contextleak, information, private, sensitive, across, exemplars

The pith

ContextLeak is the first empirical framework to audit worst-case information leakage in private in-context learning by inserting identifiable canary tokens and measuring their presence in model outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can learn from examples placed in their prompts, but if those examples contain private data like medical records, the model might accidentally reveal that data in its answers. ContextLeak tests this risk by hiding special unique words called canaries inside the private examples. It then asks the model specific questions designed to make any leaked canary words appear in the output. The authors tested this on several privacy techniques, including simple prompt changes and methods with mathematical privacy guarantees. They found that leakage could still be detected and that it got worse as the allowed privacy budget increased. The work also shows that many existing privacy approaches either let too much information out or hurt the model's performance too much.
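The canary procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`make_canary`, `insert_canary`, `leaked`, `audit`), the canary format, and the two toy "models" are all assumptions made for this example, and exact substring matching is only one possible detection rule.

```python
import secrets


def make_canary(n_bytes: int = 8) -> str:
    """Return a random string that is very unlikely to occur naturally."""
    return "canary-" + secrets.token_hex(n_bytes)


def insert_canary(exemplars: list[str], canary: str, position: int = 0) -> list[str]:
    """Return a copy of the exemplars with one canary-bearing record added."""
    out = list(exemplars)
    out.insert(position, f"Note: the reference code is {canary}.")
    return out


def leaked(output: str, canary: str) -> bool:
    """Exact-match detection: did the canary string surface in the output?"""
    return canary.lower() in output.lower()


def audit(model, exemplars: list[str], canary: str, query: str, trials: int = 20) -> float:
    """Fraction of trials in which the canary appears in the model output."""
    context = insert_canary(exemplars, canary)
    hits = sum(leaked(model(context, query), canary) for _ in range(trials))
    return hits / trials


def leaky_model(context: list[str], query: str) -> str:
    # Toy stand-in (not a real LLM): parrots its whole context back.
    return " ".join(context)


def guarded_model(context: list[str], query: str) -> str:
    # Toy stand-in: always answers with a fixed label, revealing nothing.
    return "label: 0"
```

With these stand-ins, `audit(leaky_model, ...)` reports a leak on every trial and `audit(guarded_model, ...)` reports none. A real audit would replace the stand-ins with calls to the private ICL method under test.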

Core claim

We introduce ContextLeak, the first framework to empirically measure the worst-case information leakage in ICL. We show that ContextLeak reliably detects leakage across methods, and the leakage increases monotonically with the theoretical privacy budget.

Load-bearing premise

That canary insertion combined with targeted queries can reliably surface worst-case leakage without being evaded by the privacy mechanisms or producing high false-positive rates in detection.

Figures

Figures reproduced from arXiv: 2512.16059 by Amin Banayeeanzade, Jacob Choi, Robin Jia, Sai Praneeth Karimireddy, Shuying Cao, Wang Bill Zhu, Xingjian Dong.

Figure 1: Threat model. Sensitive data (such as patient medical records or customer conversations) in ICL can be exposed to end users if they input an adversarial user prompt. A malicious user can input arbitrary user prompt in an attempt to extract the sensitive dataset. We want to prevent the user from learning even membership for a worst-case data-point, i.e., bounding the probability of a successful membership…
Figure 2: General auditing methodology. We design a canary (a uniquely identifiable data point), and a specific…
Figure 3: We compare our attack with a prompt-injection attack, which asks the model to ignore all defense-based…
Figure 4: Comparison of the auditing performance between the varying user-query strategies and the different canary…
Figure 5: The user-query method is optimized for the strongest attack from Figure…
Figure 6: Privacy leakage for RNM over datasets SubJ…
Figure 7: Privacy leakage for ESA over datasets Sam…
Figure 9: Privacy-Utility Tradeoff of ESA for Samsum…
Figure 8: Privacy-Utility Tradeoff of RNM for SubJ…
Figure 11: We fix the number of context examples (20),…
Figure 13: (Left) ESA private aggregation method. It…
Figure 12: RNM Auditing. We identify the privacy leakage by comparing output class distributions with and without canaries to measure the distinguishability between the two conditions. The user query is designed to increase predicting an otherwise rare class (here class 1). This creates two distributions of class 1 logits with and without the canary. We pick a threshold to maximize accuracy - if the class 1 logit is…
Figure 15: Auditing performance using gpt4.1-mini, llama3.1-8b, and qwen2.5-7b. Although there is a larger utility improvement between 0-shot and adding context compared to the larger 70b variants, auditing capabilities in smaller models are much more limited, with privacy leakage being significantly lower compared to larger models. 100 queries were run on the SubJ dataset. Utility for the SubJ dataset is measured i…
Figure 14: We vary the number of runs across theoretical…
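
The Figure 12 caption describes picking a threshold on the class 1 logit to separate runs with and without the canary. A rough sketch of that selection step follows. The function name `best_threshold` and the decision rule "predict canary present when the score is at or above the threshold" are assumptions for illustration, and the paper's exact procedure may differ.

```python
def best_threshold(with_canary: list[float], without_canary: list[float]) -> tuple[float, float]:
    """Pick the cut on a scalar score that maximizes classification accuracy.

    Predict "canary present" when score >= threshold. Candidate thresholds
    are the observed scores; ties keep the smallest candidate.
    """
    candidates = sorted(set(with_canary) | set(without_canary))
    total = len(with_canary) + len(without_canary)
    best_t, best_acc = candidates[0], 0.0
    for t in candidates:
        true_pos = sum(s >= t for s in with_canary)
        true_neg = sum(s < t for s in without_canary)
        acc = (true_pos + true_neg) / total
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

An accuracy near 0.5 means the two conditions are hard to tell apart, while an accuracy near 1.0 means the canary's presence is easy to detect. Selecting the threshold on one set of runs and reporting accuracy on a separate set avoids an optimistic estimate.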
Original abstract

In-Context Learning (ICL) has become a standard technique for adapting Large Language Models (LLMs) to specialized tasks by supplying task-specific exemplars within the prompt. However, when these exemplars contain sensitive information, reliable privacy-preserving mechanisms are essential to prevent unintended leakage through model outputs. Many privacy-preserving methods have been proposed to protect against information leakage in this context, but there are fewer efforts on how to audit these methods. We introduce ContextLeak, the first framework to empirically measure the worst-case information leakage in ICL. ContextLeak uses canary insertion, embedding uniquely identifiable tokens in the sensitive dataset and crafting targeted queries to detect their presence. We apply ContextLeak across a range of private ICL techniques, including both heuristic prompt-based defenses and differentially private methods with formal guarantees. We show that ContextLeak reliably detects leakage across methods, and the leakage increases monotonically with the theoretical privacy budget, offering a practical signal of worst-case privacy risk. Our analysis further reveals that existing methods strike poor privacy-utility trade-offs, either completely leaking sensitive information or severely degrading performance.
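
One standard way to relate a measured detection rate to a theoretical privacy budget is the hypothesis-testing view of differential privacy. The bound below is the generic textbook relation, not necessarily the estimator the paper uses. The symbols FPR (rate of declaring "canary present" when it is absent), FNR (rate of missing a canary that is present), and the failure probability δ are introduced here for illustration. For any test that distinguishes the two cases against an (ε, δ)-differentially private mechanism:

```latex
\[
\mathrm{FPR} + e^{\varepsilon}\,\mathrm{FNR} \ge 1 - \delta,
\qquad
\mathrm{FNR} + e^{\varepsilon}\,\mathrm{FPR} \ge 1 - \delta,
\]
\[
\text{so}\quad
\varepsilon \;\ge\; \max\left\{
  \ln\frac{1 - \delta - \mathrm{FPR}}{\mathrm{FNR}},\;
  \ln\frac{1 - \delta - \mathrm{FNR}}{\mathrm{FPR}}
\right\}.
\]
```

Under this reading, an audit yields an empirical lower bound on ε, which should sit at or below the theoretical budget if the mechanism is implemented correctly.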

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ContextLeak, the first empirical framework for auditing worst-case information leakage in private in-context learning (ICL) for LLMs. It inserts unique canary tokens into sensitive exemplars and uses targeted queries to detect their presence in model outputs. The framework is applied to heuristic prompt sanitization methods and differentially private ICL techniques; results show consistent leakage detection that increases monotonically with the theoretical privacy budget, while existing methods exhibit poor privacy-utility trade-offs.

Significance. If the empirical claims are substantiated, ContextLeak supplies a practical, canary-based auditing tool for a rapidly growing class of LLM applications that rely on sensitive in-context exemplars. The monotonicity result and the demonstration of inadequate trade-offs in current defenses would be useful signals for both practitioners and future mechanism designers.

major comments (2)
  1. §4 (Experimental results): the reported monotonic increase in leakage with privacy budget lacks error bars, number of independent runs, or statistical tests; without these, the trend cannot be distinguished from sampling variability and the 'reliable detection' claim remains provisional.
  2. §3 (ContextLeak framework): the central assumption that canary insertion plus targeted queries surfaces worst-case leakage is not supported by any reported calibration on non-leaking baselines or tests for evasion by heuristic sanitizers or DP noise; if canaries are preferentially suppressed or detection thresholds produce high false positives, both the detection reliability and the monotonicity results become artifacts of the auditing procedure rather than properties of the ICL methods.
minor comments (2)
  1. Figure captions and legends should explicitly state the number of trials and any confidence intervals used.
  2. The abstract and introduction should cite the specific prior auditing or membership-inference works that ContextLeak extends or differs from.
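
The calibration and confidence-interval requests above could be met in several ways. One simple option, shown here as an assumption rather than anything the paper reports, is to run the detector on canary-free contexts, count false alarms, and attach a Wilson score interval to the resulting false-positive rate. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

For instance, `wilson_interval(false_alarms, n_clean_runs)` gives a range for the false-positive rate that stays informative even when zero false alarms are observed, which a plain proportion would report as exactly 0.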

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript arXiv:2512.16059. We address each major comment below and commit to revisions that enhance the statistical validity of our results and the validation of our auditing framework.

Point-by-point responses
  1. Referee: §4 (Experimental results): the reported monotonic increase in leakage with privacy budget lacks error bars, number of independent runs, or statistical tests; without these, the trend cannot be distinguished from sampling variability and the 'reliable detection' claim remains provisional.

    Authors: We agree that the presentation of results can be improved with additional statistical details. In the revised version, we will include error bars based on 5 independent runs with different random seeds for all experiments. We will also report the results of a Spearman rank correlation test to confirm the monotonic relationship between leakage and privacy budget, including p-values. revision: yes

  2. Referee: §3 (ContextLeak framework): the central assumption that canary insertion plus targeted queries surfaces worst-case leakage is not supported by any reported calibration on non-leaking baselines or tests for evasion by heuristic sanitizers or DP noise; if canaries are preferentially suppressed or detection thresholds produce high false positives, both the detection reliability and the monotonicity results become artifacts of the auditing procedure rather than properties of the ICL methods.

    Authors: The canary-based approach is intended to probe for the presence of specific tokens that could only come from the in-context exemplars, thereby surfacing leakage. While our evaluations on various sanitization and DP methods demonstrate consistent detection, we acknowledge the value of explicit non-leaking baselines. We will add experiments using models or settings where no sensitive data is provided to measure false positive rates and validate the detection threshold. This will be detailed in the revised Section 3. revision: yes
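
The Spearman rank correlation test mentioned in the first response can be sketched without external libraries. This is an illustrative implementation under stated assumptions: the helper names are hypothetical, the p-value is a one-sided exact permutation test that is only practical for a handful of budget settings, and constant inputs (zero variance) are not handled.

```python
from itertools import permutations


def ranks(values: list[float]) -> list[float]:
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def exact_p_value(x: list[float], y: list[float]) -> float:
    """One-sided exact permutation p-value for rho >= observed (small n only)."""
    rx, ry = ranks(x), ranks(y)
    observed = pearson(rx, ry)
    perms = list(permutations(ry))
    count = sum(pearson(rx, list(p)) >= observed - 1e-12 for p in perms)
    return count / len(perms)
```

As a usage sketch with made-up placeholder numbers (not results from the paper), `spearman([0.5, 1, 2, 4, 8], [0.52, 0.55, 0.61, 0.70, 0.93])` gives a rho of 1 because the second list is strictly increasing, and `exact_p_value` on the same inputs returns 1/120, the smallest value five points allow.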

Circularity Check

0 steps flagged

Empirical auditing framework exhibits no circularity

full rationale

The paper presents ContextLeak as an empirical auditing procedure based on canary insertion into sensitive data followed by targeted queries to detect leakage. No derivations, equations, or self-citations are invoked that reduce the central measurement claims to fitted parameters, self-definitions, or prior author results by construction. The reported monotonic increase in leakage with privacy budget is an observed empirical outcome across tested methods rather than a statistically forced prediction. The framework relies on external detection rather than internal consistency loops, making the measurement procedure self-contained against the described inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that uniquely identifiable canary tokens can be reliably detected in model outputs via targeted queries, and that this detection serves as a faithful proxy for worst-case sensitive information leakage. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Canary tokens inserted into sensitive exemplars remain uniquely identifiable and their presence in outputs can be detected by targeted queries without being masked by the privacy mechanism.
    Invoked in the description of the auditing procedure; if false, the detection signal would not correspond to actual leakage.

pith-pipeline@v0.9.0 · 5510 in / 1314 out tokens · 33418 ms · 2026-05-16T22:05:48.342012+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Security Considerations for Multi-agent Systems

    cs.CR 2026-03 unverdicted novelty 6.0

    No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
