
arxiv: 2605.15248 · v1 · pith:ZGQEWN3Rnew · submitted 2026-05-14 · 💻 cs.SE · cs.CR

Probing Privacy Leaks in LLM-based Code Generation via Test Generation

Pith reviewed 2026-05-19 16:16 UTC · model grok-4.3

classification 💻 cs.SE cs.CR
keywords privacy leakage · LLM code generation · test generation · personally identifiable information · prompt engineering · software security · data memorization

The pith

A pipeline using test generation and a privacy feature library detects 2.56 times more privacy leaks in LLM code generation than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method to better detect when large language models for code generation memorize and reproduce personally identifiable information from their training data. Current detection approaches use manually or automatically designed prompts that do not match how such information actually appears in real code. The new pipeline instead simulates practical code generation tasks and uses automatically generated test cases, guided by a library of privacy features, to draw out the leaked data. In experiments across five popular LLMs, the pipeline finds substantially more verified leaks than prior baselines. This matters because it provides a more realistic way to audit privacy risks in widely used code assistants.

Core claim

We propose a pipeline that simulates practical privacy-related code generation scenarios and adopts a test-driven strategy to elicit the memorized information from the generated test cases. We further introduce an automatically constructed privacy feature library that replaces manual prompt engineering by providing realistic templates and examples to guide test case generation. Large-scale experiments on 5 widely used LLMs show that our pipeline exposes more confirmed privacy leakage, achieving a 2.56 times increase in detected leakage compared to existing baselines.
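
The flow described above (task, generated code, generated tests, extracted candidates) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `fake_llm`, `run_pipeline`, and the email regex are hypothetical stand-ins, and a real audit would call an actual model and apply a confirmation step that the abstract does not specify.

```python
import re
from typing import Callable, List

# Hypothetical detector: a simple email pattern standing in for PII extraction.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def fake_llm(prompt: str) -> str:
    """Stand-in for the evaluated model; returns canned text so the sketch runs offline."""
    if prompt.startswith("Write tests"):
        return 'def test_send():\n    assert send_mail("alice@example.org") is True\n'
    return "def send_mail(address):\n    return True\n"


def run_pipeline(task: str, llm: Callable[[str], str]) -> List[str]:
    """Task -> generated code -> generated tests -> candidate PII strings."""
    code = llm(f"Write code for this task:\n{task}")
    tests = llm(f"Write tests with realistic inputs for:\n{code}")
    # Candidates only; they still need independent confirmation before
    # being counted as leaks.
    return sorted(set(EMAIL_RE.findall(tests)))


candidates = run_pipeline("Send a welcome email to a new user.", fake_llm)
print(candidates)
```

The point of the sketch is the ordering: the model is never asked for PII directly, and candidate values are harvested from the test inputs it writes for its own code.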

What carries the argument

A test-driven strategy paired with an automatically constructed privacy feature library that supplies realistic templates and examples to guide test case generation for eliciting memorized PII.
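
One way to picture a library entry and how it could steer test generation is shown below. This is a sketch only: the `PrivacyFeature` fields, the `build_test_prompt` helper, and the prompt wording are assumptions made for illustration, since the summary does not give the library's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PrivacyFeature:
    """Hypothetical library entry: an attribute, a code-like template, and format examples."""
    attribute: str
    template: str
    examples: List[str] = field(default_factory=list)


def build_test_prompt(code: str, feature: PrivacyFeature) -> str:
    """Compose a test-generation prompt that shows a realistic context for the attribute."""
    shown = "\n".join(f"- {ex}" for ex in feature.examples)
    return (
        f"Write unit tests for the code below.\n"
        f"Use realistic {feature.attribute} values in the style of: {feature.template}\n"
        f"Format examples:\n{shown}\n\n"
        f"Code:\n{code}"
    )


# Illustrative entry; the values are reserved or fictional numbers, not real PII.
phone = PrivacyFeature(
    attribute="phone number",
    template='contact = {"phone": "<PHONE>"}',
    examples=["+1-555-0100", "(555) 0101"],
)
prompt = build_test_prompt("def normalize(phone): ...", phone)
print(prompt)
```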

If this is right

  • LLMs can leak more PII under realistic code-generation prompts than ad-hoc prompts reveal.
  • Automatic privacy feature libraries can replace manual prompt design for leakage detection.
  • The test-driven approach scales across multiple LLMs and yields consistently higher detection rates.
  • Confirmed leaks identified this way can guide targeted removal of sensitive data from training sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The detection method could be embedded into continuous auditing tools for deployed code LLMs.
  • Similar test-generation ideas might apply to other memorized content such as security vulnerabilities or copyrighted code.
  • Training pipelines could incorporate generated tests as a regularizer to discourage memorization of PII.

Load-bearing premise

Existing privacy-leakage detection methods rely on ad-hoc prompt construction that does not adequately approximate the real-world contexts in which PII appears in code corpora.

What would settle it

Apply both the new pipeline and baseline methods to the same five LLMs, count the distinct confirmed PII leaks extracted in each case, and check whether the new method produces at least 2.5 times as many verified leaks; failure to do so would falsify the performance claim.
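
The check described above reduces to a ratio of confirmed-leak counts. The sketch below assumes leaks are stored as sets of strings per model and that totals are summed across models; both choices, the function names, and the toy data are hypothetical and are not results from the paper.

```python
from typing import Dict, Set


def total_confirmed(leaks_by_model: Dict[str, Set[str]]) -> int:
    """Sum the number of distinct confirmed leaks found for each model."""
    return sum(len(leaks) for leaks in leaks_by_model.values())


def leak_ratio(new: Dict[str, Set[str]], baseline: Dict[str, Set[str]]) -> float:
    """Ratio of confirmed leaks: new method over baseline."""
    baseline_total = total_confirmed(baseline)
    if baseline_total == 0:
        raise ValueError("baseline found no confirmed leaks; ratio is undefined")
    return total_confirmed(new) / baseline_total


# Hypothetical toy counts for two models, not data from the paper.
new_method = {"model_a": {"p1", "p2", "p3", "p4", "p5"}, "model_b": {"p6", "p7", "p8"}}
baseline = {"model_a": {"p1", "p2"}, "model_b": {"p6", "p9"}}

ratio = leak_ratio(new_method, baseline)
claim_holds = ratio >= 2.5  # threshold taken from the test described above
print(f"ratio={ratio:.2f}, meets 2.5x threshold: {claim_holds}")
```

Whether to sum per-model counts or to deduplicate leaks across models changes the ratio, so a replication would need to state which convention it uses.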

Figures

Figures reproduced from arXiv: 2605.15248 by Chunrong Fang, Juan Zhai, Weisong Sun, Xia Feng, Xiaofang Zhang, Yang Liu, Yifei Ge, Yuchen Chen, Zhenpeng Chen, Zhenyu Chen.

Figure 1. Number of confirmed privacy instances under …
Figure 2. Overview of the privacy leakage evaluation pipeline.
Figure 5. Clustering of template tokens (Λtmp), representing structure-dominated parts of the code. Different colors indicate different privacy attributes.
Figure 6. Figure 6 shows the prompt format used in Section 4.2 to instantiate privacy-related code-generation questions. Given a development scenario s and its associated attribute set A(s), we ask a question-generation model to produce a list of concrete coding tasks that naturally operate on these attributes. The resulting questions serve as the inputs to the evaluated LLM in the next stage, ensuring that privacy attribut…
Figure 7. An example of generating a code snippet involving privacy attributes, followed by test cases generated for it that contain potential privacy content.
Figure 8. Comparison between Human and Judge LLM.
Figure 9. Figure 9 visualizes the ablation study reported in the main text by comparing leakage outcomes with and without the privacy feature library (FL). Without FL, the model is more likely to generate low-information or placeholder-like inputs, which are less likely to survive strict verification, leading to fewer confirmed leaks overall. This effect is especially pronounced for attributes with stricter or less intuitiv…
Original abstract

The widespread availability of large-scale code datasets has fueled the rapid development of large language models (LLMs) for code-related tasks. These datasets may include sensitive personally identifiable information (PII), which can lead to privacy leakage when LLMs memorize and reproduce it. However, existing privacy-leakage detection methods rely on ad-hoc prompt construction (manually or automatically designed). Therefore, they do not adequately approximate the real-world contexts in which PII appears in code corpora, making it difficult to extract realistic privacy leakage. In this paper, we propose a pipeline that simulates practical privacy-related code generation scenarios and adopts a test-driven strategy to elicit the memorized information from the generated test cases. We further introduce an automatically constructed privacy feature library that replaces manual prompt engineering by providing realistic templates and examples to guide test case generation. Large-scale experiments on 5 widely used LLMs show that our pipeline exposes more confirmed privacy leakage, achieving a 2.56 times increase in detected leakage compared to existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a test-driven pipeline augmented by an automatically constructed privacy feature library to simulate realistic code-generation contexts containing PII. It reports large-scale experiments on five LLMs showing that the pipeline elicits 2.56 times more confirmed privacy leaks than existing ad-hoc baseline methods.

Significance. If the confirmation procedure is shown to be independent of the generation method and able to distinguish memorization from plausible generation, the work would meaningfully advance privacy auditing for code LLMs by replacing manual prompt engineering with a more systematic, test-driven approach. The scale of the evaluation across five models is a clear strength.

major comments (1)
  1. [Abstract and experimental results] Abstract and experimental results section: the central claim of a 2.56× increase in 'confirmed privacy leakage' rests on an unspecified confirmation procedure. No description is given of the exact matching criterion (string match against a known PII corpus, membership inference, manual review, or heuristic), whether the same procedure was applied uniformly to baselines, or how it rules out plausible PII-containing code rather than regurgitated training examples. This directly affects whether the measured improvement can be attributed to better elicitation of real-world contexts.
minor comments (1)
  1. [Abstract] The abstract states that the privacy feature library 'replaces manual prompt engineering' but does not clarify whether any manual curation was still required for the library templates themselves.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding the confirmation procedure below and will revise the paper accordingly to improve clarity.

Point-by-point responses
  1. Referee: [Abstract and experimental results] Abstract and experimental results section: the central claim of a 2.56× increase in 'confirmed privacy leakage' rests on an unspecified confirmation procedure. No description is given of the exact matching criterion (string match against a known PII corpus, membership inference, manual review, or heuristic), whether the same procedure was applied uniformly to baselines, or how it rules out plausible PII-containing code rather than regurgitated training examples. This directly affects whether the measured improvement can be attributed to better elicitation of real-world contexts.

    Authors: We agree that the confirmation procedure requires more explicit description in the abstract and experimental results section. In the revised manuscript we will add a dedicated paragraph detailing that confirmation relies on exact string matching against the specific PII instances stored in the automatically constructed privacy feature library. The identical matching criterion is applied uniformly to outputs from our pipeline and all baseline methods. The test-driven design further reduces the chance of counting plausible but non-memorized PII by requiring the generated code to reproduce the exact library-derived PII inside the realistic context supplied by the test case; we will expand the discussion to clarify why this targets regurgitation more directly than ad-hoc prompting.

    revision: yes
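
The matching rule described in this simulated response, exact string matching against a fixed set of known PII applied identically to every method's outputs, could look roughly like the sketch below. The function name `confirm_leaks` and all sample strings are hypothetical, and the sketch makes no claim about how the paper itself performs confirmation.

```python
from typing import Iterable, Set


def confirm_leaks(outputs: Iterable[str], known_pii: Set[str]) -> Set[str]:
    """Return the known PII strings that appear verbatim in at least one output."""
    confirmed: Set[str] = set()
    for text in outputs:
        for item in known_pii:
            if item in text:  # exact, case-sensitive substring match
                confirmed.add(item)
    return confirmed


# Illustrative reference set and outputs; none of these are real PII.
known = {"alice@example.org", "+1-555-0100"}
pipeline_outputs = ['assert notify("alice@example.org")', 'call("+1-555-0100")']
baseline_outputs = ['email = "user@example.com"']

# The same function and the same reference set are used for both methods.
pipeline_confirmed = confirm_leaks(pipeline_outputs, known)
baseline_confirmed = confirm_leaks(baseline_outputs, known)
print(len(pipeline_confirmed), len(baseline_confirmed))
```

Using one reference set and one matcher for every method keeps the counts comparable, but exact matching alone does not show that a matched string was memorized rather than supplied through the prompt, which is the concern the referee raises.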

Circularity Check

0 steps flagged

Empirical comparison of leakage detection pipelines with no self-referential derivation

full rationale

The paper presents an empirical pipeline for eliciting privacy leaks via test generation and an automatically constructed feature library, then reports a measured 2.56× increase in confirmed leaks over baselines across five LLMs. No equations, fitted parameters renamed as predictions, or self-definitional steps appear. The central claim rests on experimental counts of confirmed leakage rather than on any derivation that reduces to its own inputs by construction. The abstract does not describe the confirmation procedure, so its independence from the generation method is taken from the paper's framing rather than verified here. The comparison is made against external baselines, with no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that LLMs memorize PII from training data and that simulated test cases can reliably elicit it without introducing artifacts.

axioms (2)
  • domain assumption: LLMs trained on code corpora containing PII will memorize and reproduce that information under appropriate prompting.
    Stated in the opening of the abstract as the motivation for the work.
  • domain assumption: Test generation can approximate real-world code contexts sufficiently to extract memorized PII.
    Core premise of the proposed pipeline.

pith-pipeline@v0.9.0 · 5729 in / 1223 out tokens · 59826 ms · 2026-05-19T16:16:11.816549+00:00 · methodology


