pith. machine review for the scientific record.

arxiv: 2604.17814 · v1 · submitted 2026-04-20 · 💻 cs.CR · cs.AI

Recognition: unknown

Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:57 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords code LLMs · secret leakage · BPE tokenization · memorization · gibberish bias · token entropy · cybersecurity · tokenizer design

The pith

BPE tokenization creates a gibberish bias that makes certain high-entropy secrets among the easiest for code LLMs to memorize and leak.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that Byte-Pair Encoding tokenization produces an unexpected bias in how code large language models memorize secrets. Secrets that register as high-entropy at the character level but low-entropy at the token level turn out to be among the most readily memorized. A sympathetic reader cares because this identifies a concrete mechanism behind secret leakage in AI coding assistants and ties it to a mismatch between training token distributions and the structure of real secrets. The work backs the bias with measurements and examines how it scales with the ongoing push toward larger vocabularies. It closes by outlining mitigation approaches and the resulting pressure on tokenizer design choices.

Core claim

The central claim is that BPE tokenization leads to gibberish bias in CLLM secret memorization: certain secrets become the easiest to memorize because they exhibit high character-level entropy yet low token-level entropy. This pattern is traced to a token distribution shift between the models' training data and the secret strings themselves. The authors present numerical evidence for the bias, describe its behavior under the larger-vocabulary trend, and outline mitigation strategies together with implications for how tokenizers should be designed for code models.
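The abstract never formalizes the two entropies, and the referee's first minor comment below asks for exactly that. One plausible formalization, an editorial assumption rather than the paper's own definition, reads both as empirical Shannon entropies over a single secret string s:

    H_{\mathrm{char}}(s) = -\sum_{c \in \mathrm{chars}(s)} \hat{p}_s(c)\,\log_2 \hat{p}_s(c)
    H_{\mathrm{tok}}(s)  = -\sum_{t \in \mathrm{BPE}(s)} \hat{q}_s(t)\,\log_2 \hat{q}_s(t)

where \hat{p}_s and \hat{q}_s are the empirical frequencies of characters and of BPE tokens within s. Gibberish bias then names the regime where H_char(s) is high but H_tok(s) is low.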

What carries the argument

Gibberish bias: the preferential memorization of secrets that have high character-level entropy but low token-level entropy, driven by token distribution shift between CLLM training data and secret data.
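A minimal sketch of how that contrast can be measured, assuming tiktoken's cl100k_base as a stand-in tokenizer and a hypothetical secret string (the paper's exact procedure is not given in the abstract):

    import math
    from collections import Counter

    import tiktoken  # assumed stand-in for a CLLM tokenizer

    def shannon_entropy(symbols):
        """Empirical Shannon entropy, in bits per symbol, of a sequence."""
        counts = Counter(symbols)
        total = sum(counts.values())
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    enc = tiktoken.get_encoding("cl100k_base")

    # Hypothetical secret: random-looking at the character level, yet BPE
    # may still carve it into a handful of common, repeated tokens.
    secret = "ghp_abcdef1234567890abcdef1234567890abcd"

    h_char = shannon_entropy(secret)             # over characters
    h_tok = shannon_entropy(enc.encode(secret))  # over BPE token ids

    print(f"character-level entropy: {h_char:.2f} bits/char")
    print(f"token-level entropy:     {h_tok:.2f} bits/token")
    # High h_char together with low h_tok flags the secret as a candidate
    # for the preferential memorization the paper calls gibberish bias.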

If this is right

  • Secrets that appear random at character level but align with common training tokens face elevated leakage risk through CLLMs.
  • The bias is expected to strengthen as token vocabularies continue to grow.
  • Mitigation can target tokenizer choices or post-processing steps that raise token-level entropy of secrets.
  • Tokenizer design for code models must now account for unintended memorization pathways created by distribution shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mechanism may increase leakage risk for any high-entropy strings whose token sequences match training patterns, not only passwords.
  • Secret generation practices could deliberately maximize token entropy to close off this particular memorization route; see the sketch after this list.
  • Character-aware or hybrid tokenization schemes might be tested to reduce the gap between character and token entropy.
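A minimal sketch of that generation scheme, with the tokenizer and the entropy threshold both illustrative assumptions rather than values from the paper:

    import math
    import string
    from collections import Counter
    from secrets import choice  # CSPRNG-backed random choice

    import tiktoken  # assumed stand-in for a CLLM tokenizer

    def token_entropy(enc, s):
        """Empirical Shannon entropy (bits/token) of s under the tokenizer."""
        ids = enc.encode(s)
        counts = Counter(ids)
        total = len(ids)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    def generate_secret(length=40, min_bits_per_token=3.5, max_tries=1000):
        """Rejection-sample random strings until one clears the (hypothetical)
        token-entropy threshold, so it cannot ride on a few common tokens."""
        enc = tiktoken.get_encoding("cl100k_base")
        alphabet = string.ascii_letters + string.digits
        for _ in range(max_tries):
            candidate = "".join(choice(alphabet) for _ in range(length))
            if token_entropy(enc, candidate) >= min_bits_per_token:
                return candidate
        raise RuntimeError("no candidate met the token-entropy threshold")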

Load-bearing premise

The observed bias stems primarily from the token distribution shift between training data and secret strings and can be isolated from confounding effects of model scale or overall training data composition.

What would settle it

Train CLLMs on data whose token distribution matches that of typical secrets and check whether the preference for low token-entropy secrets disappears.
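One way to operationalize that check, stated as an editorial assumption about the setup rather than the paper's protocol: measure the shift as a KL divergence between the unigram token distributions of the secret set and the training corpus. Under the proposed control, the shift should approach zero, and the preference for low token-entropy secrets should vanish with it. The tokenizer and smoothing constant below are illustrative:

    import math
    from collections import Counter

    import tiktoken  # assumed stand-in for a CLLM tokenizer

    enc = tiktoken.get_encoding("cl100k_base")

    def smoothed_token_dist(texts, alpha=1e-6):
        """Additively smoothed unigram distribution over token ids."""
        counts = Counter()
        for text in texts:
            counts.update(enc.encode(text))
        total = sum(counts.values()) + alpha * enc.n_vocab
        dist = {tok: (n + alpha) / total for tok, n in counts.items()}
        return dist, alpha / total  # distribution and unseen-token floor

    def kl_bits(p_texts, q_texts):
        """Approximate KL(P || Q) in bits over tokens seen in either corpus."""
        p, p_floor = smoothed_token_dist(p_texts)
        q, q_floor = smoothed_token_dist(q_texts)
        support = set(p) | set(q)
        return sum(
            p.get(t, p_floor) * math.log2(p.get(t, p_floor) / q.get(t, q_floor))
            for t in support
        )

    # Usage with hypothetical corpora:
    # shift = kl_bits(secret_strings, training_snippets)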

Figures

Figures reproduced from arXiv: 2604.17814 by Huang Nianchen, Meifang Chen, Michael R. Lyu, Yichen Li, Yizhan Huang, Zhe Yang, Zihan Li.

Figure 1. The risk map of secret leakage through CLLMs. The red box highlights the focus of this paper.

Figure 3. Visualizing how Qwen2.5-Coder (left) and …

Figure 5. Token distributions on the subsampled Stack …

Figure 6. Mitigation strategy visualized for code secrets. Recent research (Vieira et al., 2025) demonstrates that token-level models can be reinterpreted at the character level through a search-based alignment algorithm. By reconstructing conditional distributions over characters and employing beam search pruning for efficiency, this approach provides a principled means to approximate character-level behavior, there…

Figure 7. 2-char sub-string tokenization on unigram …
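The Figure 6 caption compresses the Vieira et al. (2025) idea of reading a token-level model at the character level. Below is a minimal illustration of the marginalization step alone, assuming a toy next_token_probs dict and a vocab list mapping token ids to surface strings; the published method additionally runs a search-based alignment over token segmentations with beam pruning, which this sketch omits:

    from collections import defaultdict

    def char_conditional(next_token_probs, vocab):
        """Collapse a next-token distribution into a next-character one by
        summing probability over tokens that share a first character."""
        by_char = defaultdict(float)
        for tok_id, p in next_token_probs.items():
            surface = vocab[tok_id]  # token id -> surface string
            if surface:
                by_char[surface[0]] += p
        total = sum(by_char.values())
        return {c: p / total for c, p in by_char.items()}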
Original abstract

Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. Amid the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs have been shown to inadvertently leak such secrets due to a notorious memorization phenomenon. This study first reveals that Byte-Pair Encoding (BPE) tokenization leads to unexpected secret-memorization behavior, which we term "gibberish bias". Specifically, we identified that some secrets are among the easiest for CLLMs to memorize: these secrets yield high character-level entropy but low token-level entropy. This paper then supports the bias claim with numerical data. We identified that the root of the bias is the token distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the "larger vocabulary" trend. To conclude the paper, we discuss potential mitigation strategies and the broader implications for current tokenizer design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Byte-Pair Encoding (BPE) tokenization in Code Large Language Models (CLLMs) produces a 'gibberish bias' in secret memorization: secrets exhibiting high character-level entropy but low token-level entropy are among the easiest for models to memorize. It reports numerical evidence for this bias, attributes the root cause to token distribution shift between CLLM training corpora and secret data, examines how the bias interacts with the trend toward larger vocabularies, and outlines potential mitigation strategies.

Significance. If the central attribution to tokenization-induced distribution shift can be isolated from confounders, the result would be significant for understanding memorization risks in code assistants and for informing tokenizer design choices that reduce leakage of high-entropy secrets. The work highlights a concrete, measurable interaction between token boundaries and memorization difficulty that is not currently emphasized in the security or LLM literature.

major comments (2)
  1. [Abstract] Abstract and experimental sections: the claim that numerical data supports attribution of the bias to token distribution shift is load-bearing, yet the abstract provides no description of datasets, exclusion criteria, statistical controls, or ablations (e.g., fixed-scale models or synthetic corpora with matched n-gram statistics but altered token boundaries). Without these, the isolation of tokenization as the primary cause cannot be verified and risks circularity with the entropy-based definition of the bias itself.
  2. [Results / Discussion (token distribution shift analysis)] The weakest assumption—that token distribution shift can be isolated as the primary driver without confounding from model scale or training-data composition—is not addressed by any reported control experiments. If such controls are absent, the central causal claim remains unproven even if the correlation between low token-entropy secrets and higher memorization rates holds.
minor comments (2)
  1. [Introduction] The term 'gibberish bias' is introduced without an explicit formal definition or pseudocode for computing the character- versus token-level entropy contrast; a short definition box or equation would improve clarity.
  2. [Figures] Figure captions and axis labels for any entropy or memorization-rate plots should explicitly state the tokenization scheme, model sizes, and secret-selection criteria used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key opportunities to strengthen the causal claims in our work on gibberish bias. We address each major comment below and describe the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Abstract] Abstract and experimental sections: the claim that numerical data supports attribution of the bias to token distribution shift is load-bearing, yet the abstract provides no description of datasets, exclusion criteria, statistical controls, or ablations (e.g., fixed-scale models or synthetic corpora with matched n-gram statistics but altered token boundaries). Without these, the isolation of tokenization as the primary cause cannot be verified and risks circularity with the entropy-based definition of the bias itself.

    Authors: We agree that the abstract should supply more context to support the attribution. In the revised manuscript we will expand the abstract to include a concise description of the secret datasets and training corpora employed, the exclusion criteria applied to high-entropy secrets, and the statistical procedures used to compute memorization rates and entropies. We will also clarify that token-level entropy is obtained by applying the fixed CLLM tokenizer to each secret string, while character-level entropy is computed independently via Shannon entropy over characters; this separation avoids direct circularity. Full synthetic-corpus ablations with matched n-gram statistics lie beyond the current experiments, but we will add an explicit limitations paragraph noting this gap and outlining how such controls could be constructed in follow-up work. revision: yes

  2. Referee: [Results / Discussion (token distribution shift analysis)] The weakest assumption—that token distribution shift can be isolated as the primary driver without confounding from model scale or training-data composition—is not addressed by any reported control experiments. If such controls are absent, the central causal claim remains unproven even if the correlation between low token-entropy secrets and higher memorization rates holds.

    Authors: The referee correctly notes the absence of explicit controls for model scale and training-data composition. Our current results demonstrate a consistent correlation between low token-entropy secrets and elevated memorization rates, together with direct measurements of token-distribution shift between the CLLM training corpora and the secret strings. In the revision we will insert a new subsection that enumerates potential confounders (model scale, training-data composition) and discusses how each could interact with the observed bias. Where possible we will re-analyze existing results across the multiple model sizes already evaluated in the paper. We acknowledge that definitive isolation would require additional controlled experiments (e.g., fixed-scale models trained on synthetically altered corpora); these are noted as future work rather than claimed as completed. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines gibberish bias via observed differences in character-level vs. token-level entropy for secrets that are easiest to memorize, then attributes the pattern to token distribution shift on the basis of numerical data presented in the manuscript. No equations, self-citations, or fitted-parameter renamings are exhibited that reduce the central attribution to a tautology or to the input observations by construction. The derivation therefore remains self-contained with empirical support that is not forced by the definitions themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on entropy as a proxy for memorization difficulty and on the existence of a measurable token distribution shift that explains the bias; these are treated as domain assumptions without independent derivation in the abstract.

axioms (2)
  • domain assumption Secrets can be meaningfully distinguished by comparing character-level entropy to token-level entropy under BPE
    Used to identify which secrets exhibit the bias
  • domain assumption Memorization behavior in CLLMs is primarily driven by token distribution mismatch with training data
    Core attribution for the observed bias
invented entities (1)
  • gibberish bias · no independent evidence
    purpose: To name and explain the unexpected ease of memorizing certain high-entropy secrets under BPE tokenization
    New term introduced to describe the phenomenon

pith-pipeline@v0.9.0 · 5484 in / 1426 out tokens · 34814 ms · 2026-05-10T04:57:16.398912+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 20 canonical work pages · 6 internal anchors

  1. What's In My Big Data? The Twelfth International Conference on Learning Representations.
  2. Harvesting developer credentials in Android apps. Proceedings of the 8th ACM Conference on Security & Privacy in Wireless and Mobile Networks.
  3. Automated detection of password leakage from public GitHub repositories. Proceedings of the 44th International Conference on Software Engineering.
  4. How bad can it git? Characterizing secret leakage in public GitHub repositories. NDSS.
  5. Are Large Pre-Trained Language Models Leaking Your Personal Information? ICML 2022 Workshop on Knowledge Retrieval and Language Models.
  6. Analyzing Leakage of Personally Identifiable Information in Language Models. 2023 IEEE Symposium on Security and Privacy (SP).
  7. Zhang, Yue; Yang, Yuqing; Lin, Zhiqiang. Don't Leak Your Keys: Understanding, Measuring, and Exploiting the AppSecret Leaks in Mini-Programs. 2023. doi:10.1145/3576915.3616591
  8. Huang, Yizhan; Li, Yichen; Wu, Weibin; Zhang, Jianping; Lyu, Michael R. 2024. doi:10.1145/3660818
  9. The Skeleton Keys: A Large-Scale Analysis of Credential Leakage in Mini-Apps. Proceedings of the Network and Distributed System Security Symposium (NDSS).
  10. Svyatkovskiy, Alexey; Deng, Shao Kun; Fu, Shengyu; Sundaresan, Neel. Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020.
  11. NatGen: Generative pre-training by "naturalizing" source code. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
  12. Li, Jia; Li, Yongmin; Li, Ge; Jin, Zhi; Hao, Yiyang; Hu, Xing. Proceedings of the 45th International Conference on Software Engineering, 2023.
  13. GitHub Copilot research recitation. GitHub Blog, 2021.
  14. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
  15. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.
  16. Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs. arXiv preprint arXiv:2402.14903.
  17. Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186.
  18. StarCoder 2 and The Stack v2: The Next Generation. arXiv preprint.
  19. GitHub surpasses over 15 million users: Microsoft. Entrepreneur.
  20. DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196.
  21. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv preprint arXiv:2403.07974.
  22. Evaluating Large Language Models Trained on Code. 2021.
  23. Language Models are Unsupervised Multitask Learners.
  24. Mistral 7B. 2023.
  25. Yang, Zhou; Zhao, Zhipeng; Wang, Chenyu; Shi, Jieke; Kim, Dongsun; Han, Donggyun; Lo, David. 2024. doi:10.1145/3597503.3639074
  26. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. 2022 IEEE Symposium on Security and Privacy (SP).
  27. Basak, Setu Kumar; Neil, Lorenzo; Reaves, Bradley; Williams, Laurie. 2023. doi:10.1109/MSR59073.2023.00053
  28. GitHub. The GitHub Blog, 2023.
  29. Amazon, 2023.
  30. Yizhan Huang; Zhe Yang; Meifang Chen; Huang Nianchen; Jianping Zhang; Michael R. Lyu. Data Compressibility Quantifies … 2026.
  31. Behind GitHub's new authentication token formats. The GitHub Blog, 2021.
  32. Shannon, Claude Elwood. ACM SIGMOBILE Mobile Computing and Communications Review.
  33. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies. The Thirty-eighth Annual Conference on Neural Information Processing Systems.
  34. Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling. Forty-second International Conference on Machine Learning.
  35. Compression Represents Intelligence Linearly. First Conference on Language Modeling.
  36. Denis Kocetkov; Raymond Li; Loubna Ben Allal; Jia Li; Chenghao Mou; Yacine Jernite; Margaret Mitchell; Carlos Mu… The Stack: 3 … Transactions on Machine Learning Research, 2023.
  37. Your code secret belongs to me: Neural code completion tools can memorize hard-coded credentials. Proceedings of the ACM on Software Engineering, 2024.
  38. Unveiling memorization in code models. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering.
  39. An Empirical Study of Code Clones from Commercial AI Code Generators. Proceedings of the ACM on Software Engineering, 2025.
  40. Decoding secret memorization in code LLMs through token-level characterization. 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025.
  41. Efficient Estimation of Word Representations in Vector Space. 2013.
  42. Sennrich, Rico; Haddow, Barry; Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016. doi:10.18653/v1/P16-1162
  43. Kudo, Taku; Richardson, John. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018. doi:10.18653/v1/D18-2012
  44. Wang, Yue; Wang, Weishi; Joty, Shafiq; Hoi, Steven C.H. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021. doi:10.18653/v1/2021.emnlp-main.685
  45. Ciniselli, Matteo; Pascarella, Luca; Bavota, Gabriele. 2022. doi:10.1145/3524842.3528440
  46. Al-Kaswan, Ali; Izadi, Maliheh. 2023. doi:10.1109/NLBSE59153.2023.00008
  47. Cursor, 2025.
  48. OpenAI Codex: A Series of Models for Code Generation.
  49. Claude Statistics 2026: 18.9M Active Users.
  50. Claude Code: An Agentic CLI for Developers.
  51. From Language Models over Tokens to Language Models over Characters. Forty-second International Conference on Machine Learning.
  52. Yandex Cloud Documentation: Yandex Identity and Access Management: OAuth token. Yandex Cloud Documentation, 2023.
  53. OLMo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838.
  54. Claude Code Statistics 2026: Key Numbers, Data & Facts.
  55. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159.
  56. Adapters for Altering LLM Vocabularies: What Languages Benefit the Most? 2025.
  57. Zero-shot tokenizer transfer. Advances in Neural Information Processing Systems.
  58. The Rise of Agentic Engineering: Claude Code Adoption and Developer Milestones.
  59. Gee, Leonidas; Zugarini, Andrea; Rigutini, Leonardo; Torroni, Paolo. Fast Vocabulary Transfer for Language Model Compression. doi:10.18653/v1/2022.emnlp-industry.41
  60. Theoretical analysis of byte-pair encoding. arXiv preprint arXiv:2411.08671.
  61. Gage, Philip. C Users Journal, February 1994.
  62. Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching.
  63. Extracting training data from large language models. 30th USENIX Security Symposium (USENIX Security 21).
  64. Language Model Tokenizers Introduce Unfairness Between Languages. Advances in Neural Information Processing Systems.
  65. TOKDRIFT: When LLM Speaks in Subwords but Code Speaks in Grammar. arXiv preprint arXiv:2510.14972.