Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective
Pith reviewed 2026-05-10 04:57 UTC · model grok-4.3
The pith
BPE tokenization creates a gibberish bias that makes certain high-entropy secrets among the easiest for code LLMs to memorize and leak.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that BPE tokenization leads to gibberish bias in CLLM secret memorization: certain secrets become the easiest to memorize because they exhibit high character-level entropy yet low token-level entropy. This pattern is traced to a token distribution shift between the models' training data and the secret strings themselves. The authors present numerical evidence for the bias, describe its behavior under the larger-vocabulary trend, and outline mitigation strategies together with implications for how tokenizers should be designed for code models.
What carries the argument
Gibberish bias: the preferential memorization of secrets that have high character-level entropy but low token-level entropy, driven by token distribution shift between CLLM training data and secret data.
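The character-versus-token entropy contrast at the heart of the bias can be sketched concretely. The snippet below computes Shannon entropy over a string's characters and over a token sequence; the BPE segmentation shown is a hand-picked toy illustration, not the output of any real model's tokenizer.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy (bits/symbol) of the empirical distribution over `symbols`."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A secret that looks fairly random character-by-character...
secret = "deadbeefdeadbeefdeadbeef"
char_entropy = shannon_entropy(list(secret))

# ...but whose (hypothetical) BPE segmentation alternates just two common tokens.
tokens = ["dead", "beef"] * 3
token_entropy = shannon_entropy(tokens)
```

Here `char_entropy` is roughly 2.16 bits while `token_entropy` is exactly 1 bit: the string looks complex to a character-level observer but trivially regular at the token level, which is the pattern the paper associates with preferential memorization.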
If this is right
- Secrets that appear random at character level but align with common training tokens face elevated leakage risk through CLLMs.
- The bias is expected to strengthen as token vocabularies continue to grow.
- Mitigation can target tokenizer choices or post-processing steps that raise token-level entropy of secrets.
- Tokenizer design for code models must now account for unintended memorization pathways created by distribution shifts.
Where Pith is reading between the lines
- The same mechanism may increase leakage risk for any high-entropy strings whose token sequences match training patterns, not only passwords.
- Secret generation practices could deliberately maximize token entropy to reduce this particular memorization route.
- Character-aware or hybrid tokenization schemes might be tested to reduce the gap between character and token entropy.
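One of the mitigation routes above, generating secrets that maximize token-level entropy, can be sketched as rejection sampling against a tokenizer. Everything here is hypothetical scaffolding: `MERGES` and `toy_tokenize` stand in for a real BPE vocabulary and segmenter, which a real deployment would query from the actual model tokenizer.

```python
import math
import secrets as rng
import string
from collections import Counter

# Toy stand-in for BPE merges; a real mitigation would query the deployed
# model's actual tokenizer rather than this hand-written list.
MERGES = sorted(["ab", "cd", "ef", "123", "key", "token"], key=len, reverse=True)

def toy_tokenize(s):
    """Greedy longest-match segmentation against MERGES; fall back to chars."""
    out, i = [], 0
    while i < len(s):
        for m in MERGES:
            if s.startswith(m, i):
                out.append(m)
                i += len(m)
                break
        else:
            out.append(s[i])
            i += 1
    return out

def token_entropy(s):
    """Shannon entropy (bits/token) of the secret's token sequence."""
    toks = toy_tokenize(s)
    counts = Counter(toks)
    n = len(toks)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def sample_high_token_entropy_secret(length=16, candidates=64):
    """Draw several random secrets and keep the one whose tokenization
    has the highest token-level entropy."""
    alphabet = string.ascii_lowercase + string.digits
    draws = ("".join(rng.choice(alphabet) for _ in range(length))
             for _ in range(candidates))
    return max(draws, key=token_entropy)
```

The design choice is deliberately conservative: rather than constructing secrets from the tokenizer's vocabulary (which could leak information about the vocabulary itself), it keeps standard cryptographic sampling and merely filters out draws that happen to tokenize into repetitive, low-entropy sequences.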
Load-bearing premise
The observed bias stems primarily from the token distribution shift between training data and secret strings and can be isolated from confounding effects of model scale or overall training data composition.
What would settle it
Train CLLMs on data whose token distribution matches that of typical secrets and check whether the preference for low token-entropy secrets disappears.
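The "token distribution shift" that such an experiment would manipulate can be quantified directly, for example as a KL divergence between the token frequency distribution of the secrets and that of the training corpus. The token streams below are made-up placeholders; only the measurement recipe is the point.

```python
import math
from collections import Counter

def token_distribution(token_stream, vocab):
    """Add-one-smoothed empirical distribution over `vocab`."""
    counts = Counter(token_stream)
    total = len(token_stream) + len(vocab)
    return {t: (counts[t] + 1) / total for t in vocab}

def kl_divergence(p, q):
    """D_KL(p || q) in bits; both dicts must cover the same vocabulary."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)

# Hypothetical token streams: tokenized training corpus vs. tokenized secrets.
train_tokens  = ["def", "return", "if", "def", "key", "return", "def"]
secret_tokens = ["key", "key", "x", "q", "key"]
vocab = set(train_tokens) | set(secret_tokens)

p = token_distribution(secret_tokens, vocab)
q = token_distribution(train_tokens, vocab)
shift = kl_divergence(p, q)  # larger value = larger distribution shift
```

Under the proposed experiment, training on a corpus constructed so that `shift` is near zero should, if the paper's attribution is right, erase the memorization preference for low token-entropy secrets.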
Original abstract
Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. With the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), these models have been shown to inadvertently leak such secrets due to a notorious memorization phenomenon. This study first reveals that Byte-Pair Encoding (BPE) tokenization leads to unexpected secret-memorization behavior, which we term "gibberish bias". Specifically, we identify that certain secrets are among the easiest for CLLMs to memorize: they yield high character-level entropy but low token-level entropy. The paper then supports this claim with numerical data. We trace the root of the bias to the token distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the "larger vocabulary" trend. To conclude, we discuss potential mitigation strategies and the broader implications for current tokenizer design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Byte-Pair Encoding (BPE) tokenization in Code Large Language Models (CLLMs) produces a 'gibberish bias' in secret memorization: secrets exhibiting high character-level entropy but low token-level entropy are among the easiest for models to memorize. It reports numerical evidence for this bias, attributes the root cause to token distribution shift between CLLM training corpora and secret data, examines how the bias interacts with the trend toward larger vocabularies, and outlines potential mitigation strategies.
Significance. If the central attribution to tokenization-induced distribution shift can be isolated from confounders, the result would be significant for understanding memorization risks in code assistants and for informing tokenizer design choices that reduce leakage of high-entropy secrets. The work highlights a concrete, measurable interaction between token boundaries and memorization difficulty that is not currently emphasized in the security or LLM literature.
major comments (2)
- [Abstract] Abstract and experimental sections: the claim that numerical data supports attribution of the bias to token distribution shift is load-bearing, yet the abstract provides no description of datasets, exclusion criteria, statistical controls, or ablations (e.g., fixed-scale models or synthetic corpora with matched n-gram statistics but altered token boundaries). Without these, the isolation of tokenization as the primary cause cannot be verified and risks circularity with the entropy-based definition of the bias itself.
- [Results / Discussion (token distribution shift analysis)] The weakest assumption—that token distribution shift can be isolated as the primary driver without confounding from model scale or training-data composition—is not addressed by any reported control experiments. If such controls are absent, the central causal claim remains unproven even if the correlation between low token-entropy secrets and higher memorization rates holds.
minor comments (2)
- [Introduction] The term 'gibberish bias' is introduced without an explicit formal definition or pseudocode for computing the character- versus token-level entropy contrast; a short definition box or equation would improve clarity.
- [Figures] Figure captions and axis labels for any entropy or memorization-rate plots should explicitly state the tokenization scheme, model sizes, and secret-selection criteria used.
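The explicit definition the first minor comment asks for could take a form like the following. This is a sketch under assumed notation (the paper's exact formulation may differ): $\hat{p}$ denotes empirical frequencies within the secret, and $T(s)$ the BPE tokenization of $s$ under the model's tokenizer.

```latex
% Character-level vs. token-level entropy of a secret s (sketch).
% Sigma(s): distinct characters of s; T(s): BPE token sequence of s.
\[
  H_{\mathrm{char}}(s) = -\sum_{c \in \Sigma(s)} \hat{p}(c)\,\log_2 \hat{p}(c),
  \qquad
  H_{\mathrm{tok}}(s)  = -\sum_{t \in \mathrm{set}(T(s))} \hat{p}(t)\,\log_2 \hat{p}(t)
\]
% A secret s exhibits gibberish bias when H_char(s) is high while
% H_tok(s) is low, i.e. the gap H_char(s) - H_tok(s) is large.
```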
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key opportunities to strengthen the causal claims in our work on gibberish bias. We address each major comment below and describe the revisions we will incorporate.
Point-by-point responses
Referee: [Abstract] Abstract and experimental sections: the claim that numerical data supports attribution of the bias to token distribution shift is load-bearing, yet the abstract provides no description of datasets, exclusion criteria, statistical controls, or ablations (e.g., fixed-scale models or synthetic corpora with matched n-gram statistics but altered token boundaries). Without these, the isolation of tokenization as the primary cause cannot be verified and risks circularity with the entropy-based definition of the bias itself.
Authors: We agree that the abstract should supply more context to support the attribution. In the revised manuscript we will expand the abstract to include a concise description of the secret datasets and training corpora employed, the exclusion criteria applied to high-entropy secrets, and the statistical procedures used to compute memorization rates and entropies. We will also clarify that token-level entropy is obtained by applying the fixed CLLM tokenizer to each secret string, while character-level entropy is computed independently via Shannon entropy over characters; this separation avoids direct circularity. Full synthetic-corpus ablations with matched n-gram statistics lie beyond the current experiments, but we will add an explicit limitations paragraph noting this gap and outlining how such controls could be constructed in follow-up work. revision: yes
Referee: [Results / Discussion (token distribution shift analysis)] The weakest assumption—that token distribution shift can be isolated as the primary driver without confounding from model scale or training-data composition—is not addressed by any reported control experiments. If such controls are absent, the central causal claim remains unproven even if the correlation between low token-entropy secrets and higher memorization rates holds.
Authors: The referee correctly notes the absence of explicit controls for model scale and training-data composition. Our current results demonstrate a consistent correlation between low token-entropy secrets and elevated memorization rates, together with direct measurements of token-distribution shift between the CLLM training corpora and the secret strings. In the revision we will insert a new subsection that enumerates potential confounders (model scale, training-data composition) and discusses how each could interact with the observed bias. Where possible we will re-analyze existing results across the multiple model sizes already evaluated in the paper. We acknowledge that definitive isolation would require additional controlled experiments (e.g., fixed-scale models trained on synthetically altered corpora); these are noted as future work rather than claimed as completed. revision: partial
Circularity Check
No significant circularity in derivation chain
full rationale
The paper defines gibberish bias via observed differences in character-level vs. token-level entropy for secrets that are easiest to memorize, then attributes the pattern to token distribution shift on the basis of numerical data presented in the manuscript. No equations, self-citations, or fitted-parameter renamings are exhibited that reduce the central attribution to a tautology or to the input observations by construction. The derivation therefore remains self-contained with empirical support that is not forced by the definitions themselves.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Secrets can be meaningfully distinguished by comparing character-level entropy to token-level entropy under BPE
- domain assumption Memorization behavior in CLLMs is primarily driven by token distribution mismatch with training data
invented entities (1)
- gibberish bias (no independent evidence)