Secret Leak Detection in Software Issue Reports using LLMs: A Comprehensive Evaluation
Pith reviewed 2026-05-23 19:11 UTC · model grok-4.3
The pith
Fine-tuned LLMs detect secret leaks in GitHub issue reports at up to 94.49% F1.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that regex-based extraction combined with LLM contextual classification can identify real secrets in GitHub issue reports, with fine-tuned larger open-source models reaching 94.49% F1 on a constructed benchmark and maintaining 81.6% F1 when applied to 178 actual GitHub repositories.
What carries the argument
The LLM-based contextual classification step that reduces false positives from initial regex matches.
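The two-stage idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two regex patterns, the `Candidate` type, and `toy_classifier` (a keyword heuristic standing in for the fine-tuned LLM) are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical patterns for illustration only; real scanners ship many more.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_assignment": re.compile(
        r"(?i)\b(?:api[_-]?key|token|secret|password)\b\s*[:=]\s*['\"]?([^\s'\"]{8,})"
    ),
}


@dataclass
class Candidate:
    kind: str      # name of the pattern that matched
    value: str     # the candidate secret string
    context: str   # surrounding issue text handed to the classifier


def extract_candidates(issue_text: str, window: int = 60) -> List[Candidate]:
    """Stage 1: high-recall regex extraction, keeping a context window."""
    found = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(issue_text):
            # Use the capture group when the pattern defines one.
            value = m.group(1) if m.groups() else m.group(0)
            start = max(0, m.start() - window)
            end = min(len(issue_text), m.end() + window)
            found.append(Candidate(kind, value, issue_text[start:end]))
    return found


def toy_classifier(candidate: Candidate) -> bool:
    """Stage 2 stand-in (assumption): reject obvious placeholder values.

    In the paper this step is an LLM that reads the context; here it is
    only a keyword check so that the sketch runs without a model.
    """
    placeholder_markers = ("example", "your_", "xxxx", "<", "dummy", "changeme")
    lowered = candidate.value.lower()
    return not any(marker in lowered for marker in placeholder_markers)


def detect_secrets(
    issue_text: str, classify: Callable[[Candidate], bool] = toy_classifier
) -> List[Candidate]:
    """Run extraction, then keep only candidates the classifier accepts."""
    return [c for c in extract_candidates(issue_text) if classify(c)]
```

Passing a different `classify` callable is where a model-backed classifier would plug in; the regex stage stays unchanged.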
If this is right
- Regex- and entropy-based approaches achieve high recall but poor precision.
- Smaller deep learning models such as RoBERTa and CodeBERT improve to 92.70% F1.
- Proprietary models like GPT-4o in few-shot settings reach 80.13% F1.
- Fine-tuned larger open-source LLMs such as Qwen and LLaMA score highest, at up to 94.49% F1.
- The pipeline generalizes to real-world scenarios with 81.6% F1 on 178 repositories.
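To see why an entropy baseline can trade precision for recall, here is a minimal sketch of a Shannon-entropy filter. The 3.5 bits-per-character threshold and the minimum length of 16 are arbitrary assumptions for illustration, not values taken from the paper or from any specific tool.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy, in bits per character, of the string's character distribution."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def looks_like_secret(token: str, threshold: float = 3.5, min_len: int = 16) -> bool:
    """Flag long, high-entropy tokens. Threshold and length are assumed values."""
    return len(token) >= min_len and shannon_entropy(token) >= threshold
```

A filter like this flags any long, random-looking string, so identifiers such as hashes or UUID fragments would pass it just as a real key would; that is the kind of false positive a contextual classifier is meant to remove.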
Where Pith is reading between the lines
- The results suggest that issue reports contain a substantial number of secret leaks that source-code scanners miss.
- Integration into issue tracking platforms could prevent accidental exposures before they are public.
- Similar techniques might apply to other text-based developer communications like commit messages or pull request discussions.
Load-bearing premise
The 5,881 manually verified true secrets are labeled correctly without bias and the benchmark represents typical secret leaks in GitHub issue reports.
What would settle it
A drop in F1 score below 70% when the pipeline is tested on a fresh collection of issue reports from additional repositories would indicate the claim does not hold.
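For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion counts; the function name is a hypothetical helper, and any counts passed to it are the caller's own, not figures from the paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With made-up counts of 90 true positives, 90 false positives, and 10 false negatives, recall is 0.9 but precision is only 0.5, which puts F1 at about 0.64: high recall alone does not keep a detector above a 70% F1 bar.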
Original abstract
In the digital era, accidental exposure of sensitive information such as API keys, tokens, and credentials is a growing security threat. While most prior work focuses on detecting secrets in source code, leakage in software issue reports remains largely unexplored. This study fills that gap through a large-scale analysis and a practical detection pipeline for exposed secrets in GitHub issues. Our pipeline combines regular expression-based extraction with large language model (LLM)-based contextual classification to detect real secrets and reduce false positives. We build a benchmark of 54,148 instances from public GitHub issues, including 5,881 manually verified true secrets. Using this dataset, we evaluate entropy-based baselines and keyword heuristics used by prior secret detection tools, classical machine learning, deep learning, and LLM-based methods. Regex- and entropy-based approaches achieve high recall but poor precision, while smaller models such as RoBERTa and CodeBERT greatly improve performance (F1 = 92.70%). Proprietary models like GPT-4o perform moderately in few-shot settings (F1 = 80.13%), and fine-tuned open-source larger LLMs such as Qwen and LLaMA reach up to 94.49% F1. Finally, we also validate our approach on 178 real-world GitHub repositories, achieving an F1-score of 81.6%, which demonstrates our approach's strong ability to generalize to in-the-wild scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to fill the gap in detecting secret leaks in software issue reports (as opposed to source code) by proposing a pipeline that combines regex-based extraction with LLM-based contextual classification. It constructs a new benchmark of 54,148 instances from GitHub issues with 5,881 manually verified true secrets, evaluates a range of baselines from entropy/keyword methods to classical ML, DL, and various LLMs (few-shot and fine-tuned), reports peak F1 of 94.49% for fine-tuned Qwen/LLaMA, and achieves 81.6% F1 on an external validation set of 178 real-world GitHub repositories.
Significance. If the ground-truth labels prove reliable, this provides the first large-scale benchmark and systematic comparison for secret detection in issue reports, an important but neglected setting. The inclusion of real-world validation and the demonstration that fine-tuned open-source LLMs outperform proprietary models and traditional methods in this domain are notable strengths. The work supplies both a dataset and practical pipeline that could be adopted by practitioners.
major comments (2)
- [Benchmark construction] Benchmark construction paragraph: the 5,881 manually verified true secrets are described only as 'manually verified' with no annotation protocol, inter-annotator agreement, number of annotators, or adjudication rules for ambiguous issue-report contexts. All reported F1 scores (94.49% peak and 81.6% real-world) rest directly on these labels; without these details it is impossible to assess label quality or potential systematic bias correlated with contextual cues the LLMs exploit.
- [Real-world validation] Real-world validation section: the sampling strategy used to select the 178 GitHub repositories and the procedure for establishing ground truth on this external set are not described. This information is required to evaluate whether the 81.6% F1 result supports the generalization claim.
minor comments (2)
- [Evaluation] Evaluation section: no statistical significance tests or confidence intervals are reported for F1 differences across methods; adding them would strengthen the comparative claims without altering the central empirical narrative.
- [Abstract] Abstract: the benchmark size is stated but the annotation and sampling details are omitted; a one-sentence summary of the verification process would improve completeness for readers.
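One way the confidence intervals requested in the Evaluation comment could be produced is a percentile bootstrap over the test set. The sketch below is an assumption about how a reader might do this, not a procedure described in the paper; the function names, the 2,000 resamples, and the fixed seed are illustrative choices.

```python
import random
from typing import Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 with 1 as the positive (true secret) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for F1: resample instances with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

To compare two methods rather than report one interval, the same resampled indices can be applied to both prediction lists and the interval taken over the difference in F1.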
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for greater transparency in our benchmark and validation procedures. We address each major comment below and will revise the manuscript accordingly to incorporate the requested details.
- Referee: [Benchmark construction] Benchmark construction paragraph: the 5,881 manually verified true secrets are described only as 'manually verified' with no annotation protocol, inter-annotator agreement, number of annotators, or adjudication rules for ambiguous issue-report contexts. All reported F1 scores (94.49% peak and 81.6% real-world) rest directly on these labels; without these details it is impossible to assess label quality or potential systematic bias correlated with contextual cues the LLMs exploit.
  Authors: We agree that these methodological details are essential for readers to evaluate label quality. In the revised manuscript we will add a dedicated subsection in the benchmark construction section that fully describes the annotation protocol, the number of annotators, the inter-annotator agreement statistics (including Cohen's kappa), and the adjudication process for ambiguous cases. This addition will directly address concerns about potential bias in the ground-truth labels.
  Revision: yes
- Referee: [Real-world validation] Real-world validation section: the sampling strategy used to select the 178 GitHub repositories and the procedure for establishing ground truth on this external set are not described. This information is required to evaluate whether the 81.6% F1 result supports the generalization claim.
  Authors: We concur that explicit description of the sampling strategy and ground-truth procedure is required to substantiate the generalization claim. We will expand the real-world validation section to detail the repository selection criteria and sampling method, as well as the exact steps used to establish ground truth on the 178 repositories. These clarifications will strengthen the external validation results.
  Revision: yes
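The Cohen's kappa statistic named in the first response measures agreement between two annotators beyond what chance alone would produce. A minimal sketch of the standard two-rater formula follows; the function name is hypothetical, and any labels passed in are illustrative rather than drawn from the paper's annotation data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

On an imbalanced task like this one, where true secrets are a small fraction of the instances, raw percent agreement can look high by chance, which is why a chance-corrected statistic is the more informative number to report.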
Circularity Check
No circularity: purely empirical benchmarking on new dataset
The paper constructs a new benchmark (54,148 instances, 5,881 manually verified secrets) and reports direct experimental F1 scores for regex, ML, and LLM methods, plus an external validation on 178 repositories. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. All reported metrics are independent measurements on the constructed data rather than reductions to inputs by construction. Labeling reliability is a validity concern, not a circularity issue.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: manual verification of 5,881 secrets produces accurate and unbiased labels.
Forward citations
Cited by 1 Pith paper
- IssueGuard: Real-Time Secret Leak Prevention Tool for GitHub Issue Reports
  IssueGuard delivers real-time secret detection in GitHub issues via regex and CodeBERT, reaching 92.7% F1-score and outperforming pure regex scanners.