arxiv: 2410.23657 · v4 · submitted 2024-10-31 · 💻 cs.SE

Secret Leak Detection in Software Issue Reports using LLMs: A Comprehensive Evaluation

Pith reviewed 2026-05-23 19:11 UTC · model grok-4.3

classification 💻 cs.SE
keywords secret leak detection · large language models · GitHub issue reports · software security · benchmark evaluation · contextual classification · false positive reduction

The pith

Fine-tuned LLMs detect secret leaks in GitHub issue reports at up to 94.49% F1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a detection pipeline for sensitive information leaks in software issue reports on GitHub, an area previously overlooked in favor of source code analysis. It constructs a benchmark dataset of 54,148 instances including 5,881 verified true secrets and compares multiple methods, finding that fine-tuned open-source LLMs like Qwen and LLaMA achieve the highest performance. The work shows practical value by validating the approach on 178 real-world repositories with an F1-score of 81.6%. A sympathetic reader would care because such leaks pose real security risks and the method offers a way to catch them automatically.

Core claim

The paper claims that regex-based extraction combined with LLM contextual classification can identify real secrets in GitHub issue reports, with fine-tuned larger open-source models reaching 94.49% F1 on a constructed benchmark and maintaining 81.6% F1 when applied to 178 actual GitHub repositories.

What carries the argument

The LLM-based contextual classification step that reduces false positives from initial regex matches.
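
To make the two-stage design concrete, here is a minimal Python sketch of regex extraction followed by a contextual filter. It is an illustration under stated assumptions, not the authors' implementation: the two patterns, the character-based `window`, and `toy_classifier` (a keyword stand-in for the fine-tuned LLM) are all hypothetical.

```python
import re
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    text: str     # the string matched by a regex
    context: str  # surrounding characters handed to the classifier


# Illustrative patterns only (assumption); the paper draws on a far larger set.
PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),     # shape of an AWS access key ID
    re.compile(r"ghp_[A-Za-z0-9]{36}"),  # shape of a GitHub personal access token
]


def extract_candidates(issue_text: str, window: int = 100) -> List[Candidate]:
    """Stage 1: high-recall regex matching, keeping `window` characters
    of context on each side of every match."""
    candidates = []
    for pattern in PATTERNS:
        for match in pattern.finditer(issue_text):
            start = max(0, match.start() - window)
            end = min(len(issue_text), match.end() + window)
            candidates.append(Candidate(match.group(0), issue_text[start:end]))
    return candidates


def detect_secrets(
    issue_text: str,
    classify: Callable[[Candidate], bool],
    window: int = 100,
) -> List[Candidate]:
    """Stage 2: keep only the candidates the contextual classifier accepts."""
    return [c for c in extract_candidates(issue_text, window) if classify(c)]


def toy_classifier(candidate: Candidate) -> bool:
    """Hypothetical stand-in for the LLM: reject obvious placeholders."""
    lowered = candidate.context.lower()
    return not any(word in lowered for word in ("example", "dummy", "placeholder"))
```

Swapping `toy_classifier` for a call to a fine-tuned model would leave the rest of the structure unchanged: the regex stage keeps recall high, and the classifier decides which candidates survive.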

If this is right

  • Regex and entropy-based approaches achieve high recall but poor precision.
  • Smaller pre-trained language models such as RoBERTa and CodeBERT improve markedly on these baselines, reaching 92.70% F1.
  • Proprietary models like GPT-4o perform moderately in few-shot settings, at 80.13% F1.
  • Fine-tuned open-source LLMs such as Qwen and LLaMA perform best, at up to 94.49% F1.
  • The pipeline generalizes to real-world scenarios with 81.6% F1 on 178 repositories.
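
The first bullet follows from how F1 is defined: it is the harmonic mean of precision and recall, so poor precision drags it down even when recall is high. The sketch below computes the three metrics from confusion counts. The counts in the usage note are invented for illustration and are not taken from the paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With hypothetical counts of `tp=95`, `fp=855`, `fn=5`, recall is 0.95 but precision is only 0.10, giving an F1 of roughly 0.18. A contextual classifier that removes most false positives raises precision, and F1 with it.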

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results suggest that issue reports contain a substantial number of secret leaks that source-code scanners miss.
  • Integration into issue tracking platforms could prevent accidental exposures before they become public.
  • Similar techniques might apply to other text-based developer communications like commit messages or pull request discussions.

Load-bearing premise

The 5,881 manually verified true secrets are labeled correctly without bias and the benchmark represents typical secret leaks in GitHub issue reports.

What would settle it

A drop in F1 score below 70% when the pipeline is tested on a fresh collection of issue reports from additional repositories would indicate the claim does not hold.

Figures

Figures reproduced from arXiv: 2410.23657 by Gias Uddin, Md Nafiu Rahman, Rifat Shahriyar, Sadif Ahmed, Zahin Wahab.

Figure 1: Example of secret breach in GitHub issue report
Figure 2: Likelihood of encountering a secret breach in issue reports
Figure 3: Severities associated with secret breaches in GitHub issue reports
Figure 4: Factors influencing sharing of secrets in issue reports
Figure 5: Steps for curating benchmark dataset. Step 2: Enumerating regular expressions. TruffleHog [35], a widely used open-source secret-scanning tool, comprises a package of secret detectors. Basak et al. [5] extracted 751 regular expression (regex) patterns from the source code of the detector package and incorporated them into their pattern set. They also included 10 regex patterns from Meli et al…
Figure 6: Evaluation of language models with context window = 100, learning rate =
Figure 7: Evaluation of language models with context window = 125, learning rate =
Figure 8: Evaluation of language models with context window = 200, learning rate =
Figure 9: Evaluation of language models with context window = 250, learning rate =
Figure 10: Learning rate vs F1, minority F1, Fβ (context window=100, AdamW optimizer). Varying Optimizer. For optimizers, we experimented with AdamW, SGD, AdaGrad, and RMSProp [28]. We can clearly see from
Figure 11: Optimizer vs F1, minority F1, Fβ (context window=100, lr=1e-5). Varying Length of Context Window. As we varied the length of context windows (shown in
Figure 12: Context Window vs F1, minority F1, Fβ with AdamW optimizer and lr=1e-5
Figure 13: Sensitivity analysis of evaluation metrics with respect to the size of context window
Figure 14: Comparing test time with and without context window (AdamW optimizer and lr=
Figure 15: Comparing training time with and without context window (AdamW optimizer and lr=
Figure 16: Confusion Matrix of true labels vs. predicted labels of SOTA regexes run on the raw dataset
Figure 17: Confusion Matrix of true labels vs. predicted labels of SOTA regexes run on the pre-processed dataset
Figure 18: Confusion Matrix of true labels vs. predicted labels of Language Model (
Figure 19: Pipeline for Secret Breach Detection in Issue Reports.
Figure 20: Workflow of the Secret Mitigator Extension
Figure 21: Likelihood of detecting a secret breach in issue reports by our bot
Figure 22: Comparison of user confidence in avoiding secret breaches.
Figure 23: Likelihood of detecting a secret breach in issue reports by our bot
Figure 24: Future improvement opportunities of bot
Figure 25: Additional tool supports needed for issue report management

Original abstract

In the digital era, accidental exposure of sensitive information such as API keys, tokens, and credentials is a growing security threat. While most prior work focuses on detecting secrets in source code, leakage in software issue reports remains largely unexplored. This study fills that gap through a large-scale analysis and a practical detection pipeline for exposed secrets in GitHub issues. Our pipeline combines regular expression-based extraction with large language model (LLM)-based contextual classification to detect real secrets and reduce false positives. We build a benchmark of 54,148 instances from public GitHub issues, including 5,881 manually verified true secrets. Using this dataset, we evaluate entropy-based baselines and keyword heuristics used by prior secret detection tools, classical machine learning, deep learning, and LLM-based methods. Regex and entropy based approaches achieve high recall but poor precision, while smaller models such as RoBERTa and CodeBERT greatly improve performance (F1 = 92.70%). Proprietary models like GPT-4o perform moderately in few-shot settings (F1 = 80.13%), and fine-tuned open-source larger LLMs such as Qwen and LLaMA reach up to 94.49% F1. Finally, we also validate our approach on 178 real-world GitHub repositories, achieving an F1-score of 81.6% which demonstrates our approach's strong ability to generalize to in-the-wild scenarios.
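
The entropy-based baselines mentioned in the abstract can be pictured with a short Shannon entropy [32] sketch. This is an assumption-laden illustration rather than the baseline the paper evaluates: the per-character formulation and the `threshold` of 3.5 bits are hypothetical choices.

```python
import math
from collections import Counter


def shannon_entropy(token: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not token:
        return 0.0
    total = len(token)
    counts = Counter(token)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_random(token: str, threshold: float = 3.5) -> bool:
    """Flag a token as a possible secret when its entropy meets the
    threshold. The default threshold is an assumed value for illustration."""
    return shannon_entropy(token) >= threshold
```

A check like this flags any sufficiently random-looking string, including commit hashes and UUIDs, which is consistent with the high-recall, low-precision behavior the abstract reports for such baselines.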

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims to fill the gap in detecting secret leaks in software issue reports (as opposed to source code) by proposing a pipeline that combines regex-based extraction with LLM-based contextual classification. It constructs a new benchmark of 54,148 instances from GitHub issues with 5,881 manually verified true secrets, evaluates a range of baselines from entropy/keyword methods to classical ML, DL, and various LLMs (few-shot and fine-tuned), reports peak F1 of 94.49% for fine-tuned Qwen/LLaMA, and achieves 81.6% F1 on an external validation set of 178 real-world GitHub repositories.

Significance. If the ground-truth labels prove reliable, this provides the first large-scale benchmark and systematic comparison for secret detection in issue reports, an important but neglected setting. The inclusion of real-world validation and the demonstration that fine-tuned open-source LLMs outperform proprietary models and traditional methods in this domain are notable strengths. The work supplies both a dataset and practical pipeline that could be adopted by practitioners.

major comments (2)
  1. [Benchmark construction] Benchmark construction paragraph: the 5,881 manually verified true secrets are described only as 'manually verified' with no annotation protocol, inter-annotator agreement, number of annotators, or adjudication rules for ambiguous issue-report contexts. All reported F1 scores (94.49% peak and 81.6% real-world) rest directly on these labels; without these details it is impossible to assess label quality or potential systematic bias correlated with contextual cues the LLMs exploit.
  2. [Real-world validation] Real-world validation section: the sampling strategy used to select the 178 GitHub repositories and the procedure for establishing ground truth on this external set are not described. This information is required to evaluate whether the 81.6% F1 result supports the generalization claim.
minor comments (2)
  1. [Evaluation] Evaluation section: no statistical significance tests or confidence intervals are reported for F1 differences across methods; adding them would strengthen the comparative claims without altering the central empirical narrative.
  2. [Abstract] Abstract: the benchmark size is stated but the annotation and sampling details are omitted; a one-sentence summary of the verification process would improve completeness for readers.
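
The first minor comment asks for confidence intervals on F1. One common way to obtain them is a percentile bootstrap over test instances, sketched below in Python. The function names, the 1,000 resamples, and the fixed seed are assumptions for illustration; the paper does not report such an analysis.

```python
import random
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1) of a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for F1.

    Resamples test instances with replacement, recomputes F1 each time,
    and returns the alpha/2 and 1 - alpha/2 percentiles of the scores.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Non-overlapping intervals for two methods would support a claimed difference; for a direct comparison on the same test set, a paired resampling of both methods' predictions is usually preferable.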

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater transparency in our benchmark and validation procedures. We address each major comment below and will revise the manuscript accordingly to incorporate the requested details.

Point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction paragraph: the 5,881 manually verified true secrets are described only as 'manually verified' with no annotation protocol, inter-annotator agreement, number of annotators, or adjudication rules for ambiguous issue-report contexts. All reported F1 scores (94.49% peak and 81.6% real-world) rest directly on these labels; without these details it is impossible to assess label quality or potential systematic bias correlated with contextual cues the LLMs exploit.

    Authors: We agree that these methodological details are essential for readers to evaluate label quality. In the revised manuscript we will add a dedicated subsection in the benchmark construction section that fully describes the annotation protocol, the number of annotators, the inter-annotator agreement statistics (including Cohen's kappa), and the adjudication process for ambiguous cases. This addition will directly address concerns about potential bias in the ground-truth labels.
    revision: yes

  2. Referee: [Real-world validation] Real-world validation section: the sampling strategy used to select the 178 GitHub repositories and the procedure for establishing ground truth on this external set are not described. This information is required to evaluate whether the 81.6% F1 result supports the generalization claim.

    Authors: We concur that explicit description of the sampling strategy and ground-truth procedure is required to substantiate the generalization claim. We will expand the real-world validation section to detail the repository selection criteria and sampling method, as well as the exact steps used to establish ground truth on the 178 repositories. These clarifications will strengthen the external validation results.
    revision: yes
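
The agreement statistic promised in the first response, Cohen's kappa [8], corrects raw agreement for the agreement expected by chance. The sketch below assumes exactly two annotators labeling the same items; the number of annotators and the label scheme are not stated in the material above, so both are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected from each annotator's label
    frequencies alone.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)
    ) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, two annotators who agree on 8 of 10 items, each marking 4 items as secrets, have an observed agreement of 0.80 and a chance agreement of 0.52, so kappa is about 0.58.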

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking on new dataset

full rationale

The paper constructs a new benchmark (54,148 instances, 5,881 manually verified secrets) and reports direct experimental F1 scores for regex, ML, and LLM methods, plus an external validation on 178 repositories. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. All reported metrics are independent measurements on the constructed data rather than reductions to inputs by construction. Labeling reliability is a validity concern, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the assumption that manual secret verification produces reliable ground truth and that standard ML train/test splits generalize; no new entities or fitted constants are introduced beyond typical model hyperparameters.

axioms (1)
  • domain assumption: Manual verification of 5,881 secrets produces accurate and unbiased labels
    Benchmark construction described in abstract depends on this labeling step for all reported F1 scores.

pith-pipeline@v0.9.0 · 5798 in / 1251 out tokens · 28561 ms · 2026-05-23T19:11:45.978102+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IssueGuard: Real-Time Secret Leak Prevention Tool for GitHub Issue Reports

    cs.CR · 2026-02 · unverdicted · novelty 4.0

    IssueGuard delivers real-time secret detection in GitHub issues via regex and CodeBERT, reaching 92.7% F1-score and outperforming pure regex scanners.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  [1] AGWA. 2012. git-crypt. Retrieved 2024-03-26 from https://github.com/AGWA/git-crypt
  [2] Atlassian. 2008. Bitbucket. Retrieved 2024-04-16 from https://bitbucket.org/product
  [3] Setu Kumar Basak, Lorenzo Neil, Bradley Reaves, and Laurie Williams. 2022. What are the practices for secret management in software artifacts? In 2022 IEEE Secure Development Conference (SecDev). IEEE, 69–76
  [4] Setu Kumar Basak, Lorenzo Neil, Bradley Reaves, and Laurie Williams. 2023. Regular Expressions used in SecretBench dataset. Retrieved 2024-03-27 from https://zenodo.org/records/7571266
  [5] Setu Kumar Basak, Lorenzo Neil, Bradley Reaves, and Laurie Williams. 2023. SecretBench: A Dataset of Software Secrets. arXiv preprint arXiv:2303.06729 (2023)
  [6] GitLab B.V. 2011. GitLab. Retrieved 2024-04-16 from https://about.gitlab.com/
  [7] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020)
  [8] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement 20, 1 (1960), 37–46
  [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  [10] Runhan Feng, Ziyang Yan, Shiyan Peng, and Yuanyuan Zhang. 2022. Automated detection of password leakage from public GitHub repositories. In Proceedings of the 44th International Conference on Software Engineering. 175–186
  [11] Sally Fincher and Josh Tenenberg. 2005. Making sense of card sorting data. Expert Systems 22, 3 (2005), 89–93
  [12] GitGuardian. 2017. ggshield. Retrieved 2024-02-02 from https://www.gitguardian.com/ggshield
  [13] GitGuardian. 2024. State of Secrets Sprawl Report 2023. Retrieved 2024-03-12 from https://www.gitguardian.com/state-of-secrets-sprawl-report-2023
  [14] GitHub. 2009. GitHub Issues Documentation. Retrieved 2024-10-16 from https://docs.github.com/en/issues
  [15] GitHub. 2024. Getting Started with the REST API. Retrieved 2024-03-26 from https://docs.github.com/en/rest/using-the-rest-api/getting-started-with-the-rest-api?apiVersion=2022-11-28
  [16] Gitleaks. 2018. Gitleaks. Retrieved 2024-02-02 from https://github.com/gitleaks/gitleaks
  [17] Khalid Hasan, Partho Chakraborty, Rifat Shahriyar, Anindya Iqbal, and Gias Uddin. 2021. A Survey-Based Qualitative Study to Characterize Expectations of Software Developers from Five Stakeholders. In Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–11
  [18] Connor Jones. 2023. Cryptojackers steal AWS credentials from GitHub in 5 minutes. Retrieved 2024-02-02 from https://www.theregister.com/2023/10/30/cryptojackers_steal_aws_credentials_github/
  [19] Rafael Kallis, Maliheh Izadi, Luca Pascarella, Oscar Chaparro, and Pooja Rani. 2023. NLBSE’23 issue report dataset. Retrieved 2024-02-02 from https://github.com/nlbse2023/issue-report-classification
  [20] Ramakrishnan Kandasamy. 2020. Secret Detection Tools for Source Codes. Retrieved 2024-02-02 from https://github.com/rmkanda/tools
  [21] Venus Kohli. 2023. Context Window. Retrieved 2024-04-13 from https://www.techtarget.com/whatis/definition/context-window#:~:text=A%20context%20window%20is%20a,time%20the%20information%20is%20generated
  [22] Igal Kreichman. 2021. The secrets about exposed secrets in code. Retrieved 2024-02-02 from https://apiiro.com/blog/the-secrets-about-secrets-in-code/
  [23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
  [24] Navaneeth Malingan. 2024. Attention Mechanism in Deep Learning. Retrieved 2024-06-24 from https://www.scaler.com/topics/deep-learning/attention-mechanism-deep-learning/
  [25] Michael Meli, Matthew R McNiece, and Bradley Reaves. 2019. How bad can it git? Characterizing secret leakage in public GitHub repositories. In NDSS
  [26] Michenriksen. 2014. gitrob. Retrieved 2024-03-26 from https://github.com/michenriksen/gitrob
  [27] Matthew B Miles. 1994. Qualitative data analysis: An expanded sourcebook. Thousand Oaks (1994)
  [28] Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016)
  [29] Aakanksha Saha, Tamara Denning, Vivek Srikumar, and Sneha Kumar Kasera. 2020. Secrets in source code: Reducing false positives using machine learning. In 2020 International Conference on COMmunication Systems & NETworkS (COMSNETS). IEEE, 168–175
  [30] Takaya Saito and Marc Rehmsmeier. 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PloS one 10, 3 (2015), e0118432
  [31] Truffle Security. 2016. Regular Expressions used in TruffleHog. Retrieved 2024-03-27 from https://github.com/trufflesecurity/trufflehog/tree/main/pkg/detectors
  [32] Claude Elwood Shannon. 1948. A mathematical theory of communication. The Bell system technical journal 27, 3 (1948), 379–423
  [33] Vibha Singhal Sinha, Diptikalyan Saha, Pankaj Dhoolia, Rohan Padhye, and Senthil Mani. 2015. Detecting and mitigating secret-key leaks in source code repositories. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories. IEEE, 396–400
  [34] Sophos. 2023. Sophos 2023 Threat Report. Retrieved 2024-03-21 from https://www.sophos.com/en-us/content/security-threat-report
  [35] TruffleSecurity. 2016. TruffleHog. Retrieved 2024-02-02 from https://github.com/trufflesecurity/trufflehog
  [36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)