pith. machine review for the scientific record.

arxiv: 2605.05867 · v1 · submitted 2026-05-07 · 💻 cs.SE


On Fixing Insecure AI-Generated Code through Model Fine-Tuning and Prompting Strategies


Pith reviewed 2026-05-08 09:06 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI-generated code · code security · fine-tuning · prompt engineering · Common Weakness Enumeration · static analysis · software vulnerabilities · model evaluation

The pith

No fine-tuning or prompting strategy consistently eliminates security weaknesses in AI-generated code across models and languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether fine-tuning code generation models or crafting special prompts can remove security flaws from their outputs. It applies these strategies to multiple models and uses static analysis to measure the resulting changes in Common Weakness Enumeration (CWE) occurrences. The evaluation tracks not only reductions in weaknesses but also their severity, co-occurrence patterns, and whether fixing one issue creates new ones elsewhere in the code. Results show that any gains remain tied to the specific strategy and model: side effects are common, and no approach succeeds reliably in every case.
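
To make that measurement loop concrete, here is a minimal sketch of the before/after CWE diff, assuming Bandit (one of the analyzers the rebuttal below names for Python, version ≥ 1.7.3 so that CWE IDs appear in the JSON report) and hypothetical directory paths; the paper's own tooling may differ.

```python
# Minimal sketch of the before/after measurement step. Assumes Bandit
# (>= 1.7.3, which reports an "issue_cwe" field in JSON output) as the
# Python analyzer; the directory paths are hypothetical.
import json
import subprocess

def cwe_ids(path: str) -> set[str]:
    """Run Bandit over a directory and collect the CWE IDs it reports."""
    proc = subprocess.run(
        ["bandit", "-r", "-f", "json", path],
        capture_output=True, text=True,
    )
    report = json.loads(proc.stdout)
    return {f"CWE-{r['issue_cwe']['id']}" for r in report["results"]}

before = cwe_ids("generated/baseline")  # output of the unmodified model
after = cwe_ids("generated/refined")    # output after a fine-tuning/prompting strategy

print("weaknesses removed:   ", sorted(before - after))
print("weaknesses introduced:", sorted(after - before))  # the side-effect signal
```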

Core claim

Security improvements achieved through fine-tuning and prompting strategies are highly strategy- and model-dependent. Some approaches reduce specific classes of weaknesses yet frequently introduce new weaknesses as side effects of the remediation. No strategy consistently eliminates weaknesses across all models and scenarios, demonstrating the absence of a universally effective solution for secure AI-generated code.

What carries the argument

Comparative evaluation of fine-tuning and prompt-refinement methods on code generation models, quantified by shifts in CWE prevalence, severity, and co-occurrence detected through static analysis.
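
As an illustration of the co-occurrence side of that evaluation, a minimal sketch of Cramér's V between original and introduced CWEs (the statistic behind the paper's Figure 9), computed here on made-up labels rather than the paper's data:

```python
# Minimal sketch of the Cramér's V computation behind co-occurrence heatmaps
# like Figure 9. The paired CWE labels below are illustrative only.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

pairs = pd.DataFrame({
    "original_cwe":   ["CWE-79", "CWE-89", "CWE-79", "CWE-22", "CWE-89", "CWE-79"],
    "introduced_cwe": ["CWE-20", "CWE-20", "CWE-22", "CWE-20", "CWE-79", "CWE-20"],
})
table = pd.crosstab(pairs["original_cwe"], pairs["introduced_cwe"])

def cramers_v(contingency: pd.DataFrame) -> float:
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    chi2 = chi2_contingency(contingency)[0]
    n = contingency.to_numpy().sum()
    k = min(contingency.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

print(f"Cramér's V: {cramers_v(table):.3f}")
```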

If this is right

  • Fixing one class of weakness in AI-generated code often creates new weaknesses in other parts of the same program.
  • The success of any remediation method depends on both the underlying model and the target programming language.
  • Developers cannot assume that applying one tested strategy will produce secure code in all deployment contexts.
  • Combined or adaptive approaches may be required because single-strategy fixes leave residual risks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If static analysis misses context-dependent flaws, the reported side effects of fixes could be understated.
  • Model-specific tuning might encourage fragmented security practices rather than portable solutions.
  • Future pipelines could incorporate iterative verification steps that re-scan code after each attempted fix (a minimal sketch follows this list).
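
A minimal sketch of that re-scan loop, assuming hypothetical `scan` and `apply_fix` helpers rather than anything from the paper:

```python
# Minimal sketch of an iterative verification loop (the third point above).
# `scan` and `apply_fix` are hypothetical helpers: `scan` returns the set of
# CWE IDs a static analyzer finds, `apply_fix` makes one remediation attempt.
def harden(code: str, scan, apply_fix, max_rounds: int = 5) -> str:
    findings = scan(code)
    for _ in range(max_rounds):
        if not findings:
            break  # clean under the analyzer; stop
        candidate = apply_fix(code, findings)
        new_findings = scan(candidate)  # re-scanning catches introduced CWEs
        if new_findings >= findings:
            break  # the fix traded weaknesses rather than removing them
        code, findings = candidate, new_findings
    return code
```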

Load-bearing premise

Static analysis tools provide complete and unbiased detection of security weaknesses, neither missing important vulnerabilities nor producing false positives that would change the relative performance of the tested strategies.

What would settle it

Re-running the full set of generated code samples through independent manual security audits by multiple experts and comparing the resulting weakness counts and types against the static analysis results before and after each strategy.
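
One way to make such a comparison quantitative is inter-rater agreement; a minimal sketch using scikit-learn's Cohen's kappa over illustrative binary verdicts (not data from the paper):

```python
# Minimal sketch of quantifying auditor/tool agreement. Binary per-sample
# labels ("contains at least one weakness") are illustrative only.
from sklearn.metrics import cohen_kappa_score

auditor_a = [1, 0, 1, 1, 0, 1, 0, 0]  # expert 1's verdicts on audited samples
auditor_b = [1, 0, 1, 0, 0, 1, 0, 1]  # expert 2's verdicts on the same samples
analyzer  = [1, 1, 1, 0, 0, 1, 0, 0]  # static analysis verdicts

print("expert vs. expert (kappa):", cohen_kappa_score(auditor_a, auditor_b))
print("tool vs. expert A (kappa):", cohen_kappa_score(analyzer, auditor_a))
```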

Figures

Figures reproduced from arXiv: 2605.05867 by Ali Soltanian Fard Jahromi, Amjed Tahir, Foutse Khomh, Peng Liang.

Figure 1. An overview of the study methodology.
Figure 2. Example of code from a selected scenario and code generated using an AI model.
Figure 3. Negative Example Prompting used for Scenario 10 (cross-site scripting).
Figure 4. Chain-of-Thought Prompting used for Scenario 10 (cross-site scripting).
Figure 5. Instructions given to each model for generating a prompt that the model deems effective for secure […]
Figure 6. Percentage of code samples containing security weaknesses (by model and language).
Figure 7. Frequency and distribution of CWEs by programming language. CWE frequencies are aggregated […]
Figure 8. Frequency and distribution of CWEs by model. CWE frequencies are aggregated across all four […]
Figure 9. Heatmaps showing the Cramér's V correlation between the original CWEs and the CWEs introduced […]
Figure 10. Frequency of introduced CWEs per model (aggregated across programming languages).
Figure 11. Original vs. introduced CWEs when using Meta Prompting with DeepSeek R1 32B Distill.
Figure 12. Original vs. introduced CWEs when using Meta Prompting with DeepSeek R1 32B Distill.
Figure 13. Network graph illustrating CWEs introduced by each model refinement technique, along with the […]
Original abstract

The security of AI-generated code remains a major obstacle to its widespread adoption. Although code generation models achieve strong performance on functional benchmarks, their outputs frequently contain bugs and security weaknesses that undermine their trustworthiness. Prior work has explored a range of approaches to mitigate security issues in AI-generated code, e.g., using static analysis-guided generation and prompt engineering. However, their effectiveness varies widely across models and settings. This paper presents a systematic investigation of strategies for hardening model-generated code against a list of Common Weakness Enumeration (CWE). We assess the extent to which these strategies improve security across models and programming languages, using fine-tuning and prompting approaches for model output refinement. Beyond the prevalence of security weaknesses, we analyse the severity of identified CWEs, their co-occurrence, and the unintended consequences of remediation (i.e., whether fixing certain weaknesses introduces new weaknesses elsewhere in the same code). Our results show that security improvements are highly strategy- and model-dependent. Although some approaches reduce specific classes of weaknesses, they often introduce new weaknesses as side effects of the fixes. Moreover, no strategy consistently eliminates weaknesses across all models and scenarios, highlighting the absence of a universally effective "bulletproof" solution for secure AI-generated code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a systematic empirical evaluation of fine-tuning and prompting strategies aimed at mitigating Common Weakness Enumerations (CWEs) in code generated by large language models. It assesses these strategies across multiple models and programming languages, measuring not only reductions in specific weaknesses but also their severity, co-occurrence patterns, and unintended side effects where remediation efforts introduce new CWEs. The core claim is that security improvements are highly strategy- and model-dependent, with fixes often creating new vulnerabilities and no approach proving universally effective across all scenarios.

Significance. If the empirical measurements hold after addressing validation concerns, the work is significant for demonstrating the trade-offs and lack of a 'bulletproof' solution in securing AI-generated code. The analysis of side effects and co-occurrences provides concrete evidence that remediation is not straightforward, which can guide future research toward more holistic or model-specific techniques. The multi-model, multi-language design adds practical value to the software engineering and AI security literature.

major comments (2)
  1. §3 (Experimental Setup / CWE Detection): The central claims—that certain strategies reduce specific CWEs but introduce new ones as side effects, and that no strategy is universally effective—rest entirely on outputs from static analysis tools. No validation of these tools is reported (e.g., precision/recall against manual inspection of a sample, false-positive rates, or coverage for semantic vulnerabilities). Static analyzers are known to have incomplete coverage and language/model biases; without a concrete validation step (such as auditing 100 random samples per strategy), the side-effect findings risk being tool artifacts rather than genuine code changes. This directly affects the load-bearing conclusion about strategy dependence.
  2. §4 (Results and Analysis): The manuscript reports that improvements are 'highly strategy- and model-dependent' and that fixes 'often introduce new weaknesses,' but provides insufficient detail on sample sizes per model/strategy, number of generations evaluated, data exclusion criteria, and any statistical tests (e.g., significance of differences in CWE counts). These omissions make it impossible to assess whether the observed side effects and lack of universal effectiveness are robust or sensitive to post-hoc choices, as noted in the low soundness assessment.
minor comments (2)
  1. [Abstract] The abstract refers to 'a list of Common Weakness Enumeration (CWE)' without naming the specific CWEs or their selection criteria; moving this detail to the abstract or §2 would improve immediate clarity for readers.
  2. [Results figures/tables] Figure or table captions describing CWE co-occurrence or severity distributions could explicitly state the total number of code samples underlying each bar or cell to aid interpretation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We are pleased that the referee recognizes the significance of our empirical evaluation of fine-tuning and prompting strategies for securing AI-generated code. We address each major comment below and outline the revisions we plan to make to improve transparency and validation.

Point-by-point responses
  1. Referee: §3 (Experimental Setup / CWE Detection): The central claims—that certain strategies reduce specific CWEs but introduce new ones as side effects, and that no strategy is universally effective—rest entirely on outputs from static analysis tools. No validation of these tools is reported (e.g., precision/recall against manual inspection of a sample, false-positive rates, or coverage for semantic vulnerabilities). Static analyzers are known to have incomplete coverage and language/model biases; without a concrete validation step (such as auditing 100 random samples per strategy), the side-effect findings risk being tool artifacts rather than genuine code changes. This directly affects the load-bearing conclusion about strategy dependence.

    Authors: We agree that validating the static analysis tools is essential to substantiate our claims about side effects and strategy dependence. The original manuscript relied on established static analyzers (CodeQL for C/C++ and Java, Bandit for Python) without a reported manual validation step. We will add a new subsection to §3 describing a manual audit of 100 randomly selected code samples per strategy-model-language combination. This audit will compute precision, recall, and false-positive rates for CWE detections and note any semantic issues missed by the tools (a metric sketch follows these responses). The results and their implications for the side-effect findings will be discussed, directly addressing the concern that observations may be tool artifacts. revision: yes

  2. Referee: §4 (Results and Analysis): The manuscript reports that improvements are 'highly strategy- and model-dependent' and that fixes 'often introduce new weaknesses,' but provides insufficient detail on sample sizes per model/strategy, number of generations evaluated, data exclusion criteria, and any statistical tests (e.g., significance of differences in CWE counts). These omissions make it impossible to assess whether the observed side effects and lack of universal effectiveness are robust or sensitive to post-hoc choices, as noted in the low soundness assessment.

    Authors: We acknowledge that additional experimental details are required for reproducibility and to evaluate the robustness of our conclusions. The original submission summarized generation counts at a high level but omitted granular breakdowns and statistical analysis. In the revised manuscript, we will expand §4 with a dedicated 'Experimental Details' subsection reporting: exact sample sizes (e.g., 500 generations per model-strategy-language configuration), data exclusion criteria (non-parsable or non-compilable outputs), and statistical tests including p-values from McNemar's test for paired CWE presence and effect sizes for count differences (a test sketch follows these responses). Updated tables and confidence intervals will be included to allow assessment of whether side effects and strategy dependence are sensitive to analysis choices. revision: yes
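
Sketches of both planned analyses follow. First, the validation metrics from response 1: a minimal sketch treating the manual audit as ground truth for each sampled file (the flags are illustrative, not the paper's data):

```python
# Minimal sketch of the validation metrics promised in response 1, treating
# the manual audit as ground truth. Flags are illustrative; the real audit
# would cover 100 samples per strategy-model-language combination.
def audit_metrics(tool: list[bool], audit: list[bool]) -> dict[str, float]:
    tp = sum(t and a for t, a in zip(tool, audit))          # true positives
    fp = sum(t and not a for t, a in zip(tool, audit))      # false positives
    fn = sum(a and not t for t, a in zip(tool, audit))      # misses
    tn = sum(not t and not a for t, a in zip(tool, audit))  # true negatives
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }

print(audit_metrics(tool=[True, True, False, False, True],
                    audit=[True, False, False, True, True]))
```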
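
Second, the exact McNemar test from response 2, which reduces to a binomial test on discordant pairs (counts here are illustrative):

```python
# Minimal sketch of the exact McNemar test named in response 2: paired CWE
# presence before vs. after a strategy reduces to a binomial test on the
# discordant pairs. The counts are illustrative, not the paper's data.
from scipy.stats import binomtest

b = 17  # weakness present before the fix but not after
c = 6   # weakness absent before but introduced after
result = binomtest(min(b, c), n=b + c, p=0.5, alternative="two-sided")
print(f"discordant pairs: {b + c}, exact McNemar p = {result.pvalue:.4f}")
```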

Circularity Check

0 steps flagged

Minor self-citation present but not load-bearing for empirical claims

Full rationale

This is an empirical measurement study whose central claims (strategy- and model-dependent security improvements, introduction of new weaknesses as side effects, and absence of a universal solution) rest on experimental outcomes from applying fine-tuning/prompting strategies and detecting CWEs via static analysis. No derivation chain, equations, or fitted parameters reduce to inputs by construction. A minor self-citation to prior prompting work exists but is not load-bearing for the new side-effect findings, which have independent experimental content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that CWE detection via static analysis is a valid proxy for real security risk and that the chosen models and languages are representative. No free parameters are introduced because the study is comparative rather than predictive.

axioms (1)
  • domain assumption: Static analysis tools reliably identify the listed CWEs without systematic false positives or negatives that would bias strategy comparisons.
    Invoked when the paper uses CWE counts as the primary outcome measure.

pith-pipeline@v0.9.0 · 5529 in / 1191 out tokens · 55184 ms · 2026-05-08T09:06:14.459952+00:00 · methodology

