On Fixing Insecure AI-Generated Code through Model Fine-Tuning and Prompting Strategies
Pith reviewed 2026-05-08 09:06 UTC · model grok-4.3
The pith
No fine-tuning or prompting strategy consistently eliminates security weaknesses in AI-generated code across models and languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Security improvements achieved through fine-tuning and prompting strategies are highly strategy- and model-dependent. Some approaches reduce specific classes of weaknesses yet frequently introduce new weaknesses as side effects of the remediation. No strategy consistently eliminates weaknesses across all models and scenarios, demonstrating the absence of a universally effective solution for secure AI-generated code.
What carries the argument
Comparative evaluation of fine-tuning and prompting refinement methods on code generation models, quantified by shifts in CWE prevalence, severity, and co-occurrence detected through static analysis.
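To make the measurement concrete: a minimal sketch of how prevalence shifts and CWE co-occurrence could be tallied from analyzer output, assuming findings arrive as (sample_id, cwe_id) pairs. The function names and data layout are illustrative, not the paper's actual pipeline.

```python
from collections import Counter
from itertools import combinations

def prevalence(findings, n_samples):
    """Fraction of generated samples flagged with each CWE.

    findings: iterable of (sample_id, cwe_id) pairs from a static analyzer.
    """
    per_cwe = {}
    for sample_id, cwe in findings:
        per_cwe.setdefault(cwe, set()).add(sample_id)
    return {cwe: len(ids) / n_samples for cwe, ids in per_cwe.items()}

def prevalence_shift(before, after):
    """Per-CWE change under a remediation strategy; positive = newly introduced."""
    cwes = set(before) | set(after)
    return {c: after.get(c, 0.0) - before.get(c, 0.0) for c in cwes}

def cooccurrence(findings):
    """Count how often two CWEs appear in the same code sample."""
    by_sample = {}
    for sample_id, cwe in findings:
        by_sample.setdefault(sample_id, set()).add(cwe)
    pairs = Counter()
    for cwes in by_sample.values():
        pairs.update(combinations(sorted(cwes), 2))
    return pairs
```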
If this is right
- Fixing one class of weakness in AI-generated code often creates new weaknesses in other parts of the same program.
- The success of any remediation method depends on both the underlying model and the target programming language.
- Developers cannot assume that applying one tested strategy will produce secure code in all deployment contexts.
- Combined or adaptive approaches may be required because single-strategy fixes leave residual risks.
Where Pith is reading between the lines
- If static analysis misses context-dependent flaws, the reported side effects of fixes could be understated.
- Model-specific tuning might encourage fragmented security practices rather than portable solutions.
- Future pipelines could incorporate iterative verification steps that re-scan code after each attempted fix, as in the sketch below.
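A minimal sketch of such a loop, assuming `scan` wraps a static analyzer that returns a set of CWE ids and `generate_fix` wraps a model call; both names are placeholders, not the paper's tooling.

```python
def harden(code, scan, generate_fix, max_rounds=3):
    """Re-scan after each attempted fix, so weaknesses introduced
    by a remediation are caught instead of silently shipped.

    scan(code) -> set of CWE ids found by a static analyzer.
    generate_fix(code, cwes) -> revised code from a model prompt.
    """
    seen = [scan(code)]
    for _ in range(max_rounds):
        if not seen[-1]:
            break  # clean under the analyzer; stop iterating
        code = generate_fix(code, seen[-1])
        seen.append(scan(code))
        # Flag fixes that traded one weakness class for another.
        introduced = seen[-1] - seen[-2]
        if introduced:
            print(f"fix introduced new CWEs: {sorted(introduced)}")
    return code, seen
```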
Load-bearing premise
Static analysis tools provide complete and unbiased detection of security weaknesses, neither missing important vulnerabilities nor producing false positives that would change the relative performance of the tested strategies.
What would settle it
Re-running the full set of generated code samples through independent manual security audits by multiple experts and comparing the resulting weakness counts and types against the static analysis results before and after each strategy.
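If such an audit were run, scoring it reduces to treating the expert labels as ground truth. A sketch under that assumption, with both label sources given as per-sample CWE sets; the data layout is hypothetical.

```python
def agreement(tool_labels, audit_labels):
    """Precision/recall of static-analysis CWE detections against a manual audit.

    Both arguments: dict mapping sample_id -> set of CWE ids.
    The audit labels are treated as ground truth.
    """
    tp = fp = fn = 0
    for sample_id in tool_labels.keys() | audit_labels.keys():
        tool = tool_labels.get(sample_id, set())
        audit = audit_labels.get(sample_id, set())
        tp += len(tool & audit)   # confirmed detections
        fp += len(tool - audit)   # tool artifacts
        fn += len(audit - tool)   # weaknesses the tool missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```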
Original abstract
The security of AI-generated code remains a major obstacle to its widespread adoption. Although code generation models achieve strong performance on functional benchmarks, their outputs frequently contain bugs and security weaknesses that undermine their trustworthiness. Prior work has explored a range of approaches to mitigate security issues in AI-generated code, e.g., using static analysis-guided generation and prompt engineering. However, their effectiveness varies widely across models and settings. This paper presents a systematic investigation of strategies for hardening model-generated code against a list of Common Weakness Enumeration (CWE). We assess the extent to which these strategies improve security across models and programming languages, using fine-tuning and prompting approaches for model output refinement. Beyond the prevalence of security weaknesses, we analyse the severity of identified CWEs, their co-occurrence, and the unintended consequences of remediation (i.e., whether fixing certain weaknesses introduces new weaknesses elsewhere in the same code). Our results show that security improvements are highly strategy- and model-dependent. Although some approaches reduce specific classes of weaknesses, they often introduce new weaknesses as side effects of the fixes. Moreover, no strategy consistently eliminates weaknesses across all models and scenarios, highlighting the absence of a universally effective "bulletproof" solution for secure AI-generated code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a systematic empirical evaluation of fine-tuning and prompting strategies aimed at mitigating Common Weakness Enumerations (CWEs) in code generated by large language models. It assesses these strategies across multiple models and programming languages, measuring not only reductions in specific weaknesses but also their severity, co-occurrence patterns, and unintended side effects where remediation efforts introduce new CWEs. The core claim is that security improvements are highly strategy- and model-dependent, with fixes often creating new vulnerabilities and no approach proving universally effective across all scenarios.
Significance. If the empirical measurements hold after addressing validation concerns, the work is significant for demonstrating the trade-offs and lack of a 'bulletproof' solution in securing AI-generated code. The analysis of side effects and co-occurrences provides concrete evidence that remediation is not straightforward, which can guide future research toward more holistic or model-specific techniques. The multi-model, multi-language design adds practical value to the software engineering and AI security literature.
Major comments (2)
- §3 (Experimental Setup / CWE Detection): The central claims—that certain strategies reduce specific CWEs but introduce new ones as side effects, and that no strategy is universally effective—rest entirely on outputs from static analysis tools. No validation of these tools is reported (e.g., precision/recall against manual inspection of a sample, false-positive rates, or coverage for semantic vulnerabilities). Static analyzers are known to have incomplete coverage and language/model biases; without a concrete validation step (such as auditing 100 random samples per strategy), the side-effect findings risk being tool artifacts rather than genuine code changes. This directly affects the load-bearing conclusion about strategy dependence.
- §4 (Results and Analysis): The manuscript reports that improvements are 'highly strategy- and model-dependent' and that fixes 'often introduce new weaknesses,' but provides insufficient detail on sample sizes per model/strategy, number of generations evaluated, data exclusion criteria, and any statistical tests (e.g., significance of differences in CWE counts). These omissions make it impossible to assess whether the observed side effects and lack of universal effectiveness are robust or sensitive to post-hoc choices, as noted in the low soundness assessment.
Minor comments (2)
- [Abstract] The abstract refers to 'a list of Common Weakness Enumeration (CWE)' without naming the specific CWEs or their selection criteria; moving this detail to the abstract or §2 would improve immediate clarity for readers.
- [Results figures/tables] Figure or table captions describing CWE co-occurrence or severity distributions could explicitly state the total number of code samples underlying each bar or cell to aid interpretation.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We are pleased that the referee recognizes the significance of our empirical evaluation of fine-tuning and prompting strategies for securing AI-generated code. We address each major comment below and outline the revisions we plan to make to improve transparency and validation.
Point-by-point responses
- Referee: §3 (Experimental Setup / CWE Detection): The central claims—that certain strategies reduce specific CWEs but introduce new ones as side effects, and that no strategy is universally effective—rest entirely on outputs from static analysis tools. No validation of these tools is reported (e.g., precision/recall against manual inspection of a sample, false-positive rates, or coverage for semantic vulnerabilities). Static analyzers are known to have incomplete coverage and language/model biases; without a concrete validation step (such as auditing 100 random samples per strategy), the side-effect findings risk being tool artifacts rather than genuine code changes. This directly affects the load-bearing conclusion about strategy dependence.
Authors: We agree that validating the static analysis tools is essential to substantiate our claims about side effects and strategy dependence. The original manuscript relied on established static analyzers (CodeQL for C/C++ and Java, Bandit for Python) without a reported manual validation step. We will add a new subsection to §3 describing a manual audit of 100 randomly selected code samples per strategy-model-language combination. This audit will compute precision, recall, and false-positive rates for CWE detections and note any semantic issues missed by the tools. The results and their implications for the side-effect findings will be discussed, directly addressing the concern that observations may be tool artifacts.
Revision: yes
- Referee: §4 (Results and Analysis): The manuscript reports that improvements are 'highly strategy- and model-dependent' and that fixes 'often introduce new weaknesses,' but provides insufficient detail on sample sizes per model/strategy, number of generations evaluated, data exclusion criteria, and any statistical tests (e.g., significance of differences in CWE counts). These omissions make it impossible to assess whether the observed side effects and lack of universal effectiveness are robust or sensitive to post-hoc choices, as noted in the low soundness assessment.
Authors: We acknowledge that additional experimental details are required for reproducibility and to evaluate the robustness of our conclusions. The original submission summarized generation counts at a high level but omitted granular breakdowns and statistical analysis. In the revised manuscript, we will expand §4 with a dedicated 'Experimental Details' subsection reporting: exact sample sizes (e.g., 500 generations per model-strategy-language configuration), data exclusion criteria (non-parsable or non-compilable outputs), and statistical tests including p-values from McNemar's test for paired CWE presence and effect sizes for count differences. Updated tables and confidence intervals will be included to allow assessment of whether side effects and strategy dependence are sensitive to analysis choices.
Revision: yes
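For the paired presence/absence design the rebuttal describes, McNemar's test depends only on the two discordant counts: samples where a fix removed the CWE versus samples where it introduced one. A minimal sketch of the exact two-sided version; the (before, after) pair encoding is an assumption about how the data would be tabulated, not the authors' code.

```python
from math import comb

def mcnemar_exact(pairs):
    """Exact McNemar test for paired CWE presence before/after a strategy.

    pairs: iterable of (before, after) booleans, one per code sample,
    marking whether a given CWE was detected in that sample.
    """
    b = sum(1 for before, after in pairs if before and not after)  # weakness fixed
    c = sum(1 for before, after in pairs if not before and after)  # weakness introduced
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence either way
    # Two-sided exact p-value: tail of Binomial(n, 1/2) around min(b, c).
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(p, 1.0)

# Example: 30 samples fixed, 12 newly flagged, rest concordant.
pairs = [(True, False)] * 30 + [(False, True)] * 12 + [(False, False)] * 458
print(mcnemar_exact(pairs))  # (30, 12, p ≈ 0.009)
```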
Circularity Check
Minor self-citation present but not load-bearing for empirical claims
Full rationale
This is an empirical measurement study whose central claims (strategy- and model-dependent security improvements, introduction of new weaknesses as side effects, and absence of a universal solution) rest on experimental outcomes from applying fine-tuning/prompting strategies and detecting CWEs via static analysis. No derivation chain, equations, or fitted parameters reduce to inputs by construction. A minor self-citation to prior prompting work exists but is not load-bearing for the new side-effect findings, which have independent experimental content.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Static analysis tools reliably identify the listed CWEs without systematic false positives or negatives that would bias strategy comparisons.