A Quasi-Experimental Developer Study of Security Training in LLM-Assisted Web Application Development
Pith reviewed 2026-05-10 05:13 UTC · model grok-4.3
The pith
Security training reduces validated weaknesses in LLM-assisted backend code by 31.5 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the tested conditions, post-training runs reduce validated security burden in LLM-assisted backend development without modifying the model. The post-training condition showed a significant paired reduction under an exact Wilcoxon signed-rank test (p = 0.0059). Validated weaknesses decreased from 162 to 111 (31.5%), the severity-weighted burden decreased from 432 to 267 (38.2%), and critical findings decreased from 24 to 5 (79.2%). The largest reductions were in authorization and object access (53.3%) and in authentication, credential policy, and recovery weaknesses (44.7%).
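The headline percentages can be re-derived from the reported aggregate counts. The snippet below is a small arithmetic check, not part of the paper; the function name `percent_reduction` and the dictionary layout are illustrative choices, while the before/after counts are the ones quoted above.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Aggregate counts as reported in the abstract: (pre-training, post-training).
reported = {
    "validated weaknesses": (162, 111),
    "severity-weighted burden": (432, 267),
    "critical findings": (24, 5),
}

# Round to one decimal place, matching the precision used in the abstract.
reductions = {
    name: round(percent_reduction(before, after), 1)
    for name, (before, after) in reported.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}%")
```

The three values round to 31.5, 38.2, and 79.2 percent, which agrees with the figures in the abstract.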
What carries the argument
The layer-based security training package paired with independent manual validation of submitted repositories using a severity-weighted weakness score.
If this is right
- Authorization and object access weaknesses dropped by 53.3 percent.
- Authentication, credential policy, and recovery weaknesses dropped by 44.7 percent.
- Session and browser trust-boundary issues showed minimal change.
- Sensitive-data and cryptographic weaknesses showed only marginal improvement.
- The results do not support replacing secure defaults, static analysis, expert review, or operational hardening.
Where Pith is reading between the lines
- The same training format might produce comparable reductions when developers work in other languages or application layers.
- Embedding short training modules into LLM coding workflows could lower aggregate security debt across teams.
- The pre/post design could be extended to measure whether gains persist after several weeks or when developers switch models.
Load-bearing premise
The manual validation of security weaknesses performed by the first and second authors is unbiased, consistent across reviewers, and accurately captures real-world exploitability.
What would settle it
A larger replication with blinded validators on the same or similar tasks that finds no significant paired difference in validated weakness counts after training.
Original abstract
This paper presents a controlled quasi-experimental developer study examining whether a layer-based security training package is associated with improved security quality in LLM-assisted implementation of an identity-centric Java Spring Boot backend. The study uses a mixed design with a within-subject pre-training versus post-training comparison and an exploratory between-subject expertise factor. Twelve developers completed matched runs under a common interface, fixed model configuration, counterbalanced task sets, and a shared starter project. Security outcomes were assessed via independent manual validation of submitted repositories by the first and second authors. The primary participant-level endpoint was a severity-weighted validated-weakness score. The post-training condition showed a significant paired reduction under an exact Wilcoxon signed-rank test (p = 0.0059). In aggregate, validated weaknesses decreased from 162 to 111 (31.5%), the severity-weighted burden decreased from 432 to 267 (38.2%), and critical findings decreased from 24 to 5 (79.2%). The largest reductions were in authorization and object access (53.3%) and in authentication, credential policy, and recovery weaknesses (44.7%). Session and browser trust-boundary issues showed minimal change, while sensitive-data and cryptographic weaknesses showed only marginal improvement. These results suggest that, under the tested conditions, post-training runs reduce validated security burden in LLM-assisted backend development without modifying the model. They do not support replacing secure defaults, static analysis, expert review, or operational hardening.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper reports a quasi-experimental developer study with 12 participants examining the effect of a layer-based security training package on security outcomes in LLM-assisted implementation of an identity-centric Java Spring Boot backend. Using a mixed within-subject (pre- vs. post-training) and exploratory between-subject (expertise) design with counterbalanced tasks and fixed model settings, the authors claim statistically significant reductions in manually validated security weaknesses after training: validated weaknesses fell from 162 to 111 (31.5%), severity-weighted burden from 432 to 267 (38.2%), and critical findings from 24 to 5 (79.2%), with an exact Wilcoxon signed-rank test yielding p=0.0059 on the primary participant-level severity-weighted score. Largest gains occurred in authorization/object access and authentication/credential weaknesses.
Significance. If the results are robust, the work supplies concrete empirical evidence that targeted security training can measurably reduce validated security burden in LLM-assisted backend development without altering the underlying model or defaults and without requiring post-generation static analysis. This is a useful contribution to the intersection of usable security and AI-assisted software engineering. The design includes appropriate non-parametric paired testing, task counterbalancing, and a shared starter project, which are strengths. The small sample (n=12) and author-only validation, however, constrain claims about generalizability and real-world exploitability.
major comments (1)
- [Security outcome assessment] As described in the abstract and methods, the manual validation of repositories into weakness categories and severity levels was performed solely by the first and second authors. No blinding protocol, inter-rater reliability statistic (e.g., Cohen’s kappa), or explicit standardized classification criteria are reported. Because the headline aggregates (162→111 weaknesses, 432→267 severity-weighted score, 24→5 critical) and the p=0.0059 result rest entirely on these classifications, any systematic shift in judgment between pre- and post-training code (possible in an unblinded within-subject design) would directly alter the paired differences. This measurement step is load-bearing for the central claim.
minor comments (2)
- [Abstract] The exploratory between-subject expertise factor is mentioned but its statistical treatment and interaction with the within-subject training effect are not detailed in the abstract; a brief clarification would help readers interpret the mixed design.
- [Methods] The paper could usefully report the exact definition and weighting scheme used for the severity-weighted burden score (e.g., how critical, high, medium, and low map to numeric weights) to allow replication.
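To make the second minor comment concrete, the sketch below shows one way a severity-weighted burden score could be defined. The weights (critical=4, high=3, medium=2, low=1), the `SEVERITY_WEIGHTS` name, and the example findings are hypothetical; the paper's actual weighting scheme is not given in the text reviewed here.

```python
from collections import Counter

# Hypothetical weights, chosen only for illustration. The paper's real mapping
# from severity level to numeric weight is not stated in the reviewed text.
SEVERITY_WEIGHTS = {"critical": 4, "high": 3, "medium": 2, "low": 1}


def severity_weighted_score(findings):
    """Sum the weight of every validated finding for one participant run."""
    counts = Counter(findings)
    unknown = set(counts) - set(SEVERITY_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown severity labels: {sorted(unknown)}")
    return sum(SEVERITY_WEIGHTS[level] * n for level, n in counts.items())


# Made-up findings for a single participant, before and after training.
pre_findings = ["critical", "high", "high", "medium", "low"]
post_findings = ["high", "medium", "low"]

pre_score = severity_weighted_score(pre_findings)    # 4 + 3 + 3 + 2 + 1
post_score = severity_weighted_score(post_findings)  # 3 + 2 + 1
print(pre_score, post_score, pre_score - post_score)
```

Under a scheme like this, the paired change score per participant is simply the pre-training score minus the post-training score, which is the quantity a paired test would operate on.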
Simulated Author's Rebuttal
We thank the referee for the constructive review and for acknowledging the design strengths of the study. We address the single major comment below with a commitment to improve transparency in the revised manuscript.
- Referee: Security outcome assessment (as described in the abstract and methods): The manual validation of repositories into weakness categories and severity levels was performed solely by the first and second authors. No blinding protocol, inter-rater reliability statistic (e.g., Cohen’s kappa), or explicit standardized classification criteria are reported. Because the headline aggregates (162→111 weaknesses, 432→267 severity-weighted score, 24→5 critical) and the p=0.0059 result rest entirely on these classifications, any systematic shift in judgment between pre- and post-training code (possible in an unblinded within-subject design) would directly alter the paired differences. This measurement step is load-bearing for the central claim.
Authors: We agree that the current reporting of the validation process is insufficient and constitutes a genuine methodological limitation. In the revised manuscript we will expand the Methods section to include: (1) the full standardized classification criteria (with concrete examples for each weakness category drawn from the OWASP Top 10 and CWE mappings used); (2) the inter-rater reliability statistic (Cohen’s kappa) computed from the independent classifications performed by the two authors; and (3) an explicit statement that the authors were not blinded to condition together with a brief discussion of the resulting bias risk. While complete blinding is difficult in a small-scale study in which the validators are also the designers, these additions will allow readers to evaluate the robustness of the paired differences. We do not claim the validation was free of potential bias; we only maintain that the independent dual-author review and the use of a severity-weighted primary endpoint provide a defensible, if imperfect, measurement approach.
Revision: yes
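For readers unfamiliar with the statistic the referee asks for, the sketch below computes Cohen's kappa for two raters from first principles. The function name `cohens_kappa` and the severity labels assigned by the two raters are hypothetical; they are not the authors' classifications.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled independently at their own rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label only")
    return (observed - expected) / (1.0 - expected)


# Hypothetical severity labels from two validators for the same eight findings.
validator_1 = ["critical", "high", "high", "medium", "medium", "low", "low", "low"]
validator_2 = ["critical", "high", "medium", "medium", "medium", "low", "low", "high"]

kappa = cohens_kappa(validator_1, validator_2)
print(round(kappa, 3))
```

In this made-up example the raters agree on 6 of 8 findings (0.75), chance agreement is 17/64, and kappa works out to 31/47, roughly 0.66. Reporting the statistic separately for pre- and post-training repositories would also show whether agreement differs by condition.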
Circularity Check
No circularity: purely empirical pre/post comparison with direct measurement
The paper reports results from a quasi-experimental within-subject study of developer performance before and after security training in LLM-assisted coding. Central claims rest on aggregate counts of manually validated weaknesses, severity-weighted scores, and an exact Wilcoxon signed-rank test applied to paired participant data. No equations, model derivations, fitted parameters renamed as predictions, ansatzes, uniqueness theorems, or self-citations appear in the provided text or abstract. The measurement step (author validation of repositories) is a methodological choice whose validity can be assessed externally but does not reduce the reported reductions or p-value to the input data by construction. The study is self-contained empirical work.
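The exact Wilcoxon signed-rank test named above can be illustrated by brute-force enumeration, which is feasible for 12 pairs because there are only 2^12 = 4096 sign patterns. This is a sketch under stated assumptions: the function `exact_wilcoxon_signed_rank` is written for this note, the two-sided p-value is defined as the probability of a rank sum at least as far from its null mean as the observed one, and the participant scores are hypothetical, not the paper's data.

```python
from itertools import product


def exact_wilcoxon_signed_rank(pre, post):
    """Return (W+, two-sided exact p) for paired samples.

    Zero differences are dropped and tied absolute differences receive
    average ranks. The null distribution is enumerated over all sign patterns.
    """
    diffs = [a - b for a, b in zip(pre, post) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2  # mean of W+ under the null hypothesis
    observed_dev = abs(w_plus - center)
    # Count sign patterns whose rank sum is at least as extreme as observed.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed_dev - 1e-12:
            hits += 1
    return w_plus, hits / 2 ** n


# Hypothetical severity-weighted scores for 12 participants (not the paper's data).
pre_scores = [30, 41, 25, 38, 44, 29, 36, 50, 33, 27, 40, 35]
post_scores = [18, 32, 26, 31, 33, 26, 31, 40, 29, 25, 32, 29]

w_plus, p_value = exact_wilcoxon_signed_rank(pre_scores, post_scores)
print(w_plus, p_value)
```

In this made-up data set eleven participants improve and one worsens by the smallest margin, so W+ is 77 out of a maximum of 78 and only 4 of the 4096 sign patterns are as extreme. Library implementations may handle zeros and ties differently, so small discrepancies with other tools are possible.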
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] The Wilcoxon signed-rank test is appropriate for the paired severity-weighted scores, and the data meet its assumptions.
- [domain assumption] Manual validation by the first and second authors produces consistent and unbiased weakness classifications.
Appendix A: Participant-level severity profiles
Table V reports raw severity composition for each participant in both conditions. Its purpose is transparency; formal analysis is based on paired participant-level change scores.
Table V: Participant-level severity profile before and after train...