arxiv: 2604.17763 · v1 · submitted 2026-04-20 · 💻 cs.CR · cs.LG

A Quasi-Experimental Developer Study of Security Training in LLM-Assisted Web Application Development

Pith reviewed 2026-05-10 05:13 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords LLM-assisted development · security training · web application security · quasi-experimental study · vulnerability reduction · Spring Boot · developer study · authorization weaknesses

The pith

Security training reduces validated weaknesses in LLM-assisted backend code by 31.5 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether a layer-based security training package improves security outcomes when developers use large language models to implement an identity-centric Java Spring Boot backend. Twelve participants completed matched pre-training and post-training tasks under fixed model settings, counterbalanced tasks, and a shared starter project. Independent manual validation of the submitted repositories found a statistically significant drop in severity-weighted weaknesses after training. The largest gains appeared in authorization and authentication categories while some other weakness types changed little. The work positions training as a complement to, rather than a replacement for, secure defaults and later review steps.
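The counterbalancing mentioned above can be pictured with a minimal sketch. It assumes two task sets, here called `A` and `B`, and splits participants evenly so that half see `A` before training and `B` after, while the other half see the reverse order. The task-set labels, participant IDs, function name, and seed are all hypothetical; the paper's actual assignment procedure is not described in this summary.

```python
import random


def counterbalance(participants, seed=0):
    """Assign each participant a (pre, post) task-set order.

    Hypothetical scheme: after shuffling, the first half gets A then B,
    the second half gets B then A, so task difficulty is not confounded
    with the pre/post training condition.
    """
    rng = random.Random(seed)  # fixed seed keeps the plan reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    plan = {}
    for index, participant in enumerate(shuffled):
        if index < half:
            plan[participant] = {"pre": "A", "post": "B"}
        else:
            plan[participant] = {"pre": "B", "post": "A"}
    return plan


# Twelve placeholder participant IDs (not the study's identifiers).
participants = [f"P{i:02d}" for i in range(1, 13)]
plan = counterbalance(participants)
for participant in sorted(plan):
    print(participant, plan[participant])
```

With twelve participants, six start on each task set, so any difference in difficulty between the two sets is spread evenly across the pre-training and post-training conditions.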

Core claim

Under the tested conditions, post-training runs reduce validated security burden in LLM-assisted backend development without modifying the model. The post-training condition showed a significant paired reduction under an exact Wilcoxon signed-rank test (p = 0.0059). Validated weaknesses decreased from 162 to 111 (31.5%), the severity-weighted burden decreased from 432 to 267 (38.2%), and critical findings decreased from 24 to 5 (79.2%). The largest reductions were in authorization and object access (53.3%) and in authentication, credential policy, and recovery weaknesses (44.7%).
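The percentages quoted above follow directly from the before and after counts. A short check, using only the aggregate figures reported in the abstract (the function name is illustrative):

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Aggregate (before, after) counts as reported in the abstract.
reported = {
    "validated weaknesses": (162, 111),
    "severity-weighted burden": (432, 267),
    "critical findings": (24, 5),
}

reductions = {
    name: round(percent_reduction(before, after), 1)
    for name, (before, after) in reported.items()
}

for name, value in reductions.items():
    print(f"{name}: {value}%")
# validated weaknesses: 31.5%
# severity-weighted burden: 38.2%
# critical findings: 79.2%
```

These are reductions in pooled totals across all participants. The significance test, by contrast, is run on paired participant-level scores, so the two kinds of figure answer different questions.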

What carries the argument

The layer-based security training package paired with independent manual validation of submitted repositories using a severity-weighted weakness score.

If this is right

  • Authorization and object access weaknesses dropped by 53.3 percent.
  • Authentication, credential policy, and recovery weaknesses dropped by 44.7 percent.
  • Session and browser trust-boundary issues showed minimal change.
  • Sensitive-data and cryptographic weaknesses showed only marginal improvement.
  • The results do not support replacing secure defaults, static analysis, expert review, or operational hardening.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training format might produce comparable reductions when developers work in other languages or application layers.
  • Embedding short training modules into LLM coding workflows could lower aggregate security debt across teams.
  • The pre/post design could be extended to measure whether gains persist after several weeks or when developers switch models.

Load-bearing premise

The manual validation of security weaknesses performed by the first and second authors is unbiased, consistent across reviewers, and accurately captures real-world exploitability.

What would settle it

A larger replication with blinded validators on the same or similar tasks that finds no significant paired difference in validated weakness counts after training.

Figures

Figures reproduced from arXiv: 2604.17763 by Ahmed Sabbah, David Mohaisen, Mohammad Alkhanafseh, Mohammed Kharma, Radi Jarrar, Samer Zain.

Figure 1: Mixed-design developer study with a within-subject ...
Figure 2: Participant-level severity-weighted scores before and ...

Original abstract

This paper presents a controlled quasi-experimental developer study examining whether a layer-based security training package is associated with improved security quality in LLM-assisted implementation of an identity-centric Java Spring Boot backend. The study uses a mixed design with a within-subject pre-training versus post-training comparison and an exploratory between-subject expertise factor. Twelve developers completed matched runs under a common interface, fixed model configuration, counterbalanced task sets, and a shared starter project. Security outcomes were assessed via independent manual validation of submitted repositories by the first and second authors. The primary participant-level endpoint was a severity-weighted validated-weakness score. The post-training condition showed a significant paired reduction under an exact Wilcoxon signed-rank test (p = 0.0059). In aggregate, validated weaknesses decreased from 162 to 111 (31.5%), the severity-weighted burden decreased from 432 to 267 (38.2%), and critical findings decreased from 24 to 5 (79.2%). The largest reductions were in authorization and object access (53.3%) and in authentication, credential policy, and recovery weaknesses (44.7%). Session and browser trust-boundary issues showed minimal change, while sensitive-data and cryptographic weaknesses showed only marginal improvement. These results suggest that, under the tested conditions, post-training runs reduce validated security burden in LLM-assisted backend development without modifying the model. They do not support replacing secure defaults, static analysis, expert review, or operational hardening.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. This paper reports a quasi-experimental developer study with 12 participants examining the effect of a layer-based security training package on security outcomes in LLM-assisted implementation of an identity-centric Java Spring Boot backend. Using a mixed within-subject (pre- vs. post-training) and exploratory between-subject (expertise) design with counterbalanced tasks and fixed model settings, the authors claim statistically significant reductions in manually validated security weaknesses after training: validated weaknesses fell from 162 to 111 (31.5%), severity-weighted burden from 432 to 267 (38.2%), and critical findings from 24 to 5 (79.2%), with an exact Wilcoxon signed-rank test yielding p=0.0059 on the primary participant-level severity-weighted score. Largest gains occurred in authorization/object access and authentication/credential weaknesses.

Significance. If the results are robust, the work supplies concrete empirical evidence that targeted security training can measurably reduce validated security burden in LLM-assisted backend development without altering the underlying model, defaults, or requiring post-generation static analysis. This is a useful contribution to the intersection of usable security and AI-assisted software engineering. The design includes appropriate non-parametric paired testing, task counterbalancing, and a shared starter project, which are strengths. The small n=12 and author-only validation, however, constrain claims about generalizability and real-world exploitability.

major comments (1)
  1. [Security outcome assessment] Security outcome assessment (as described in the abstract and methods): The manual validation of repositories into weakness categories and severity levels was performed solely by the first and second authors. No blinding protocol, inter-rater reliability statistic (e.g., Cohen’s kappa), or explicit standardized classification criteria are reported. Because the headline aggregates (162→111 weaknesses, 432→267 severity-weighted score, 24→5 critical) and the p=0.0059 result rest entirely on these classifications, any systematic shift in judgment between pre- and post-training code (possible in an unblinded within-subject design) would directly alter the paired differences. This measurement step is load-bearing for the central claim.
minor comments (2)
  1. [Abstract] The exploratory between-subject expertise factor is mentioned but its statistical treatment and interaction with the within-subject training effect are not detailed in the abstract; a brief clarification would help readers interpret the mixed design.
  2. [Methods] The paper could usefully report the exact definition and weighting scheme used for the severity-weighted burden score (e.g., how critical, high, medium, and low map to numeric weights) to allow replication.
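To make the second minor comment concrete, here is a minimal sketch of how a severity-weighted burden score could be computed once a weighting scheme is fixed. The weights below (critical = 4, high = 3, medium = 2, low = 1) are purely hypothetical placeholders; the paper's actual mapping is not stated in the text reviewed here, which is exactly what the comment asks the authors to report.

```python
from collections import Counter

# HYPOTHETICAL weights chosen for illustration only; the paper's real
# severity-to-weight mapping is not given in the material above.
SEVERITY_WEIGHTS = {"critical": 4, "high": 3, "medium": 2, "low": 1}


def severity_weighted_score(findings):
    """Sum the weight of every validated finding in one repository.

    `findings` is an iterable of severity labels, one per finding.
    """
    counts = Counter(findings)
    unknown = set(counts) - set(SEVERITY_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown severity labels: {sorted(unknown)}")
    return sum(SEVERITY_WEIGHTS[label] * n for label, n in counts.items())


# Toy repository with six findings (not data from the study).
example_findings = ["critical", "high", "high", "medium", "low", "low"]
score = severity_weighted_score(example_findings)
print(score)  # 4 + 3 + 3 + 2 + 1 + 1 = 14
```

Because the score is linear in the weights, a different mapping can change each participant's pre/post difference and therefore the ranks that feed the signed-rank test, which is why the exact scheme matters for replication.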

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for acknowledging the design strengths of the study. We address the single major comment below with a commitment to improve transparency in the revised manuscript.

  1. Referee: Security outcome assessment (as described in the abstract and methods): The manual validation of repositories into weakness categories and severity levels was performed solely by the first and second authors. No blinding protocol, inter-rater reliability statistic (e.g., Cohen’s kappa), or explicit standardized classification criteria are reported. Because the headline aggregates (162→111 weaknesses, 432→267 severity-weighted score, 24→5 critical) and the p=0.0059 result rest entirely on these classifications, any systematic shift in judgment between pre- and post-training code (possible in an unblinded within-subject design) would directly alter the paired differences. This measurement step is load-bearing for the central claim.

    Authors: We agree that the current reporting of the validation process is insufficient and constitutes a genuine methodological limitation. In the revised manuscript we will expand the Methods section to include: (1) the full standardized classification criteria (with concrete examples for each weakness category drawn from the OWASP Top 10 and CWE mappings used); (2) the inter-rater reliability statistic (Cohen’s kappa) computed from the independent classifications performed by the two authors; and (3) an explicit statement that the authors were not blinded to condition together with a brief discussion of the resulting bias risk. While complete blinding is difficult in a small-scale study in which the validators are also the designers, these additions will allow readers to evaluate the robustness of the paired differences. We do not claim the validation was free of potential bias; we only maintain that the independent dual-author review and the use of a severity-weighted primary endpoint provide a defensible, if imperfect, measurement approach. revision: yes
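For readers unfamiliar with the inter-rater statistic named in this exchange, the following is a minimal sketch of Cohen's kappa for two raters assigning one categorical label per finding. The label sequences are toy values, not classifications from the study, and the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Toy severity labels from two hypothetical reviewers for eight findings.
reviewer_1 = ["high", "high", "low", "low", "medium", "medium", "high", "low"]
reviewer_2 = ["high", "medium", "low", "low", "medium", "medium", "high", "high"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))  # observed 0.75, chance 21/64, kappa = 27/43
```

Kappa corrects raw agreement for chance, but it does not detect a bias shared by both raters. The referee's concern about unblinded validation would therefore remain even if the reported kappa were high.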

Circularity Check

0 steps flagged

No circularity: purely empirical pre/post comparison with direct measurement

The paper reports results from a quasi-experimental within-subject study of developer performance before and after security training in LLM-assisted coding. Central claims rest on aggregate counts of manually validated weaknesses, severity-weighted scores, and an exact Wilcoxon signed-rank test applied to paired participant data. No equations, model derivations, fitted parameters renamed as predictions, ansatzes, uniqueness theorems, or self-citations appear in the provided text or abstract. The measurement step (author validation of repositories) is a methodological choice whose validity can be assessed externally but does not reduce the reported reductions or p-value to the input data by construction. The study is self-contained empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard statistical assumptions for the Wilcoxon signed-rank test and the validity of manual code review as a measurement method; no free parameters, new entities, or ad-hoc axioms are introduced.

axioms (2)
  • standard math: The Wilcoxon signed-rank test is appropriate for the paired severity-weighted scores, and the data meet its assumptions.
    Invoked for the primary significance test on the participant-level endpoint.
  • domain assumption: Manual validation by the first and second authors produces consistent and unbiased weakness classifications.
    Central to the outcome measurement but not independently verified in the reported abstract.
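The first axiom refers to an exact Wilcoxon signed-rank test on twelve paired scores. The sketch below shows one way such a test can be computed by enumerating all sign assignments over the observed ranks, which is feasible at this sample size (2^12 = 4096 cases). The pre/post values are invented toy numbers, not participant data, and the zero-difference and tie handling (drop zeros, use average ranks) is an assumption rather than the paper's documented procedure.

```python
from itertools import product


def average_ranks(values):
    """1-based ascending ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def exact_wilcoxon_two_sided(pre, post):
    """Return (statistic, p_value) for a two-sided exact signed-rank test."""
    diffs = [a - b for a, b in zip(pre, post) if a != b]  # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    observed = min(w_plus, w_minus)
    total = sum(ranks)
    hits = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= observed:
            hits += 1
    return observed, hits / 2 ** len(ranks)


# TOY severity-weighted scores for 12 hypothetical participants.
pre = [40, 35, 50, 28, 44, 31, 38, 42, 25, 36, 33, 30]
post = [22, 20, 30, 25, 27, 33, 21, 24, 19, 23, 18, 26]

statistic, p_value = exact_wilcoxon_two_sided(pre, post)
print(statistic, p_value)
```

With twelve non-zero differences the smallest attainable two-sided p-value is 2/4096, so a value such as the reported p = 0.0059 implies that at least some participants did not improve or improved only slightly. SciPy's `scipy.stats.wilcoxon` also provides an exact method in recent versions and would normally be preferred over a hand-rolled enumeration.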


Appendix A: Participant-level severity profiles

Table V reports raw severity composition for each participant in both conditions. Its purpose is transparency; formal analysis is based on paired participant-level change scores.

TABLE V: Participant-level severity profile before and after train...