SoK: AI Secure Code Generation: Progress, Pitfalls, and Paths Forward

Haipeng Cai; Hongxin Hu; Keyan Guo; Rupam Patir

arxiv: 2606.25195 · v1 · pith:2UCBI3BVnew · submitted 2026-06-23 · 💻 cs.CR · cs.AI


Rupam Patir, Keyan Guo, Haipeng Cai, Hongxin Hu

Pith reviewed 2026-06-25 22:39 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords secure code generation · AI code models · knowledge-actuation gap · secure coding principles · code security · large language models · systematization of knowledge


The pith

AI models recognize secure coding principles in text yet frequently fail to translate that recognition into secure and functional code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks what current AI systems can and cannot do when generating secure code and why failures persist despite various prompting, fine-tuning, and agentic techniques. It introduces a three-level framework that separately measures a model's natural-language understanding of secure coding principles, its ability to actuate those principles in generated code, and the gaps between the two. Experiments across function-level and full web-application benchmarks show that principle understanding is a statistically strong predictor of functional correctness, security, and joint functional-security correctness. Substantial actuation gaps remain even when models correctly identify the relevant principles. These results provide a principle-centered view of the field's current state and point to concrete directions for improvement.

Core claim

The paper establishes that secure-coding-principle understanding is a statistically strong predictor of code-level outcomes, including functional correctness, security, and joint functional-security correctness. Yet substantial knowledge-actuation gaps remain: models can recognize relevant security principles but still fail to translate them into secure and functional code. This pattern holds across both isolated function-level security benchmarks and full web-application security benchmarks.

What carries the argument

A three-level framework that measures natural-language understanding of secure coding principles, code-level actuation of those principles during generation, and the knowledge-actuation gaps between the two.
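To make the relationship between the three levels concrete, the sketch below tallies per-task records into an understanding rate, an actuation rate, and a gap rate. It is a minimal illustration, not the paper's scoring procedure: the `TaskResult` fields, the `summarize` function, and the definition of the gap as "principle recognized, but code not both functional and secure" are assumptions made here, and the four records are made-up toy data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One benchmark task evaluated for one model (hypothetical record layout)."""
    understands_principle: bool  # level 1: natural-language probe answered correctly
    functional: bool             # level 2: generated code passes functional tests
    secure: bool                 # level 2: generated code passes security tests


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Return understanding, actuation, and gap rates for a non-empty result list."""
    if not results:
        raise ValueError("results must be non-empty")
    n = len(results)
    understood = [r for r in results if r.understands_principle]
    actuated = [r for r in results if r.functional and r.secure]
    # Level 3 (assumed definition): the principle was recognized, but the
    # generated code is not both functional and secure.
    gap = [r for r in understood if not (r.functional and r.secure)]
    return {
        "understanding_rate": len(understood) / n,
        "actuation_rate": len(actuated) / n,
        "gap_rate_given_understanding": len(gap) / len(understood) if understood else 0.0,
    }


# Toy data, invented for illustration only.
results = [
    TaskResult(understands_principle=True, functional=True, secure=True),
    TaskResult(understands_principle=True, functional=True, secure=False),
    TaskResult(understands_principle=True, functional=False, secure=False),
    TaskResult(understands_principle=False, functional=True, secure=False),
]
summary = summarize(results)
print(summary)
```

On this toy data, three of the four tasks show principle understanding, one is actuated into functional and secure code, and two of the three understood tasks fall into the gap.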

If this is right

Principle understanding can serve as an early indicator for expected functional and security quality in generated code.
Principle-guided generation techniques could narrow the observed knowledge-actuation gaps.
Evaluation and benchmarking should separately track understanding, actuation, and gaps rather than only final code properties.
Agentic workflows can be designed to explicitly surface and enforce relevant secure coding principles during code production.
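One simple form that principle-guided generation could take is to place the relevant secure coding principles in front of the task before the model writes code. The sketch below is a hypothetical illustration of that idea only: the function name, the prompt wording, and the example principles are assumptions, not the prompting method evaluated in the paper.

```python
def build_principle_guided_prompt(task: str, principles: list[str]) -> str:
    """Prepend numbered secure coding principles to a code-generation task.

    Hypothetical helper: the prompt wording is an assumption for illustration.
    """
    if not principles:
        # Nothing to surface, so fall back to the plain task description.
        return task
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(principles, start=1))
    return (
        "Follow these secure coding principles when writing the code:\n"
        f"{numbered}\n\n"
        f"Task: {task}"
    )


# Example principles and task are invented for illustration.
prompt = build_principle_guided_prompt(
    "Write a function that looks up a user by name in a SQL database.",
    [
        "Use parameterized queries instead of string concatenation.",
        "Validate all externally supplied input.",
    ],
)
print(prompt)
```

An agentic workflow could use the same list a second time, as a checklist against which the generated code is reviewed before it is accepted.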

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the gaps prove consistent across additional languages and domains, training objectives that emphasize actuation over pure understanding may become necessary.
Current results imply that real-world deployment of AI code generators could still introduce security issues even when models appear to know the rules.
The framework could be extended to measure how quickly gaps close under targeted fine-tuning or reinforcement learning on actuation examples.

Load-bearing premise

The chosen benchmarks for isolated function-level security and full web-application security are representative enough to support general claims about knowledge-actuation gaps across the field.

What would settle it

Repeating the three-level evaluation on a fresh benchmark set that uses different security scenarios, languages, or application types and finding that principle understanding no longer statistically predicts code outcomes or that actuation gaps shrink to negligible size.
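One way such a replication could check whether understanding still predicts outcomes is a permutation test on the difference in joint success rates between tasks where the principle was understood and tasks where it was not. The sketch below is an illustration under stated assumptions, not the statistical procedure used in the paper, which the abstract does not specify: the function names, the choice of test, and the toy labels are all hypothetical.

```python
import random


def success_rate_difference(understood: list[bool], success: list[bool]) -> float:
    """Joint-success rate among understood tasks minus the rate among the rest."""
    hit = [s for u, s in zip(understood, success) if u]
    miss = [s for u, s in zip(understood, success) if not u]
    if not hit or not miss:
        raise ValueError("need at least one task in each understanding group")
    return sum(hit) / len(hit) - sum(miss) / len(miss)


def permutation_p_value(
    understood: list[bool],
    success: list[bool],
    rounds: int = 2000,
    seed: int = 0,
) -> tuple[float, float]:
    """Return the observed difference and a one-sided permutation p-value."""
    rng = random.Random(seed)
    observed = success_rate_difference(understood, success)
    shuffled = list(success)
    at_least_as_large = 0
    for _ in range(rounds):
        # Shuffling outcomes breaks any link between understanding and success.
        rng.shuffle(shuffled)
        if success_rate_difference(understood, shuffled) >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return observed, (at_least_as_large + 1) / (rounds + 1)


# Toy labels, invented for illustration only.
understood = [True] * 6 + [False] * 6
success = [True, True, True, True, True, False,
           False, False, False, False, False, True]
observed, p_value = permutation_p_value(understood, success)
print(f"observed difference = {observed:.3f}, p = {p_value:.4f}")
```

A small p-value on a fresh benchmark would be consistent with the paper's predictor claim; a large one, or a gap rate near zero, would count against it.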

Figures

Figures reproduced from arXiv: 2606.25195 by Haipeng Cai, Hongxin Hu, Keyan Guo, Rupam Patir.

**Figure 1.** Layer-1 secure-coding knowledge by reasoning dimension, source catalog, and model. Bar colors indicate the four reasoning dimensions: ...

**Figure 2.** Knowledge–actuation gap by benchmark and intervention, using the same model/method slices as Table 3.

**Figure 3.** Layer-3 outcome distribution by system, method, and benchmark, using the same model/method slices as Table 3. Each vertical stacked bar covers ...

**Figure 4.** SCP-guided prompting as a path-forward intervention. Each panel tracks Base ...

Original abstract

The increasing use of AI systems for code generation raises a central security question: what can today's models and coding agents actually do to produce secure code, where do they still fail, and what would move the field forward? Existing work has explored prompting, fine-tuning, reinforcement learning, and agentic workflows for secure code generation, but the field still lacks a systematic understanding of how these techniques improve security and why substantial failures persist. In this SoK, we systematize the progress, pitfalls, and paths forward for AI secure code generation. We introduce a three-level framework that measures models' natural-language understanding of secure coding principles, their code-level actuation of those principles during generation, and the knowledge-actuation gaps between the two. We instantiate this framework across models and coding agents on benchmarks covering both isolated function-level security and full web-application security. Our results show that secure-coding-principle understanding is a statistically strong predictor of code-level outcomes, including functional correctness, security, and joint functional-security correctness. Yet substantial knowledge-actuation gaps remain: models can recognize relevant security principles but still fail to translate them into secure and functional code. These findings offer a principle-centered account of where AI secure code generation stands today and identify concrete paths forward through principle-guided generation, evaluation, benchmarking, and agentic workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This SoK introduces a three-level framework linking secure-coding understanding to code outcomes, but its general claims rest on narrow benchmarks.


The main thing here is a new three-level framework that measures models' natural-language understanding of secure coding principles, their actuation of those principles in generated code, and the gaps between the two. They apply it to existing models and agents on function-level and web-app benchmarks and report that understanding is a statistically strong predictor of functional correctness, security, and joint outcomes, while substantial actuation gaps remain.

The paper does a clean job of pulling together prompting, fine-tuning, reinforcement learning, and agentic work into one structure. The principle-centered account gives a clearer way to explain persistent failures than just listing success rates. That framing is distinct from prior surveys and could help organize how people design evaluations going forward.

The soft spot is the empirical scope. Results are reported only for isolated function-level security and full web-application security. These domains leave out cryptographic primitives, memory-safety systems code, and embedded constraints. If the predictor strength or gap sizes shift across those settings, the claim that understanding predicts outcomes and gaps persist cannot be read as field-wide without additional justification. The abstract mentions statistical relationships but provides no visible details on controls, test selection, or benchmark construction, so the strength of the central findings is hard to assess from what is shown.

This is for researchers working on secure AI code generation who need a way to structure their own experiments or benchmarks. A reader in that subfield would get practical value from the framework even if the empirical section needs expansion. The paper is coherent on its own terms and engages the literature directly.

I would send it to peer review. The framework is concrete and new enough that referees should see it, with the expectation that revisions address domain coverage.

Referee Report

1 major / 0 minor

Summary. This SoK introduces a three-level framework (natural-language understanding of secure coding principles, code-level actuation of those principles, and the resulting knowledge-actuation gap) and instantiates it on models and agents using benchmarks for isolated function-level security and full web-application security. The central empirical finding is that understanding is a statistically strong predictor of functional correctness, security, and joint outcomes, yet substantial actuation gaps persist.

Significance. If the reported correlations and gap sizes hold under the chosen evaluation protocol, the framework supplies a principle-centered lens for diagnosing why current techniques still fail at secure code generation and for guiding future work on principle-guided generation, evaluation, and agentic workflows. The explicit separation of understanding from actuation is a clear organizational contribution to the SoK literature.

major comments (1)

[Abstract / Results] Abstract and results sections: the claim that understanding is a 'statistically strong predictor' of code-level outcomes and that 'substantial knowledge-actuation gaps remain' is instantiated only on function-level and web-application benchmarks. The manuscript does not provide cross-domain validation (e.g., cryptographic primitives, memory-safety-heavy systems code, or embedded constraints), so the field-wide framing of the predictor relationship and the gap conclusion rests on an untested assumption of domain representativeness.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the scope of our empirical claims. We address it directly below.


Referee: [Abstract / Results] Abstract and results sections: the claim that understanding is a 'statistically strong predictor' of code-level outcomes and that 'substantial knowledge-actuation gaps remain' is instantiated only on function-level and web-application benchmarks. The manuscript does not provide cross-domain validation (e.g., cryptographic primitives, memory-safety-heavy systems code, or embedded constraints), so the field-wide framing of the predictor relationship and the gap conclusion rests on an untested assumption of domain representativeness.

Authors: The empirical instantiation is performed on the two benchmark categories that dominate the existing literature on AI secure code generation: isolated function-level security and full web-application security. These are the domains for which standardized, reproducible evaluation protocols currently exist and have been used in the majority of prior studies. The three-level framework is presented as domain-agnostic, but the reported statistical relationships and gap sizes are explicitly tied to the chosen benchmarks. We do not assert that the precise correlation coefficients or gap magnitudes hold outside these domains. To clarify the boundaries of the claims, we will revise the abstract, introduction, and results sections to replace broad phrasing with language that anchors the findings to the evaluated benchmarks, and we will add a dedicated limitations paragraph in the discussion that notes the absence of cross-domain validation (e.g., cryptographic primitives or memory-safety-heavy code) and identifies this as an important direction for future work. This change qualifies the framing without altering the core empirical results or the organizational contribution of the framework.

revision: partial

Circularity Check

0 steps flagged

No circularity; framework and empirical results are independent


The paper is an SoK that introduces a three-level framework (NL understanding of principles, code-level actuation, and knowledge-actuation gaps) independently of any results. It then applies the framework to report measured correlations and gaps on external benchmarks for function-level and web-app security. No equations, fitted parameters, or self-citations reduce the central claims to inputs by construction; the predictor relationship is an observed statistical outcome, not a definitional tautology. The derivation is self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper's central contribution is the introduction of a new evaluation framework rather than derivation from prior axioms or data fits.

invented entities (1)

three-level framework (understanding, actuation, knowledge-actuation gap) · no independent evidence
purpose: To measure and relate natural-language understanding of secure coding principles to code-level security and correctness outcomes
Newly introduced in this SoK to systematize existing techniques


