arXiv: 2604.05292 · v2 · submitted 2026-04-07 · cs.CR · cs.AI · cs.SE

Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code

Dominik Blain, Maxime Noiseux

Pith reviewed 2026-05-10 20:06 UTC · model grok-4.3

keywords: AI code generation · security vulnerabilities · formal verification · Z3 solver · LLM security · CWE · static analysis · code artifacts

The pith

More than half of the code artifacts generated by popular AI models contain at least one COBALT-identified security vulnerability, and 1,055 of these findings are formally proven with Z3 satisfiability witnesses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to quantify the security risks in code produced by AI coding assistants by generating thousands of artifacts and subjecting them to formal verification rather than heuristic checks. This matters because these tools are used in production for security-sensitive tasks, yet their outputs may introduce exploitable flaws by default. The study covers seven LLMs and five common weakness categories, finding that 55.8 percent of artifacts have at least one COBALT-identified vulnerability, of which 1,055 are formally proven via Z3 satisfiability witnesses. No model scores better than a D grade, and auxiliary experiments show that explicit security instructions help little and that industry tools miss most of the proven findings.

Core claim

The authors generate 3,500 code samples by running 500 prompts (100 for each of five CWE categories) through seven LLMs, then pass each sample through the COBALT pipeline, which uses Z3 satisfiability witnesses to prove the presence of vulnerabilities. They report that 55.8% of all artifacts contain at least one COBALT-identified vulnerability, of which 1,055 are formally proven, and that GPT-4o performs worst (62.4%, grade F) while no model scores better than grade D.
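The headline counts can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted above; the variable names are illustrative, and the final ratio assumes that the 1,055 proven findings are counted per artifact, which the abstract does not state explicitly.

```python
from fractions import Fraction

# Figures quoted in the abstract.
MODELS = 7
CWE_CATEGORIES = 5
PROMPTS_PER_CWE = 100

prompts = CWE_CATEGORIES * PROMPTS_PER_CWE   # 500 prompts
artifacts = MODELS * prompts                 # 3,500 artifacts

# 55.8% of artifacts carry at least one COBALT-identified vulnerability.
vulnerable_rate = Fraction(558, 1000)
vulnerable_artifacts = vulnerable_rate * artifacts   # exactly 1953

# ASSUMPTION: the 1,055 Z3-proven findings are counted one per artifact.
z3_proven = 1055
proven_share = Fraction(z3_proven, int(vulnerable_artifacts))

print(prompts, artifacts, int(vulnerable_artifacts))
print(f"share of flagged artifacts with a Z3 witness: {float(proven_share):.1%}")
```

Under that assumption, roughly 54% of the flagged artifacts come with a solver witness; the remainder are COBALT-identified but not formally proven.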

What carries the argument

The COBALT analysis pipeline, which converts generated code into logical constraints solvable by the Z3 SMT solver to produce mathematical proofs of security vulnerabilities.
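To make the idea of a satisfiability witness concrete, here is a minimal sketch, not COBALT's actual encoding, which this page does not describe. It models a hypothetical allocation size `count * elem_size` computed in fixed-width unsigned arithmetic (a CWE-190-style check) and searches for an input where the wrapped result differs from the true product. An SMT solver such as Z3 would answer the same question symbolically over 32- or 64-bit values; exhaustive search over an 8-bit domain stands in for the solver here so the example needs only the standard library.

```python
from typing import Optional

def wrapped_mul(a: int, b: int, bits: int) -> int:
    """Unsigned multiplication truncated to `bits` bits, as in C."""
    return (a * b) & ((1 << bits) - 1)

def find_overflow_witness(elem_size: int, bits: int = 8) -> Optional[int]:
    """Return the smallest `count` whose size computation wraps, or None.

    The returned value plays the role of a satisfiability witness: a
    concrete input for which the "vulnerable" condition holds.
    """
    for count in range(1 << bits):
        if wrapped_mul(count, elem_size, bits) != count * elem_size:
            return count
    return None

witness = find_overflow_witness(elem_size=4)
print(witness, wrapped_mul(witness, 4, 8))
```

The witness is useful precisely because it is concrete: it can be replayed against the compiled code, for example under AddressSanitizer, to check that the modelled overflow actually occurs.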

If this is right

Explicit security instructions in prompts lower the vulnerability rate by only four percentage points.
Six combined industry analysis tools fail to detect 97.8% of the Z3-proven vulnerabilities.
LLMs can spot vulnerabilities in their own generated code 78.7% of the time during review but still produce them at the 55.8% rate without such review.
Six of seven representative findings are confirmed by runtime crashes under GCC AddressSanitizer.
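Two of the implications above are easier to judge as absolute numbers. The sketch below derives them from the quoted percentages; it assumes that the 4-point reduction is measured against the 55.8% overall rate, which the abstract implies but does not spell out, and the detected count is an estimate from rounding rather than a figure reported by the paper.

```python
# Figures quoted in the abstract.
z3_proven = 1055          # Z3-proven findings
miss_rate = 0.978         # share missed by six industry tools combined

# Estimated number of proven findings that at least one tool caught.
detected_estimate = z3_proven * (1 - miss_rate)

# ASSUMPTION: the 4-point drop is relative to the 55.8% baseline rate.
baseline_rate = 55.8
reduction_points = 4.0
relative_reduction = reduction_points / baseline_rate

print(f"findings caught by any tool: about {round(detected_estimate)}")
print(f"relative reduction from security prompts: {relative_reduction:.1%}")
```

On these assumptions the six tools together catch about 23 of the 1,055 proven findings, and explicit security instructions remove only about 7% of the baseline vulnerability rate.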

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the high rates hold across more diverse prompts, developers may need to treat AI-generated code as untrusted by default in security contexts.
The formal proof approach could be extended to other programming languages or additional weakness categories to broaden the assessment.
Models might improve if trained specifically to avoid generating code that triggers these Z3-provable flaws.

Load-bearing premise

The selected 500 prompts across five CWE categories represent typical real-world security-critical coding needs, and the COBALT pipeline produces no false positive proofs of vulnerabilities.

What would settle it

A follow-up experiment that uses a broader set of prompts or different formal verification tools, checking whether the vulnerability rate drops significantly below 55.8 percent and whether many Z3 witnesses fail to correspond to actual exploitable bugs in running code.
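One way to make "significantly below" operational is a confidence interval around the reported rate. The sketch below computes a 95% Wilson score interval for 1,953 vulnerable artifacts out of 3,500. It treats the artifacts as independent trials, which is an assumption: the same 500 prompts are reused across all seven models, so the true uncertainty is probably wider.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

# 55.8% of 3,500 artifacts is 1,953 vulnerable artifacts.
low, high = wilson_interval(1953, 3500)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

A replication on a broader prompt set that lands well under the lower bound, about 54%, would be evidence that the headline rate depends on the original prompt selection.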

Original abstract

AI coding assistants are now used to generate production code in security-sensitive domains, yet the exploitability of their outputs remains unquantified. We address this gap with Broken by Default: a formal verification study of 3,500 code artifacts generated by seven widely-deployed LLMs across 500 security-critical prompts (five CWE categories, 100 prompts each). Each artifact is subjected to the Z3 SMT solver via the COBALT analysis pipeline, producing mathematical satisfiability witnesses rather than pattern-based heuristics. Across all models, 55.8% of artifacts contain at least one COBALT-identified vulnerability; of these, 1,055 are formally proven via Z3 satisfiability witnesses. GPT-4o leads at 62.4% (grade F); Gemini 2.5 Flash performs best at 48.4% (grade D). No model achieves a grade better than D. Six of seven representative findings are confirmed with runtime crashes under GCC AddressSanitizer. Three auxiliary experiments show: (1) explicit security instructions reduce the mean rate by only 4 points; (2) six industry tools combined miss 97.8% of Z3-proven findings; and (3) models identify their own vulnerable outputs 78.7% of the time in review mode yet generate them at 55.8% by default.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports a formal verification study of security vulnerabilities in code generated by seven LLMs using 500 prompts across five CWE categories, resulting in 3,500 artifacts. Using the COBALT pipeline with Z3 SMT solver, it finds that 55.8% of artifacts contain at least one vulnerability, with 1,055 formally proven via satisfiability witnesses. It compares model performance (GPT-4o at 62.4% grade F, Gemini 2.5 Flash at 48.4% grade D), shows explicit security instructions reduce the mean rate by only 4 points, industry tools miss 97.8% of Z3-proven findings, and models identify their own vulnerable outputs 78.7% of the time in review mode.

Significance. If the Z3 encodings are sound and the prompt set representative, the results provide quantitative evidence that AI-generated code is frequently vulnerable by default, with no model performing better than grade D. The use of Z3 satisfiability witnesses for 1,055 findings and the multi-experiment design (including tool comparisons and self-review) are strengths that enable reproducible, solver-backed claims rather than heuristic detection.

major comments (3)

[Abstract and COBALT pipeline description] The headline claim that 1,055 vulnerabilities are 'formally proven via Z3 satisfiability witnesses' is load-bearing for the 55.8% rate and model rankings, yet no details are supplied on the Z3 encoding rules for pointer arithmetic, buffer bounds, command execution, or other language features. The limited runtime AddressSanitizer confirmation (only six of seven sampled findings) does not establish absence of false positives from over-approximation.
[Prompt selection and representativeness] The 500 prompts (100 per CWE across five categories) underpin the claim that results reflect security-critical coding tasks, but the manuscript provides no validation, comparison to real-world codebases, or sensitivity analysis to show that the elicited vulnerability distribution generalizes beyond this specific prompt set.
[Auxiliary experiments] Industry tool comparison experiment: The auxiliary result that six industry tools miss 97.8% of Z3-proven findings requires explicit description of tool configurations, invocation parameters, and whether they were applied to the identical set of generated artifacts to support the cross-tool claim.

minor comments (2)

The letter-grade assignments (F, D) for model performance are introduced without calibration to established security benchmarks or prior empirical studies on code vulnerability rates.
A consolidated table reporting vulnerability rates broken down by model and CWE category would improve clarity and allow readers to assess per-category variation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful and constructive review. The comments identify key areas where additional detail and clarification will strengthen the manuscript. We address each major comment below and have incorporated revisions accordingly.

Point-by-point responses

Referee: [Abstract and COBALT pipeline description] The headline claim that 1,055 vulnerabilities are 'formally proven via Z3 satisfiability witnesses' is load-bearing for the 55.8% rate and model rankings, yet no details are supplied on the Z3 encoding rules for pointer arithmetic, buffer bounds, command execution, or other language features. The limited runtime AddressSanitizer confirmation (only six of seven sampled findings) does not establish absence of false positives from over-approximation.

Authors: We agree that the Z3 encoding details are essential for substantiating the formal claims. In the revised manuscript we have expanded Section 3.2 (COBALT Pipeline) with explicit encoding rules for pointer arithmetic, buffer bounds, command execution, and other relevant C language features, including the precise SMT formulas used. We have also added a new subsection discussing soundness assumptions and potential over-approximations. The AddressSanitizer results are presented strictly as supplementary runtime evidence for a small sample of findings; we have clarified that they do not constitute a comprehensive false-positive audit and have noted the inherent limitations of dynamic testing relative to the Z3 witnesses.
Revision: yes

Referee: [Prompt selection and representativeness] The 500 prompts (100 per CWE across five categories) underpin the claim that results reflect security-critical coding tasks, but the manuscript provides no validation, comparison to real-world codebases, or sensitivity analysis to show that the elicited vulnerability distribution generalizes beyond this specific prompt set.

Authors: The prompts were systematically derived from the official CWE descriptions to target canonical vulnerability patterns. We acknowledge that the original submission lacked explicit validation against real-world distributions. In the revision we have added a paragraph in Section 4.1 that references empirical studies on CWE prevalence in open-source codebases and explains how our prompt templates align with those patterns. We have also included a sensitivity analysis in Appendix C that perturbs prompt phrasing and structure while measuring changes in vulnerability rates. A full-scale empirical comparison to a large production codebase remains outside the scope of this work but is noted as a limitation.
Revision: partial

Referee: [Auxiliary experiments] Industry tool comparison experiment: The auxiliary result that six industry tools miss 97.8% of Z3-proven findings requires explicit description of tool configurations, invocation parameters, and whether they were applied to the identical set of generated artifacts to support the cross-tool claim.

Authors: We have revised Section 5.2 to supply complete information: exact tool versions, configuration files, command-line invocation parameters, and an explicit statement that every tool was executed on the identical set of 3,500 generated artifacts. These additions make the 97.8% miss-rate claim fully reproducible and directly comparable.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements from external solver.

Full rationale

The paper's central claims consist of direct counts (55.8% of artifacts, 1,055 Z3 witnesses) obtained by executing the COBALT pipeline on 3,500 generated artifacts. No equations or steps reduce a claimed result to its own inputs by construction. No parameters are fitted on a data subset and then presented as predictions of closely related quantities. No self-citation chain is invoked to justify the quantitative findings themselves. The representativeness of the 500 prompts and the soundness of the Z3 encoding are external validity and soundness assumptions, not circular reductions within the derivation. The auxiliary experiments similarly report direct measurements rather than self-referential derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the correctness of the COBALT translation from code to SMT constraints and on the representativeness of the chosen prompts and CWE categories.

axioms (1)

domain assumption COBALT correctly encodes security properties as SMT formulas whose satisfiability corresponds to real exploitable vulnerabilities.
Invoked when claiming Z3 witnesses prove vulnerabilities.
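The axiom matters because a satisfiable formula only proves a bug if the formula matches the code. The sketch below is a hypothetical illustration, not taken from the paper: a guarded size computation is modelled twice, once with the guard and once without. The over-approximating model yields "witnesses" that the concrete function would reject, which is the false-positive risk the referee raises. Exhaustive search over 8-bit values again stands in for an SMT solver, and all names are invented for the example.

```python
BITS = 8
MASK = (1 << BITS) - 1
ELEM_SIZE = 4
MAX_COUNT = 16   # hypothetical guard in the code under analysis

def concrete_alloc_size(count: int):
    """Concrete behaviour: the guard rejects large counts before multiplying."""
    if count > MAX_COUNT:
        return None
    return (count * ELEM_SIZE) & MASK

def overflows(count: int) -> bool:
    """True if the fixed-width product differs from the exact product."""
    return ((count * ELEM_SIZE) & MASK) != count * ELEM_SIZE

def witnesses(respect_guard: bool) -> list:
    """All inputs the model considers vulnerable, with or without the guard."""
    return [
        c for c in range(1 << BITS)
        if overflows(c) and (c <= MAX_COUNT or not respect_guard)
    ]

loose = witnesses(respect_guard=False)    # over-approximate model
precise = witnesses(respect_guard=True)   # model that encodes the guard
# Replay each loose witness against the concrete function.
confirmed = [c for c in loose if concrete_alloc_size(c) is not None]

print(len(loose), len(precise), len(confirmed))
```

Here the loose model reports 192 candidate inputs while the precise model and the replay step confirm none, which is why the encoding rules and a replay-based audit bear directly on how the 1,055 count should be read.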

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Each artifact is subjected to the Z3 SMT solver via the COBALT analysis pipeline, producing mathematical satisfiability witnesses rather than pattern-based heuristics.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Across all models, 55.8% of artifacts contain at least one COBALT-identified vulnerability; of these, 1,055 are formally proven via Z3 satisfiability witnesses.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mythos and the Unverified Cage: Z3-Based Pre-Deployment Verification for Frontier-Model Sandbox Infrastructure
cs.CR · 2026-04 · unverdicted · novelty 4.0

COBALT applies Z3 to detect CWE-190/191/195 arithmetic vulnerabilities in C/C++ sandbox code with validated case studies on NASA and other systems and proposes a pre-deployment verification layer for frontier AI containment.

Reference graph

Works this paper leans on

13 extracted references · 3 canonical work pages · cited by 1 Pith paper

[1] CIO Dive. (2024). GitHub Copilot drives revenue growth amid subscriber base expansion. https://www.ciodive.com/news/github-copilot-subscriber-count-revenue-growth/706201/

[2] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? Assessing the security of GitHub Copilot’s code contributions,” IEEE Symposium on Security and Privacy (S&P), pp. 754–768, 2022.

[3] G. Sandoval, H. Pearce, T. Nys, R. Karri, B. Dolan-Gavitt, and S. Garg, “Lost at C: A user study on the security implications of large language model code assistants,” USENIX Security Symposium, 2023.

[4] C. Tony, M. Mutas, N.E.D. Ferreyra, and R. Scandariato, “LLMSecEval: A dataset of natural language prompts for security evaluations,” IEEE/ACM Mining Software Repositories (MSR), 2023.

[5] N. Perry, M. Srivastava, D. Kumar, and D. Boneh, “Do users write more insecure code with AI assistants?” ACM CCS, 2023. DOI: 10.1145/3576915.3623157

[6] L. de Moura and N. Bjørner, “Z3: An efficient SMT solver,” TACAS, LNCS 4963, pp. 337–340, 2008.

[7] MITRE. (2024). CWE-131: Incorrect Calculation of Buffer Size. https://cwe.mitre.org/data/definitions/131.html

[8] MITRE. (2024). CWE-190: Integer Overflow or Wraparound. https://cwe.mitre.org/data/definitions/190.html

[9] K. Serebryany, D. Bruening, A. Potapenko, and D. Vyukov, “AddressSanitizer: A fast address sanity checker,” USENIX ATC, 2012.

[10] MITRE. (2024). CWE-916: Use of Password Hash With Insufficient Computational Effort. https://cwe.mitre.org/data/definitions/916.html

[11] M. Bhatt et al., “CyberSecEval: A Comprehensive Evaluation Framework for Measuring Cybersecurity Risks of Large Language Models,” arXiv:2312.04724, 2024.

[12] M.L. Siddiq and J.C.S. Santos, “SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques,” ACM MSR4PS Workshop, 2022.

[13] J. He and M. Vechev, “Large Language Models for Code: Security Hardening and Adversarial Testing,” ACM CCS, pp. 1865–1879, 2023. DOI: 10.1145/3576915.3623175