Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code
Pith reviewed 2026-05-10 20:06 UTC · model grok-4.3
The pith
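More than half of the code artifacts generated by seven widely deployed LLMs contain at least one COBALT-identified security vulnerability, and 1,055 of those findings are formally proven with Z3 satisfiability witnesses.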
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors generate 3,500 code artifacts by running 500 prompts (five CWE categories, 100 prompts each) through seven LLMs, then analyze each artifact with the COBALT pipeline, which uses the Z3 SMT solver to produce satisfiability witnesses for vulnerabilities. They report that 55.8% of artifacts contain at least one COBALT-identified vulnerability, that 1,055 findings are formally proven by a Z3 witness, and that no model earns better than a grade D: GPT-4o has the highest rate (62.4%, grade F) and Gemini 2.5 Flash the lowest (48.4%, grade D).
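The headline percentages can be turned into approximate counts with a few lines of arithmetic. The sketch below uses only the figures quoted on this page; the derived numbers are rounded estimates, not values reported by the paper, and the ratio of proven findings to vulnerable artifacts assumes at most one proven finding per artifact, which the abstract does not state.

```python
# Back-of-envelope counts implied by the reported figures.
# Inputs come from the abstract; outputs are rounded estimates only.

PROMPTS = 500
MODELS = 7
VULNERABLE_RATE = 0.558   # share of artifacts with at least one finding
Z3_PROVEN = 1055          # findings backed by a Z3 satisfiability witness
TOOL_MISS_RATE = 0.978    # share of Z3-proven findings missed by six tools

artifacts = PROMPTS * MODELS
vulnerable_artifacts = round(artifacts * VULNERABLE_RATE)

# Assumption: each proven finding sits in a distinct artifact.
proven_share = Z3_PROVEN / vulnerable_artifacts

# Findings that at least one industry tool would have caught.
tool_detected = round(Z3_PROVEN * (1 - TOOL_MISS_RATE))

print(f"artifacts generated:            {artifacts}")
print(f"artifacts flagged (estimate):   {vulnerable_artifacts}")
print(f"proven / flagged (estimate):    {proven_share:.1%}")
print(f"proven findings seen by tools:  about {tool_detected}")
```

Read this way, roughly 1,950 artifacts are flagged, a little over half of the flags carry a Z3 witness, and the six tools together catch only about two dozen of the proven findings.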
What carries the argument
The COBALT analysis pipeline, which converts generated code into logical constraints solvable by the Z3 SMT solver to produce mathematical proofs of security vulnerabilities.
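To make "satisfiability witness" concrete, here is a toy sketch of the idea for an integer wraparound in a buffer-size calculation (the CWE-190 pattern). It is not COBALT's encoding, which this page does not describe, and it does not call Z3: it searches an 8-bit value space exhaustively as a stand-in for what an SMT solver would decide symbolically. The names `Witness` and `find_overflow_witness` are hypothetical.

```python
from typing import NamedTuple, Optional

WIDTH = 8                 # toy bit-width so exhaustive search is instant
MODULUS = 1 << WIDTH      # unsigned arithmetic wraps modulo 2**WIDTH


class Witness(NamedTuple):
    count: int       # attacker-controlled element count
    allocated: int   # bytes requested after the multiplication wraps
    required: int    # bytes the copy loop would actually write


def find_overflow_witness(elem_size: int) -> Optional[Witness]:
    """Look for a count where count * elem_size wraps at WIDTH bits.

    Models code of the form: buf = malloc(count * elem_size), followed by
    writes to count elements. A returned Witness is a concrete input for
    which the allocation is smaller than the bytes written.
    """
    for count in range(MODULUS):
        required = count * elem_size          # true mathematical product
        allocated = required % MODULUS        # what fixed-width code computes
        if allocated < required:
            return Witness(count, allocated, required)
    return None  # plays the role of "unsat": no input triggers the wrap


print(find_overflow_witness(4))   # a 4-byte element type
print(find_overflow_witness(1))   # a 1-byte element type cannot wrap here
```

A returned witness is a checkable claim: anyone can plug the concrete `count` back into the code under test and observe the undersized allocation. Whether such a witness corresponds to an exploitable bug still depends on the encoding being faithful to the real program, which is the soundness assumption discussed further down the page.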
If this is right
- Explicit security instructions in prompts lower the vulnerability rate by only four percentage points.
- Six combined industry analysis tools fail to detect 97.8% of the Z3-proven vulnerabilities.
- LLMs can spot vulnerabilities in their own generated code 78.7% of the time during review but still produce them at the 55.8% rate without such review.
- Six of seven representative vulnerabilities cause runtime crashes when tested with address sanitizers.
Where Pith is reading between the lines
- If the high rates hold across more diverse prompts, developers may need to treat AI-generated code as untrusted by default in security contexts.
- The formal proof approach could be extended to other programming languages or additional weakness categories to broaden the assessment.
- Models might improve if trained specifically to avoid generating code that triggers these Z3-provable flaws.
Load-bearing premise
The selected 500 prompts across five CWE categories represent typical real-world security-critical coding needs, and the COBALT pipeline produces no false positive proofs of vulnerabilities.
What would settle it
A follow-up experiment that uses a broader set of prompts or different formal verification tools to check whether the vulnerability rate drops significantly below 55 percent, or whether many Z3 witnesses fail to correspond to actual exploitable bugs in running code.
Original abstract
AI coding assistants are now used to generate production code in security-sensitive domains, yet the exploitability of their outputs remains unquantified. We address this gap with Broken by Default: a formal verification study of 3,500 code artifacts generated by seven widely-deployed LLMs across 500 security-critical prompts (five CWE categories, 100 prompts each). Each artifact is subjected to the Z3 SMT solver via the COBALT analysis pipeline, producing mathematical satisfiability witnesses rather than pattern-based heuristics. Across all models, 55.8% of artifacts contain at least one COBALT-identified vulnerability; of these, 1,055 are formally proven via Z3 satisfiability witnesses. GPT-4o leads at 62.4% (grade F); Gemini 2.5 Flash performs best at 48.4% (grade D). No model achieves a grade better than D. Six of seven representative findings are confirmed with runtime crashes under GCC AddressSanitizer. Three auxiliary experiments show: (1) explicit security instructions reduce the mean rate by only 4 points; (2) six industry tools combined miss 97.8% of Z3-proven findings; and (3) models identify their own vulnerable outputs 78.7% of the time in review mode yet generate them at 55.8% by default.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a formal verification study of security vulnerabilities in code generated by seven LLMs using 500 prompts across five CWE categories, resulting in 3,500 artifacts. Using the COBALT pipeline with the Z3 SMT solver, it finds that 55.8% of artifacts contain at least one vulnerability, with 1,055 findings formally proven via satisfiability witnesses. It compares model performance (GPT-4o at 62.4%, grade F; Gemini 2.5 Flash at 48.4%, grade D) and reports that explicit security instructions reduce the mean rate by only 4 points, that six industry tools combined miss 97.8% of Z3-proven findings, and that models identify their own vulnerable outputs 78.7% of the time in review mode.
Significance. If the Z3 encodings are sound and the prompt set representative, the results provide quantitative evidence that AI-generated code is frequently vulnerable by default, with no model performing better than grade D. The use of Z3 satisfiability witnesses for 1,055 findings and the multi-experiment design (including tool comparisons and self-review) are strengths that enable reproducible, solver-backed claims rather than heuristic detection.
major comments (3)
- [Abstract and COBALT pipeline description] The headline claim that 1,055 vulnerabilities are 'formally proven via Z3 satisfiability witnesses' is load-bearing for the 55.8% rate and model rankings, yet no details are supplied on the Z3 encoding rules for pointer arithmetic, buffer bounds, command execution, or other language features. The limited runtime AddressSanitizer confirmation (only six of seven sampled findings) does not establish absence of false positives from over-approximation.
- [Prompt selection and representativeness] The 500 prompts (100 per CWE across five categories) underpin the claim that results reflect security-critical coding tasks, but the manuscript provides no validation, comparison to real-world codebases, or sensitivity analysis to show that the elicited vulnerability distribution generalizes beyond this specific prompt set.
- [Auxiliary experiments: industry tool comparison] The auxiliary result that six industry tools miss 97.8% of Z3-proven findings requires explicit description of tool configurations, invocation parameters, and whether they were applied to the identical set of generated artifacts to support the cross-tool claim.
minor comments (2)
- The letter-grade assignments (F, D) for model performance are introduced without calibration to established security benchmarks or prior empirical studies on code vulnerability rates.
- A consolidated table reporting vulnerability rates broken down by model and CWE category would improve clarity and allow readers to assess per-category variation.
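The consolidated table requested in the last comment is a simple group-by over per-artifact results. The sketch below shows one way to build it, assuming each artifact is recorded as a (model, CWE, vulnerable) triple; the record layout, the labels, and the sample rows are placeholders invented for illustration and are not the paper's data.

```python
from collections import defaultdict

# Placeholder records, one per generated artifact: (model, cwe, is_vulnerable).
# Labels and values are invented for illustration only.
RECORDS = [
    ("model_a", "CWE-A", True),
    ("model_a", "CWE-A", False),
    ("model_a", "CWE-B", True),
    ("model_b", "CWE-A", False),
    ("model_b", "CWE-A", False),
    ("model_b", "CWE-B", True),
    ("model_b", "CWE-B", True),
]


def rate_table(records):
    """Return {(model, cwe): fraction of artifacts flagged vulnerable}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model, cwe, vulnerable in records:
        totals[(model, cwe)] += 1
        hits[(model, cwe)] += int(vulnerable)
    return {key: hits[key] / totals[key] for key in totals}


def format_table(rates):
    """Render rates as a plain-text grid: models as rows, CWEs as columns."""
    models = sorted({m for m, _ in rates})
    cwes = sorted({c for _, c in rates})
    lines = ["model".ljust(10) + "".join(c.rjust(8) for c in cwes)]
    for m in models:
        cells = (f"{rates[(m, c)]:.0%}" if (m, c) in rates else "n/a" for c in cwes)
        lines.append(m.ljust(10) + "".join(cell.rjust(8) for cell in cells))
    return "\n".join(lines)


print(format_table(rate_table(RECORDS)))
```

With the real 3,500 records in place of the placeholders, each cell would summarize 100 artifacts, which makes per-category variation visible alongside the per-model averages.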
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review. The comments identify key areas where additional detail and clarification will strengthen the manuscript. We address each major comment below and have incorporated revisions accordingly.
Point-by-point responses
- Referee: [Abstract and COBALT pipeline description] The headline claim that 1,055 vulnerabilities are 'formally proven via Z3 satisfiability witnesses' is load-bearing for the 55.8% rate and model rankings, yet no details are supplied on the Z3 encoding rules for pointer arithmetic, buffer bounds, command execution, or other language features. The limited runtime AddressSanitizer confirmation (only six of seven sampled findings) does not establish absence of false positives from over-approximation.
Authors: We agree that the Z3 encoding details are essential for substantiating the formal claims. In the revised manuscript we have expanded Section 3.2 (COBALT Pipeline) with explicit encoding rules for pointer arithmetic, buffer bounds, command execution, and other relevant C language features, including the precise SMT formulas used. We have also added a new subsection discussing soundness assumptions and potential over-approximations. The AddressSanitizer results are presented strictly as supplementary runtime evidence for a small sample of findings; we have clarified that they do not constitute a comprehensive false-positive audit and have noted the inherent limitations of dynamic testing relative to the Z3 witnesses. Revision: yes.
- Referee: [Prompt selection and representativeness] The 500 prompts (100 per CWE across five categories) underpin the claim that results reflect security-critical coding tasks, but the manuscript provides no validation, comparison to real-world codebases, or sensitivity analysis to show that the elicited vulnerability distribution generalizes beyond this specific prompt set.
Authors: The prompts were systematically derived from the official CWE descriptions to target canonical vulnerability patterns. We acknowledge that the original submission lacked explicit validation against real-world distributions. In the revision we have added a paragraph in Section 4.1 that references empirical studies on CWE prevalence in open-source codebases and explains how our prompt templates align with those patterns. We have also included a sensitivity analysis in Appendix C that perturbs prompt phrasing and structure while measuring changes in vulnerability rates. A full-scale empirical comparison to a large production codebase remains outside the scope of this work but is noted as a limitation. Revision: partial.
- Referee: [Auxiliary experiments: industry tool comparison] The auxiliary result that six industry tools miss 97.8% of Z3-proven findings requires explicit description of tool configurations, invocation parameters, and whether they were applied to the identical set of generated artifacts to support the cross-tool claim.
Authors: We have revised Section 5.2 to supply complete information: exact tool versions, configuration files, command-line invocation parameters, and an explicit statement that every tool was executed on the identical set of 3,500 generated artifacts. These additions make the 97.8% miss-rate claim fully reproducible and directly comparable. Revision: yes.
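One reading of "six tools combined miss 97.8%" is that a proven finding counts as detected if any tool reports it, so the miss rate is taken over the union of tool outputs. The sketch below computes the figure under that assumption; the finding IDs, tool names, and the function `combined_miss_rate` are hypothetical, and the paper may define the metric differently.

```python
def combined_miss_rate(proven_ids, tool_findings):
    """Fraction of proven findings that no tool reported.

    proven_ids:    iterable of IDs for solver-proven findings
    tool_findings: dict mapping tool name -> set of finding IDs it reported
    """
    proven = set(proven_ids)
    if not proven:
        raise ValueError("need at least one proven finding")
    # A finding is detected if it appears in the union of all tool reports.
    detected_by_any = set().union(*tool_findings.values()) & proven
    return 1 - len(detected_by_any) / len(proven)


# Toy data: five proven findings, two tools, one overlapping detection.
proven = ["f1", "f2", "f3", "f4", "f5"]
tools = {
    "tool_a": {"f1"},
    "tool_b": {"f1", "f9"},   # f9 is not a proven finding and is ignored
}
print(f"combined miss rate: {combined_miss_rate(proven, tools):.1%}")
```

The calculation is only meaningful if every tool ran on the same artifacts as the solver, which is why the referee asks for that to be stated explicitly.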
Circularity Check
No significant circularity; results are direct empirical measurements from external solver.
Full rationale
The paper's central claims consist of direct counts (55.8% of artifacts, 1,055 Z3 witnesses) obtained by executing the COBALT pipeline on 3,500 generated artifacts. No equations or steps reduce a claimed result to its own inputs by construction. No parameters are fitted on a data subset and then presented as predictions of closely related quantities. No self-citation chain is invoked to justify the quantitative findings themselves. The representativeness of the 500 prompts and the soundness of the Z3 encoding are external validity and soundness assumptions, not circular reductions within the derivation. The auxiliary experiments similarly report direct measurements rather than self-referential derivations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: COBALT correctly encodes security properties as SMT formulas whose satisfiability corresponds to real exploitable vulnerabilities.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
Tag: unclear (relation between the paper passage and the cited Recognition theorem).
Passage: "Each artifact is subjected to the Z3 SMT solver via the COBALT analysis pipeline, producing mathematical satisfiability witnesses rather than pattern-based heuristics."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
Tag: unclear (relation between the paper passage and the cited Recognition theorem).
Passage: "Across all models, 55.8% of artifacts contain at least one COBALT-identified vulnerability; of these, 1,055 are formally proven via Z3 satisfiability witnesses."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Mythos and the Unverified Cage: Z3-Based Pre-Deployment Verification for Frontier-Model Sandbox Infrastructure
COBALT applies Z3 to detect CWE-190/191/195 arithmetic vulnerabilities in C/C++ sandbox code, with validated case studies on NASA and other systems, and proposes a pre-deployment verification layer for frontier AI containment.