Is Your Prompt Poisoning Code? Defect Induction Rates and Security Mitigation Strategies
Pith reviewed 2026-05-18 03:50 UTC · model grok-4.3
The pith
Lower prompt quality markedly increases the rate at which LLMs generate insecure code, while techniques like chain-of-thought reduce the risk.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that prompt normativity, measured by a three-dimensional quality framework (goal clarity, information completeness, and logical consistency), directly affects code security. As normativity drops from level L3 to L0 in the CWE-BENCH-PYTHON tasks, the frequency of generated insecure code rises consistently across the tested models. Chain-of-thought and self-correction prompting substantially lower those defect rates even when the underlying prompt remains low-normativity.
What carries the argument
The three-dimensional prompt quality framework (goal clarity, information completeness, logical consistency) used to assign prompts to four normativity levels (L0-L3) inside the CWE-BENCH-PYTHON benchmark dataset.
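To make the framework concrete, here is a minimal sketch of how a prompt's scores on the three dimensions might be turned into a level. The mapping used here (level equals the number of satisfied dimensions) is purely an illustrative assumption; the abstract does not state how the paper derives L0-L3 from the three dimensions, and the names `PromptQuality` and `normativity_level` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptQuality:
    """Binary judgments on the three dimensions named in the paper."""
    goal_clarity: bool
    information_completeness: bool
    logical_consistency: bool


def normativity_level(quality: PromptQuality) -> str:
    """Map dimension judgments to a level label.

    ASSUMPTION: the level is the count of satisfied dimensions (0-3).
    The paper's actual assignment rule may differ.
    """
    satisfied = sum(
        [
            quality.goal_clarity,
            quality.information_completeness,
            quality.logical_consistency,
        ]
    )
    return f"L{satisfied}"


if __name__ == "__main__":
    vague_prompt = PromptQuality(
        goal_clarity=False,
        information_completeness=False,
        logical_consistency=True,
    )
    print(normativity_level(vague_prompt))
```

A real rubric would probably use graded rather than binary scores; the binary version only shows the shape of the mapping.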
If this is right
- Raising prompt normativity offers a direct way to lower security defects in LLM-generated code without retraining models.
- Chain-of-thought and self-correction prompting can be applied as standard safeguards when users must work with incomplete or unclear task descriptions.
- The CWE-BENCH-PYTHON dataset supplies a reusable testbed for measuring how prompt changes affect code security across different models.
- Security evaluations of code generators should include prompt-quality controls in addition to model architecture tests.
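The two safeguards mentioned above can be sketched as thin wrappers around whatever model call a tool already makes. This is an assumption-laden illustration: the instruction wording, the names `with_chain_of_thought` and `self_correct`, and the `generate` callable (standing in for any LLM client) are hypothetical and are not the prompts used in the paper.

```python
from typing import Callable

# ASSUMPTION: illustrative wording, not the paper's actual template.
COT_SUFFIX = (
    "Before writing code, reason step by step about the inputs, "
    "the trust boundaries, and any security weaknesses to avoid. "
    "Then give the final code."
)


def with_chain_of_thought(task_prompt: str) -> str:
    """Append a step-by-step reasoning instruction to the user's prompt."""
    return f"{task_prompt}\n\n{COT_SUFFIX}"


def self_correct(task_prompt: str, generate: Callable[[str], str]) -> str:
    """Two-pass self-correction: draft first, then ask for a security review.

    `generate` is a hypothetical function that sends a prompt to a model
    and returns its text output.
    """
    draft = generate(task_prompt)
    review_prompt = (
        f"Task:\n{task_prompt}\n\n"
        f"Draft solution:\n{draft}\n\n"
        "Review the draft for security weaknesses and return a corrected "
        "version of the code."
    )
    return generate(review_prompt)
```

Because both wrappers leave the original task text untouched, they can be applied even when the user's prompt is itself incomplete or unclear, which is the scenario the paper targets.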
Where Pith is reading between the lines
- If the pattern holds, prompt-quality guidelines could be built into developer tools so that insecure code suggestions are flagged before they are accepted.
- The result suggests that teaching basic prompt-writing skills may be a faster route to safer AI coding assistants than waiting for model-level fixes.
- Similar experiments could test whether the same normativity-security link appears when the target language shifts from Python to other languages or when the tasks fall outside Common Weakness Enumeration (CWE) categories.
Load-bearing premise
The three prompt-quality dimensions and the four-level normativity scale in CWE-BENCH-PYTHON isolate the factors that actually cause insecure code rather than merely tracking some other unmeasured variable.
What would settle it
Running the same low-normativity prompts on a fresh set of models or tasks while keeping every other input fixed and finding no measurable rise in insecure code outputs would falsify the reported correlation.
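A replication check of this kind could be as simple as the sketch below, which computes the insecure-code rate per level and tests whether it rises strictly as normativity drops. The record format (a level label paired with a boolean verdict) and the function names are assumptions, and the sample records are made-up toy data, not results from the paper.

```python
from collections import defaultdict

LEVELS_HIGH_TO_LOW = ["L3", "L2", "L1", "L0"]


def insecure_rate_by_level(records):
    """records: iterable of (level, is_insecure) pairs, one per generated sample."""
    totals = defaultdict(int)
    insecure = defaultdict(int)
    for level, is_insecure in records:
        totals[level] += 1
        insecure[level] += int(is_insecure)
    return {lvl: insecure[lvl] / totals[lvl] for lvl in totals}


def rises_as_normativity_drops(rates):
    """True if the insecure rate strictly increases from L3 down to L0."""
    ordered = [rates[lvl] for lvl in LEVELS_HIGH_TO_LOW if lvl in rates]
    return all(a < b for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    # Toy data for illustration only.
    toy_records = (
        [("L3", False)] * 4
        + [("L2", True)] * 1 + [("L2", False)] * 3
        + [("L1", True)] * 2 + [("L1", False)] * 2
        + [("L0", True)] * 3 + [("L0", False)] * 1
    )
    rates = insecure_rate_by_level(toy_records)
    print(rates, rises_as_normativity_drops(rates))
```

A flat or non-monotonic result on a fresh set of models or tasks would be the kind of outcome that counts against the reported trend.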
Original abstract
Large language models (LLMs) have become indispensable for automated code generation, yet the quality and security of their outputs remain a critical concern. Existing studies predominantly concentrate on adversarial attacks or inherent flaws within the models. However, a more prevalent yet underexplored issue concerns how the quality of a benign but poorly formulated prompt affects the security of the generated code. To investigate this, we first propose an evaluation framework for prompt quality encompassing three key dimensions: goal clarity, information completeness, and logical consistency. Based on this framework, we construct and publicly release CWE-BENCH-PYTHON, a large-scale benchmark dataset containing tasks with prompts categorized into four distinct levels of normativity (L0-L3). Extensive experiments on multiple state-of-the-art LLMs reveal a clear correlation: as prompt normativity decreases, the likelihood of generating insecure code consistently and markedly increases. Furthermore, we demonstrate that advanced prompting techniques, such as Chain-of-Thought and Self-Correction, effectively mitigate the security risks introduced by low-quality prompts, substantially improving code safety. Our findings highlight that enhancing the quality of user prompts constitutes a critical and effective strategy for strengthening the security of AI-generated code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that lower-quality prompts, as measured by a three-dimensional framework of goal clarity, information completeness, and logical consistency, lead to higher rates of insecure code generation by LLMs. The authors construct and release CWE-BENCH-PYTHON, a benchmark with prompts categorized into four normativity levels (L0-L3), and report experimental results across multiple LLMs showing a consistent increase in security defects with decreasing normativity. They further claim that Chain-of-Thought and Self-Correction prompting techniques substantially mitigate these risks.
Significance. If the central correlation holds after addressing methodological gaps, the work has clear practical value for secure LLM-assisted coding by emphasizing prompt quality as a controllable factor. The public release of CWE-BENCH-PYTHON is a concrete strength that supports reproducibility and follow-on research. The findings extend beyond adversarial prompt attacks to everyday benign but low-normativity prompts.
major comments (3)
- [§4] §4 (Benchmark Construction): The L0-L3 normativity assignments rely on author-applied judgments of the three prompt-quality dimensions without reported inter-rater reliability, blinding, or external validation. Because the same framework is used both to generate the prompt variants and to measure their effect on security defects, this introduces a risk that lower-normativity prompts were systematically constructed to omit security constraints, rendering the reported correlation partly tautological rather than an independent empirical finding.
- [§5] §5 (Experiments): The manuscript reports a 'clear correlation' and 'marked increase' in insecure code but provides no statistical tests, effect sizes, confidence intervals, or controls for confounding variables such as prompt length, temperature, or model-specific decoding settings. Without these, it is impossible to determine whether the observed defect induction rates are robust or driven by unmeasured factors.
- [§5.2] §5.2 (Defect Verification): The abstract and experimental sections do not describe how security defects were identified or verified (static analyzers, manual CWE labeling, or automated matching). This detail is load-bearing for the central claim about defect induction rates and must be specified to allow assessment of measurement validity.
minor comments (2)
- [§3] The three-dimensional framework is introduced clearly in the abstract but would benefit from an explicit table or figure in §3 that maps example prompts to each dimension and normativity level.
- [Results] Figure captions and axis labels in the results section should explicitly state the number of samples per prompt level and per model to aid interpretation of the plotted rates.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major point below, outlining specific revisions and clarifications that will strengthen the work while preserving its core contributions.
Referee: [§4] §4 (Benchmark Construction): The L0-L3 normativity assignments rely on author-applied judgments of the three prompt-quality dimensions without reported inter-rater reliability, blinding, or external validation. Because the same framework is used both to generate the prompt variants and to measure their effect on security defects, this introduces a risk that lower-normativity prompts were systematically constructed to omit security constraints, rendering the reported correlation partly tautological rather than an independent empirical finding.
Authors: We acknowledge the validity of this methodological concern. The normativity levels were assigned according to explicit, predefined criteria across the three dimensions, and prompt variants were created by systematically reducing goal clarity, information completeness, or logical consistency rather than by selectively removing security-related details. Nevertheless, to eliminate any perception of circularity, the revised manuscript will add an inter-rater reliability subsection. We will recruit three independent annotators, provide them with the same dimension definitions, and report agreement statistics (Fleiss’ kappa). We will also explicitly state that defect measurement occurs downstream via independent static analysis and manual CWE mapping, decoupled from the initial prompt-quality labeling. These additions will demonstrate that the observed correlation is empirically grounded rather than definitional. revision: partial
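For readers unfamiliar with the agreement statistic promised here, the following is a minimal pure-Python sketch of Fleiss' kappa. The input layout (one row per prompt, one column per normativity level, each cell holding the number of annotators who chose that level) is an assumption about how the annotations might be tabulated; the example ratings are invented.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts[i][j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Per-item agreement P_i, then its mean.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Invented example: 4 prompts, 3 annotators, categories L0..L3.
    ratings = [
        [3, 0, 0, 0],
        [0, 2, 1, 0],
        [0, 0, 3, 0],
        [0, 0, 1, 2],
    ]
    print(round(fleiss_kappa(ratings), 3))
```

Values near 1 indicate that annotators assign the same level far more often than chance would predict; values near 0 indicate chance-level agreement.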
Referee: [§5] §5 (Experiments): The manuscript reports a 'clear correlation' and 'marked increase' in insecure code but provides no statistical tests, effect sizes, confidence intervals, or controls for confounding variables such as prompt length, temperature, or model-specific decoding settings. Without these, it is impossible to determine whether the observed defect induction rates are robust or driven by unmeasured factors.
Authors: We agree that quantitative rigor is required to support the central claims. In the revision we will augment Section 5 with chi-squared tests comparing insecure-code proportions across normativity levels, report effect sizes (Cramér’s V), and include 95% confidence intervals for all defect rates. We will also document the full experimental controls: temperature fixed at 0.7 for all models, prompt lengths balanced within ±10 tokens across L0–L3 variants for each task, and identical decoding parameters (top-p = 0.95, max tokens = 512). These additions will allow readers to assess the robustness of the reported trends. revision: yes
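The statistics named in this response can be sketched with the standard library alone. The code below computes a Pearson chi-squared statistic for a levels-by-outcome contingency table, Cramér's V, a p-value using the closed-form survival function for 3 degrees of freedom (the 4x2 case), and a 95% interval. The Wilson score interval is my choice of interval method, since the rebuttal does not say which one will be used, and the counts in the example are invented rather than taken from the paper.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic; assumes no row or column sums to zero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n


def cramers_v(stat, n, n_rows, n_cols):
    """Effect size for a chi-squared test on an n_rows x n_cols table."""
    return math.sqrt(stat / (n * (min(n_rows, n_cols) - 1)))


def chi2_sf_df3(x):
    """P(X > x) for a chi-squared variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes out of n (z=1.96 gives ~95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Invented counts, rows L0..L3, columns [insecure, secure].
    table = [[40, 60], [30, 70], [20, 80], [10, 90]]
    stat, n = chi_squared(table)
    print("chi2:", stat, "p:", chi2_sf_df3(stat))
    print("Cramér's V:", cramers_v(stat, n, len(table), len(table[0])))
    print("L0 insecure rate CI:", wilson_interval(40, 100))
```

In practice a statistics package would handle arbitrary degrees of freedom; the closed form is used here only to keep the sketch dependency-free.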
Referee: [§5.2] §5.2 (Defect Verification): The abstract and experimental sections do not describe how security defects were identified or verified (static analyzers, manual CWE labeling, or automated matching). This detail is load-bearing for the central claim about defect induction rates and must be specified to allow assessment of measurement validity.
Authors: We regret the omission of this essential methodological detail. The revised Section 5.2 will provide a complete description of the verification pipeline: (1) automated scanning with Bandit and Semgrep configured for Python CWE rules, (2) subsequent manual review by two authors to confirm true positives and map findings to specific CWE identifiers, and (3) resolution of disagreements via discussion. We will also report inter-annotator agreement for the manual labeling step. This transparent account will enable proper evaluation of measurement validity. revision: yes
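Steps (2) and (3) of this pipeline, plus the promised agreement figure, can be illustrated as follows. The sketch assumes the scanner output has already been parsed into a list of findings; the dictionary keys, the function names, and the sample findings are hypothetical. Cohen's kappa is used because the rebuttal describes two manual reviewers, though the authors do not name the statistic they will report.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same findings."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when both reviewers use one label")
    return (observed - expected) / (1.0 - expected)


def split_findings(findings, verdicts_a, verdicts_b):
    """Separate findings both reviewers confirm from those needing discussion.

    A verdict of True means the reviewer judged the finding a true positive.
    Findings both reviewers reject are dropped as false positives.
    """
    confirmed, disputed = [], []
    for finding, a, b in zip(findings, verdicts_a, verdicts_b):
        if a != b:
            disputed.append(finding)
        elif a:
            confirmed.append(finding)
    return confirmed, disputed


if __name__ == "__main__":
    # Hypothetical parsed scanner output.
    findings = [
        {"tool": "bandit", "cwe": "CWE-78"},
        {"tool": "semgrep", "cwe": "CWE-89"},
        {"tool": "bandit", "cwe": "CWE-327"},
        {"tool": "semgrep", "cwe": "CWE-78"},
    ]
    reviewer_a = [True, True, False, False]
    reviewer_b = [True, False, False, False]
    confirmed, disputed = split_findings(findings, reviewer_a, reviewer_b)
    print(confirmed, disputed, cohens_kappa(reviewer_a, reviewer_b))
```

Keeping disputed findings in a separate list makes the discussion step auditable: the final defect count is the confirmed list plus whatever the reviewers agree on after discussion.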
Circularity Check
Empirical measurement study with independent benchmark and experiments; no derivation reduces to inputs by construction
The paper defines a three-dimensional prompt quality framework (goal clarity, information completeness, logical consistency) and applies it to construct the CWE-BENCH-PYTHON benchmark with L0-L3 normativity levels. It then runs fresh experiments on multiple LLMs, measures defect induction rates in generated code, and tests mitigation via Chain-of-Thought and Self-Correction. No equations, fitted parameters, or self-citation chains are present that would make the reported correlation equivalent to the input categorization by construction. The security outcomes are evaluated independently on the generated code, rendering the central claim falsifiable and self-contained against external benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we first propose an evaluation framework for prompt quality encompassing three key dimensions: goal clarity, information completeness, and logical consistency... four distinct levels of normativity (L0–L3)"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "as prompt normativity decreases, the likelihood of generating insecure code consistently and markedly increases"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.