Is Your Prompt Poisoning Code? Defect Induction Rates and Security Mitigation Strategies

Bin Wang; Hui Li; MiDi Wan; WenJie Yu; Yenan Huang; YiLu Zhong; YuanBing Ouyang

arxiv: 2510.22944 · v2 · submitted 2025-10-27 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-18 03:50 UTC · model grok-4.3

keywords: prompt quality, code security, LLM code generation, insecure code, prompt engineering, CWE benchmark, defect induction, security mitigation

The pith

Lower prompt quality markedly increases the rate at which LLMs generate insecure code, while techniques like chain-of-thought reduce the risk.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that the quality of ordinary user prompts shapes the security of code produced by large language models. It introduces a three-part framework that scores prompts on goal clarity, information completeness, and logical consistency, then uses this framework to build CWE-BENCH-PYTHON, a dataset of programming tasks whose prompts fall into four graded levels of normativity. Experiments across several current models reveal a steady rise in insecure outputs as prompts move from high to low normativity. The same experiments also show that two established prompting methods, chain-of-thought and self-correction, cut the extra security defects introduced by weak prompts. A reader would care because prompts are the only lever most users have when they ask an LLM to write code.

Core claim

The central claim is that prompt normativity, measured by the three-dimensional quality framework of goal clarity, information completeness, and logical consistency, directly affects code security. As normativity drops from level L3 to L0 in the CWE-BENCH-PYTHON tasks, the frequency of generated insecure code rises consistently across the tested models. Chain-of-thought and self-correction prompting substantially lower those defect rates even when the underlying prompt remains low-normativity.
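
To make the quantity behind this claim concrete, here is a minimal sketch of a per-level insecure-code rate. The `Sample` record, its field names, and the counts are assumptions made for illustration; none of them come from the paper or its dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One generated program; all field names are illustrative assumptions."""
    level: str      # normativity level of the prompt: "L0" (lowest) to "L3" (highest)
    insecure: bool  # True if the generated code was judged to contain a CWE weakness


def insecure_rate_by_level(samples):
    """Return {level: fraction of samples at that level judged insecure}."""
    totals = Counter(s.level for s in samples)
    insecure = Counter(s.level for s in samples if s.insecure)
    return {level: insecure[level] / totals[level] for level in sorted(totals)}


# Made-up records for illustration only; these are not results from the paper.
samples = (
    [Sample("L3", False)] * 9 + [Sample("L3", True)] * 1
    + [Sample("L0", False)] * 6 + [Sample("L0", True)] * 4
)
rates = insecure_rate_by_level(samples)
print(rates)  # {'L0': 0.4, 'L3': 0.1}
```

Comparing these rates level by level is what the claimed trend amounts to; whether a difference is meaningful still depends on sample sizes and on how `insecure` was decided.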

What carries the argument

The three-dimensional prompt quality framework (goal clarity, information completeness, logical consistency) used to assign prompts to four normativity levels (L0-L3) inside the CWE-BENCH-PYTHON benchmark dataset.

If this is right

Raising prompt normativity offers a direct way to lower security defects in LLM-generated code without retraining models.
Chain-of-thought and self-correction prompting can be applied as standard safeguards when users must work with incomplete or unclear task descriptions.
The CWE-BENCH-PYTHON dataset supplies a reusable testbed for measuring how prompt changes affect code security across different models.
Security evaluations of code generators should include prompt-quality controls in addition to model architecture tests.
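
As a rough illustration of the two safeguards named in this list, the sketch below wraps a task prompt with a chain-of-thought instruction and runs one self-correction pass. The instruction wording, the function names, and the `generate` callable are all hypothetical; the paper's actual templates are not shown on this page.

```python
from typing import Callable

# Hypothetical wording; the paper's actual templates are not reproduced here.
COT_SUFFIX = (
    "\n\nBefore writing code, reason step by step about the inputs, "
    "the trust boundaries, and the ways this task could be implemented insecurely. "
    "Then write the final Python code."
)

REVIEW_TEMPLATE = (
    "Review the following Python code for security weaknesses. "
    "If you find any, return a corrected version; otherwise return it unchanged.\n\n{code}"
)


def with_chain_of_thought(task_prompt: str) -> str:
    """Append a reasoning instruction to the user's (possibly vague) prompt."""
    return task_prompt + COT_SUFFIX


def generate_with_self_correction(
    task_prompt: str,
    generate: Callable[[str], str],
    rounds: int = 1,
) -> str:
    """Generate code, then ask the same model to review and revise it.

    `generate` is a placeholder for any text-in, text-out model call.
    """
    code = generate(task_prompt)
    for _ in range(rounds):
        code = generate(REVIEW_TEMPLATE.format(code=code))
    return code


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return "REVISED" if prompt.startswith("Review") else "DRAFT"


print(generate_with_self_correction("parse the upload", fake_model))  # REVISED
```

Passing the model call in as a function keeps the sketch independent of any particular provider, so the same wrappers could be compared across models.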

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pattern holds, prompt-quality guidelines could be built into developer tools so that insecure code suggestions are flagged before they are accepted.
The result suggests that teaching basic prompt-writing skills may be a faster route to safer AI coding assistants than waiting for model-level fixes.
Similar experiments could test whether the same normativity-security link appears when the target language shifts from Python to other languages or when the task moves outside common weakness enumeration categories.

Load-bearing premise

The three prompt-quality dimensions and the four-level normativity scale in CWE-BENCH-PYTHON isolate the factors that actually cause insecure code rather than merely tracking some other unmeasured variable.

What would settle it

Running the same low-normativity prompts on a fresh set of models or tasks while keeping every other input fixed and finding no measurable rise in insecure code outputs would falsify the reported correlation.

Original abstract

Large language models (LLMs) have become indispensable for automated code generation, yet the quality and security of their outputs remain a critical concern. Existing studies predominantly concentrate on adversarial attacks or inherent flaws within the models. However, a more prevalent yet underexplored issue concerns how the quality of a benign but poorly formulated prompt affects the security of the generated code. To investigate this, we first propose an evaluation framework for prompt quality encompassing three key dimensions: goal clarity, information completeness, and logical consistency. Based on this framework, we construct and publicly release CWE-BENCH-PYTHON, a large-scale benchmark dataset containing tasks with prompts categorized into four distinct levels of normativity (L0-L3). Extensive experiments on multiple state-of-the-art LLMs reveal a clear correlation: as prompt normativity decreases, the likelihood of generating insecure code consistently and markedly increases. Furthermore, we demonstrate that advanced prompting techniques, such as Chain-of-Thought and Self-Correction, effectively mitigate the security risks introduced by low-quality prompts, substantially improving code safety. Our findings highlight that enhancing the quality of user prompts constitutes a critical and effective strategy for strengthening the security of AI-generated code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows lower-normativity prompts raise insecure code rates across LLMs and that CoT/self-correction helps, backed by a new released benchmark, but the prompt grading process needs close checking for bias.

The main thing to know is that this work finds a consistent link between poorer prompts and higher rates of security defects in LLM-generated Python code, with some standard prompting fixes reducing the problem. The authors built and released CWE-BENCH-PYTHON to measure it directly.

The paper lays out a three-axis prompt quality framework (goal clarity, information completeness, logical consistency) and uses it to create four graded levels of prompts for the same tasks. The authors then run those prompts through several current LLMs and track how often the outputs contain CWE-style weaknesses. The reported pattern is that defect rates climb as the prompts get less normative, and techniques like chain-of-thought or self-correction bring the rates back down.

Releasing the benchmark is the clearest positive here; it gives others a concrete way to test or extend the result instead of just reading claims. The experiments cover multiple models, which adds some breadth.

The soft spot is the prompt categorization itself. If the L0-L3 assignments were done by the authors without blinding or independent validation, lower-normativity versions could have been written in ways that already omit security details or introduce ambiguity that directly triggers the defects being measured. That would turn part of the correlation into a built-in feature rather than an independent finding. The abstract also skips statistical details, effect sizes, the defect verification method, and generation parameters like temperature, so reproducibility is hard to judge from what is shown.

This is the kind of paper that matters to people working on secure LLM code generation or prompt engineering for production use. It is not revolutionary but adds a practical, measurable angle that the field can use. I would send it to peer review so referees can examine the benchmark construction and run their own checks on the grading process.

Referee Report

3 major / 2 minor

Summary. The paper claims that lower-quality prompts, as measured by a three-dimensional framework of goal clarity, information completeness, and logical consistency, lead to higher rates of insecure code generation by LLMs. The authors construct and release CWE-BENCH-PYTHON, a benchmark with prompts categorized into four normativity levels (L0-L3), and report experimental results across multiple LLMs showing a consistent increase in security defects with decreasing normativity. They further claim that Chain-of-Thought and Self-Correction prompting techniques substantially mitigate these risks.

Significance. If the central correlation holds after addressing methodological gaps, the work has clear practical value for secure LLM-assisted coding by emphasizing prompt quality as a controllable factor. The public release of CWE-BENCH-PYTHON is a concrete strength that supports reproducibility and follow-on research. The findings extend beyond adversarial prompt attacks to everyday benign but low-normativity prompts.

major comments (3)

[§4] Benchmark Construction: The L0-L3 normativity assignments rely on author-applied judgments of the three prompt-quality dimensions without reported inter-rater reliability, blinding, or external validation. Because the same framework is used both to generate the prompt variants and to measure their effect on security defects, this introduces a risk that lower-normativity prompts were systematically constructed to omit security constraints, rendering the reported correlation partly tautological rather than an independent empirical finding.
[§5] Experiments: The manuscript reports a 'clear correlation' and 'marked increase' in insecure code but provides no statistical tests, effect sizes, confidence intervals, or controls for confounding variables such as prompt length, temperature, or model-specific decoding settings. Without these, it is impossible to determine whether the observed defect induction rates are robust or driven by unmeasured factors.
[§5.2] Defect Verification: The abstract and experimental sections do not describe how security defects were identified or verified (static analyzers, manual CWE labeling, or automated matching). This detail is load-bearing for the central claim about defect induction rates and must be specified to allow assessment of measurement validity.

minor comments (2)

[§3] The three-dimensional framework is introduced clearly in the abstract but would benefit from an explicit table or figure in §3 that maps example prompts to each dimension and normativity level.
[Results] Figure captions and axis labels in the results section should explicitly state the number of samples per prompt level and per model to aid interpretation of the plotted rates.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major point below, outlining specific revisions and clarifications that will strengthen the work while preserving its core contributions.

Referee: [§4] Benchmark Construction: The L0-L3 normativity assignments rely on author-applied judgments of the three prompt-quality dimensions without reported inter-rater reliability, blinding, or external validation. Because the same framework is used both to generate the prompt variants and to measure their effect on security defects, this introduces a risk that lower-normativity prompts were systematically constructed to omit security constraints, rendering the reported correlation partly tautological rather than an independent empirical finding.

Authors: We acknowledge the validity of this methodological concern. The normativity levels were assigned according to explicit, predefined criteria across the three dimensions, and prompt variants were created by systematically reducing goal clarity, information completeness, or logical consistency rather than by selectively removing security-related details. Nevertheless, to eliminate any perception of circularity, the revised manuscript will add an inter-rater reliability subsection. We will recruit three independent annotators, provide them with the same dimension definitions, and report agreement statistics (Fleiss’ kappa). We will also explicitly state that defect measurement occurs downstream via independent static analysis and manual CWE mapping, decoupled from the initial prompt-quality labeling. These additions will demonstrate that the observed correlation is empirically grounded rather than definitional.
Revision: partial

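
The agreement statistic promised in this response, Fleiss' kappa, can be computed directly from a count table. The sketch below is a plain-Python version that assumes a fixed number of annotators per prompt; the example table is made up and does not reflect any annotation the authors performed.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a count table of shape (items x categories).

    ratings[i][j] is the number of annotators who put item i in category j;
    every row must sum to the same number of annotators.
    Undefined (division by zero) if every rating falls in a single category.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")

    n_categories = len(ratings[0])
    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of annotator pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical table: 5 prompts, 3 annotators, columns are levels L0..L3.
table = [
    [3, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 2, 1, 0],
    [0, 0, 3, 0],
    [0, 0, 1, 2],
]
print(round(fleiss_kappa(table), 3))
```

Fleiss' kappa treats the levels as unordered categories, so a disagreement between L0 and L3 counts the same as one between L1 and L2; an ordinal measure might suit graded levels better, but that choice is outside what the response states.
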
Referee: [§5] Experiments: The manuscript reports a 'clear correlation' and 'marked increase' in insecure code but provides no statistical tests, effect sizes, confidence intervals, or controls for confounding variables such as prompt length, temperature, or model-specific decoding settings. Without these, it is impossible to determine whether the observed defect induction rates are robust or driven by unmeasured factors.

Authors: We agree that quantitative rigor is required to support the central claims. In the revision we will augment Section 5 with chi-squared tests comparing insecure-code proportions across normativity levels, report effect sizes (Cramér’s V), and include 95% confidence intervals for all defect rates. We will also document the full experimental controls: temperature fixed at 0.7 for all models, prompt lengths balanced within ±10 tokens across L0–L3 variants for each task, and identical decoding parameters (top-p = 0.95, max tokens = 512). These additions will allow readers to assess the robustness of the reported trends.
Revision: yes

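
The three statistics named in this response can be sketched in a few lines of standard-library Python. The counts below are invented, and the Wilson score interval is an assumption on my part: the response says 95% confidence intervals but does not name a method.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def cramers_v(table):
    """Cramér's V effect size, between 0 (no association) and 1."""
    stat, _ = chi_squared(table)
    total = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(stat / (total * (k - 1)))


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Made-up counts: rows are L0..L3, columns are [insecure, secure].
counts = [[40, 60], [30, 70], [20, 80], [10, 90]]
stat, dof = chi_squared(counts)
print(round(stat, 2), dof, round(cramers_v(counts), 3))
print([round(x, 3) for x in wilson_interval(40, 100)])
```

Turning the statistic into a p-value needs the chi-squared distribution, which the standard library lacks; a routine such as SciPy's `chi2_contingency` would return it alongside the statistic.
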
Referee: [§5.2] Defect Verification: The abstract and experimental sections do not describe how security defects were identified or verified (static analyzers, manual CWE labeling, or automated matching). This detail is load-bearing for the central claim about defect induction rates and must be specified to allow assessment of measurement validity.

Authors: We regret the omission of this essential methodological detail. The revised Section 5.2 will provide a complete description of the verification pipeline: (1) automated scanning with Bandit and Semgrep configured for Python CWE rules, (2) subsequent manual review by two authors to confirm true positives and map findings to specific CWE identifiers, and (3) resolution of disagreements via discussion. We will also report inter-annotator agreement for the manual labeling step. This transparent account will enable proper evaluation of measurement validity.
Revision: yes
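
The three-step pipeline described in this response could be organized roughly as follows. This is a sketch of the bookkeeping only: it does not call Bandit or Semgrep, and the `Finding` record, both function names, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A scanner hit, reduced to fields assumed here for illustration."""
    sample_id: str  # which generated program the hit belongs to
    line: int       # line number reported by the scanner
    cwe: str        # CWE identifier, e.g. "CWE-89"


def merge_findings(*scanner_outputs):
    """Step 1: union the hits from several scanners, dropping exact duplicates."""
    merged = set()
    for output in scanner_outputs:
        merged.update(output)
    return sorted(merged, key=lambda f: (f.sample_id, f.line, f.cwe))


def triage(findings, reviewer_a, reviewer_b):
    """Steps 2 and 3: split findings into confirmed, rejected, and disputed.

    reviewer_a and reviewer_b map each Finding to True (true positive) or False.
    Disputed findings are the ones that would go to discussion.
    """
    confirmed, rejected, disputed = [], [], []
    for f in findings:
        a, b = reviewer_a[f], reviewer_b[f]
        if a and b:
            confirmed.append(f)
        elif not a and not b:
            rejected.append(f)
        else:
            disputed.append(f)
    return confirmed, rejected, disputed


# Made-up hits standing in for parsed scanner reports.
first_scanner = [Finding("task1_L0", 12, "CWE-89"), Finding("task2_L0", 4, "CWE-78")]
second_scanner = [Finding("task1_L0", 12, "CWE-89"), Finding("task3_L1", 7, "CWE-22")]
findings = merge_findings(first_scanner, second_scanner)

reviewer_a = {findings[0]: True, findings[1]: True, findings[2]: False}
reviewer_b = {findings[0]: True, findings[1]: False, findings[2]: False}
confirmed, rejected, disputed = triage(findings, reviewer_a, reviewer_b)
print(len(confirmed), len(rejected), len(disputed))  # 1 1 1
```

Keeping the disputed set separate makes it easy to report how many findings needed discussion, which is the raw material for the inter-annotator agreement the authors say they will report.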

Circularity Check

0 steps flagged

Empirical measurement study with independent benchmark and experiments; no derivation reduces to inputs by construction

The paper defines a three-dimensional prompt quality framework (goal clarity, information completeness, logical consistency) and applies it to construct the CWE-BENCH-PYTHON benchmark with L0-L3 normativity levels. It then runs fresh experiments on multiple LLMs, measures defect induction rates in generated code, and tests mitigation via Chain-of-Thought and Self-Correction. No equations, fitted parameters, or self-citation chains are present that would make the reported correlation equivalent to the input categorization by construction. The security outcomes are evaluated independently on the generated code, rendering the central claim falsifiable and self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The work rests on standard assumptions that CWE categories are valid security defects and that LLM outputs can be reliably evaluated for security.

pith-pipeline@v0.9.0 · 5755 in / 1069 out tokens · 34942 ms · 2026-05-18T03:50:56.606053+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: "we first propose an evaluation framework for prompt quality encompassing three key dimensions: goal clarity, information completeness, and logical consistency... four distinct levels of normativity (L0–L3)"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "as prompt normativity decreases, the likelihood of generating insecure code consistently and markedly increases"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.