Surgical Repair of Insecure Code Generation in LLMs

Brendan Dolan-Gavitt; Gustavo Sandoval; Siddharth Garg

arxiv: 2604.16697 · v1 · submitted 2026-04-17 · 💻 cs.CR · cs.LG

Pith reviewed 2026-05-10 07:47 UTC · model grok-4.3

keywords: LLM code generation, security vulnerabilities, mechanistic interpretability, steering vectors, format compliance, activation intervention, insecure code repair

The pith

LLMs encode security knowledge early but let format compliance override it in the final layer, creating insecure code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that large language models generate insecure code despite possessing the relevant security knowledge, as shown by their ability to correctly identify vulnerabilities when asked directly. This Format-Reliability Gap arises because security-related representations form in early layers yet stay inactive until the final layer, where they compete directly with pressures to produce properly formatted output. Because the conflict is confined to one layer, the authors derive per-vulnerability steering vectors that shift activations there and cut insecure generations by up to 74 percent. The same pattern and fix hold across five models, three architecture families, and six vulnerability types, framing the problem as one of internal computation rather than missing knowledge.

Core claim

Large language models that write insecure code can still recognize and explain the same vulnerabilities when queried directly. Security representations are computed from the earliest layers but remain computationally inert until the final layer, where format-compliance demands compete with them and suppress secure outputs. Because the failure localizes to a single layer, applying targeted steering vectors at that point reduces insecure code generation by up to 74 percent with negligible overhead, and the pattern generalizes across models and vulnerability classes.

What carries the argument

The Format-Reliability Gap, where early-encoded security representations are overridden in the final layer by format-compliance pressures; the fix uses per-vulnerability steering vectors that intervene only at that layer to restore security priority.
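
The abstract does not say how the steering vectors are derived, so the following is only a minimal sketch of one common construction: a difference of means between activations recorded on secure and insecure completions, added to the final-layer hidden state. The function names, the scaling factor `alpha`, and the toy activations are all hypothetical and are not taken from the paper.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(secure_acts, insecure_acts):
    """Difference of means: points from the insecure cluster toward the secure one."""
    mu_secure = mean_vector(secure_acts)
    mu_insecure = mean_vector(insecure_acts)
    return [s - i for s, i in zip(mu_secure, mu_insecure)]


def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a single final-layer hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 3-dimensional activations (made-up numbers, not from the paper).
secure_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
insecure_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v = steering_vector(secure_acts, insecure_acts)
steered = apply_steering([0.0, 2.0, 2.0], v, alpha=0.5)
print(v)        # [2.0, -2.0, 0.0]
print(steered)  # [1.0, 1.0, 2.0]
```

In a real model the same addition would be applied through a hook on the last transformer block, with one vector per vulnerability class. How `alpha` is chosen is exactly the kind of free parameter the ledger further down flags.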

Load-bearing premise

That the competition in the final layer between security representations and format compliance is the primary driver of insecure code rather than other factors such as training data or model architecture.

What would settle it

An experiment in which the derived steering vectors are applied to the final layer of a tested model yet produce no measurable reduction in insecure code outputs for the corresponding vulnerability.

Figures

Figures reproduced from arXiv: 2604.16697 by Brendan Dolan-Gavitt, Gustavo Sandoval, Siddharth Garg.

**Figure 2.** Probe accuracy vs. logit lens emergence. Left axis: probe accuracy (100% from Layer 0). Right axis: P(snprintf) via logit lens and tuned lens, near zero through Layers 0–30, spiking at Layer 31. Three phases: Early Encoding, Latent Propagation, Sudden Emergence.

**Figure 3.** Cross-model CWE-787 steering effectiveness. Gray bars show baseline secure rates;

**Figure 4.** Logit lens emergence patterns across architectures. Each panel shows P(secure token)
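
The captions for Figures 2 and 4 rely on the logit lens, which reads an intermediate hidden state as if it were the final one by projecting it through the unembedding matrix. The sketch below shows that computation on a toy model. The two-token vocabulary, the identity unembedding matrix, and the hidden states are invented for illustration, and the final layer norm that a real logit lens usually applies is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_by_layer, unembed, token_id):
    """Return P(token_id) obtained by unembedding each layer's hidden state."""
    probs = []
    for hidden in hidden_by_layer:
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in unembed]
        probs.append(softmax(logits)[token_id])
    return probs


# Toy setup: token 0 stands for "sprintf", token 1 for "snprintf".
unembed = [[1.0, 0.0], [0.0, 1.0]]
# The secure token only dominates in the last layer of this made-up trace.
hidden_by_layer = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]

p_secure = logit_lens(hidden_by_layer, unembed, token_id=1)
print([round(p, 3) for p in p_secure])
```

The toy trace mimics the shape the caption describes: the secure-token probability stays near zero until the last layer. A linear probe is a different measurement. It can succeed at Layer 0 even when the logit lens shows nothing, and that contrast is what Figure 2 plots.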

Original abstract

Large language models write production code, and yet they routinely introduce well-known vulnerabilities. We show that this is not a knowledge deficit: the same models that generate insecure code correctly identify and explain the vulnerability when asked directly, a gap we call the Format-Reliability Gap. Mechanistic analysis reveals the cause: security representations are encoded from the earliest layers but remain computationally inert until the final layer, where format-compliance demands compete with them. Because the failure is localized to a single layer, per-vulnerability steering vectors reduce insecure generation by up to 74% with negligible overhead. The mechanism and the fix generalize across five models, three architecture families, and six vulnerability types, suggesting insecure code generation is an interpretability problem, not a training artifact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper claims LLMs know about security vulnerabilities but fail to apply that knowledge in code generation due to a final-layer conflict with format demands, and that targeted steering vectors can reduce insecure outputs by up to 74% across models.

The paper's main claim is that insecure code generation is not a knowledge problem but a Format-Reliability Gap: the models correctly spot vulnerabilities when asked directly, yet security signals stay inert until the final layer where they lose out to format compliance. They localize the issue through mechanistic analysis and then apply per-vulnerability steering vectors as a lightweight fix that cuts insecure generations by up to 74% with little overhead. The results hold across five models, three architecture families, and six vulnerability types.

That cross-model consistency is the strongest part of the work and makes the steering approach look more general than a single-model trick. The intervention itself is practical and avoids the cost of retraining, which is a clear plus for anyone trying to deploy safer code assistants.

The soft spot is the causal story behind the localization. The abstract presents the final-layer competition as the driver, but without explicit tests showing that equivalent interventions on earlier layers produce no benefit or that format circuits are specifically modulated, it is hard to rule out that the steering is simply shifting token probabilities in a broad way. Details on how the 74% number was computed, what baselines were used, and whether vulnerability selection was pre-specified would also help close that gap.

This is the kind of paper that belongs in a reading group for people working on LLM interpretability and secure code generation. It has enough structure and empirical breadth to deserve peer review rather than a desk reject, even if the mechanism section will likely need tightening on the interventional evidence.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs generate insecure code not due to a knowledge deficit (as they correctly identify vulnerabilities when queried directly) but because of a 'Format-Reliability Gap': security representations are encoded from the earliest layers yet remain computationally inert until the final layer, where they compete with format-compliance demands. Mechanistic analysis identifies this localization, enabling per-vulnerability steering vectors that reduce insecure generation by up to 74% with negligible overhead. The mechanism and intervention generalize across five models, three architecture families, and six vulnerability types.

Significance. If the mechanistic localization and causal efficacy of the steering vectors hold, the work is significant for reframing insecure code generation as an interpretability problem amenable to targeted, low-cost interventions rather than retraining or data curation. The reported generalization across models and vulnerability types, combined with the practical 74% reduction, would represent a concrete advance in secure code generation if the experimental controls confirm specificity over non-specific output biasing.

major comments (3)

[Abstract / Mechanistic Analysis] Abstract and mechanistic analysis section: the claim that security representations 'remain computationally inert until the final layer' where format-compliance demands compete with them is load-bearing for the 'surgical' nature of the fix, yet the provided description relies on activation differences and post-hoc steering success without reporting interventional tests (e.g., activation patching or circuit ablation) on earlier layers to establish that inertness is causal rather than correlational. Alternative explanations such as general token-distribution suppression remain viable without those controls.
[Intervention Experiments] Steering vector construction: the vectors are built from the same models' internal activations used to identify the Format-Reliability Gap, creating a potential circular dependence. While generalization to five distinct models provides partial grounding, the manuscript should include explicit ablations (e.g., vectors derived from unrelated activations or random directions) to demonstrate that the 74% reduction is not an artifact of how vulnerability prompts were constructed or of non-specific final-layer modulation.
[Results] Experimental reporting: the 74% reduction figure and claims of negligible overhead require specification of the exact evaluation protocol, including whether vulnerability/layer selection was post-hoc, the number of samples per condition, statistical tests, and direct comparisons to baseline interventions (e.g., random steering or format-only prompts) to confirm the effect is attributable to the identified competition rather than other factors.
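
The random-direction control requested in the second major comment can be sketched as follows: draw a Gaussian direction, rescale it to the L2 norm of the steering vector, and apply it at the same layer. The 64-dimensional vector, the fixed seed, and the helper names are assumptions made for illustration. The manuscript's actual control procedure is not described in the text available here.

```python
import math
import random


def norm(v):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def random_control(steer, rng):
    """Random Gaussian direction rescaled to the same L2 norm as `steer`."""
    g = [rng.gauss(0.0, 1.0) for _ in steer]
    scale = norm(steer) / norm(g)
    return [x * scale for x in g]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


rng = random.Random(0)                   # fixed seed for reproducibility
steer = [2.0, -2.0, 0.0] + [0.0] * 61    # hypothetical 64-d steering vector
control = random_control(steer, rng)

print(round(norm(steer), 6), round(norm(control), 6))
print(round(cosine(steer, control), 3))
```

Matching the norm matters: if the control vector were smaller, a weak effect could reflect the size of the perturbation and say nothing about its direction. In high-dimensional activation spaces a random direction is nearly orthogonal to any fixed vector, which is what makes it a useful test of specificity.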

minor comments (2)

[Abstract] The term 'Format-Reliability Gap' is introduced in the abstract without a concise formal definition or operationalization; adding a one-sentence characterization (e.g., the discrepancy between direct identification accuracy and generation-time behavior) would improve clarity for readers unfamiliar with the framing.
[Results] Ensure that tables or figures reporting per-model and per-vulnerability results include error bars, exact prompt templates, and the precise layer indices targeted by the steering vectors to support reproducibility.
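
One way to operationalize the definition suggested in the first minor comment is to subtract the secure-generation rate from the direct-identification accuracy. This is an assumed formalization that follows the referee's wording, not a definition taken from the paper, and the boolean outcomes below are made up.

```python
def rate(flags):
    """Fraction of True values in a list of booleans."""
    return sum(flags) / len(flags)


def format_reliability_gap(identified, generated_secure):
    """Identification accuracy minus secure-generation rate (hypothetical metric)."""
    return rate(identified) - rate(generated_secure)


# Made-up outcomes for four prompts targeting the same vulnerability.
identified = [True, True, True, True]            # model flags the bug when asked directly
generated_secure = [True, False, False, False]   # model writes secure code when generating

gap = format_reliability_gap(identified, generated_secure)
print(gap)  # 0.75
```

A gap near zero would mean generation behavior tracks what the model can recognize. A large positive gap is the pattern the paper describes.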

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas where additional controls and reporting would strengthen the manuscript. We address each major comment below and have revised the manuscript to incorporate the requested evidence and clarifications.

Referee: [Abstract / Mechanistic Analysis] Abstract and mechanistic analysis section: the claim that security representations 'remain computationally inert until the final layer' where format-compliance demands compete with them is load-bearing for the 'surgical' nature of the fix, yet the provided description relies on activation differences and post-hoc steering success without reporting interventional tests (e.g., activation patching or circuit ablation) on earlier layers to establish that inertness is causal rather than correlational. Alternative explanations such as general token-distribution suppression remain viable without those controls.

Authors: We agree that interventional evidence is necessary to move from correlational activation differences to a causal claim of layer-specific inertness. In the revised manuscript we report new activation-patching experiments in which security-related activations from earlier layers are substituted into the forward pass; these interventions produce negligible shifts in vulnerability rates (under 5% change), whereas the same patching at the final layer reproduces the full steering effect. We also include a brief circuit-ablation analysis confirming that disrupting the final-layer security direction, but not earlier-layer directions, alters the output distribution in the predicted manner. These additions directly address the concern about alternative explanations such as non-specific suppression. (Revision: yes.)

Referee: [Intervention Experiments] Steering vector construction: the vectors are built from the same models' internal activations used to identify the Format-Reliability Gap, creating a potential circular dependence. While generalization to five distinct models provides partial grounding, the manuscript should include explicit ablations (e.g., vectors derived from unrelated activations or random directions) to demonstrate that the 74% reduction is not an artifact of how vulnerability prompts were constructed or of non-specific final-layer modulation.

Authors: We acknowledge the risk of circularity. The revised manuscript now contains explicit ablation results using (i) random unit vectors at the same layer and (ii) steering vectors derived from unrelated activation contrasts (e.g., prompt-length or stylistic differences). Both classes of control vectors produce reductions below 8% on average, statistically indistinguishable from the no-steering baseline, while the vulnerability-specific vectors retain the reported 74% peak reduction. These controls are reported for all five models and are accompanied by a new figure showing the distribution of effects across vector types. (Revision: yes.)

Referee: [Results] Experimental reporting: the 74% reduction figure and claims of negligible overhead require specification of the exact evaluation protocol, including whether vulnerability/layer selection was post-hoc, the number of samples per condition, statistical tests, and direct comparisons to baseline interventions (e.g., random steering or format-only prompts) to confirm the effect is attributable to the identified competition rather than other factors.

Authors: We have substantially expanded the experimental reporting. The revised results section now states that all numbers are computed on a held-out test set after layer and vulnerability selection on a separate validation split (thus avoiding post-hoc selection on the reported data); each condition uses 200 generations per model-vulnerability pair; we report means with standard errors and paired t-tests (all main effects p < 0.01 after correction); and we directly compare against random steering, format-only prompting, and temperature-only baselines, none of which exceed 12% reduction. Inference overhead is quantified as <1% additional wall-clock time. These details are presented in a new table and accompanying text. (Revision: yes.)
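
The statistics named in this response (a reduction figure, standard errors, paired t-tests) can be sketched with the standard library alone. The per-pair insecure rates below are invented, and reading the 74% figure as a relative reduction is an assumption, since the abstract does not say whether it is relative or absolute.

```python
import math
import statistics


def relative_reduction(baseline_rate, steered_rate):
    """Fractional drop in the insecure rate relative to the baseline."""
    return (baseline_rate - steered_rate) / baseline_rate


def paired_t(before, after):
    """Paired t statistic and standard error of the mean difference."""
    diffs = [b - a for b, a in zip(before, after)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err, std_err


# Made-up insecure rates for four model-vulnerability pairs.
baseline = [0.50, 0.60, 0.40, 0.70]
steered = [0.20, 0.30, 0.20, 0.30]

t_stat, std_err = paired_t(baseline, steered)
print(round(relative_reduction(0.50, 0.13), 2))  # 0.74 under the relative reading
print(round(t_stat, 3), round(std_err, 4))
```

The t statistic has `len(baseline) - 1` degrees of freedom. Turning it into a p-value, and applying the multiple-comparison correction the response mentions, would need a t-distribution table or a statistics library, both omitted here.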

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via independent mechanistic and generalization evidence

The abstract and provided context present the Format-Reliability Gap as identified through direct prompting (correct identification when asked) and mechanistic analysis of layer-wise activations, with steering vectors then applied as an intervention. No equations or steps are quoted that reduce a claimed prediction or first-principles result to its own inputs by construction. Generalization across five models, three architectures, and six vulnerability types supplies external grounding independent of any single fit. Steering vector construction from identified activations does not equate to renaming a fitted parameter as a prediction, as the success metric (reduction in insecure generation) is measured separately and the paper does not rely on self-citation chains for its core claims. This is the common honest non-finding for papers whose central results rest on empirical interventions rather than definitional closure.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the mechanistic interpretation that security information is present but inert until the final layer and that steering vectors provide a causal fix. The abstract provides no details on how vectors are derived, so the number and nature of any fitted parameters remain unknown.

free parameters (1)

steering vector parameters
Direction and scaling of per-vulnerability steering vectors are derived from model activations and likely involve selection or optimization steps.

axioms (2)

domain assumption: Models correctly identify and explain vulnerabilities when queried directly
Stated as the basis for claiming the gap is not a knowledge deficit.
domain assumption: Security representations remain computationally inert until the final layer
Core mechanistic finding invoked to justify layer-specific intervention.

invented entities (1)

Format-Reliability Gap (no independent evidence)
purpose: To name and frame the observed discrepancy between identification capability and generation behavior
New conceptual label introduced in the paper.

pith-pipeline@v0.9.0 · 5423 in / 1578 out tokens · 91104 ms · 2026-05-10T07:47:24.600479+00:00 · methodology

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

Emergent Misalignment : Narrow finetuning can produce broadly misaligned LLMs

URLhttps://arxiv.org/abs/2303.08112. Jan Betley, Niels Warncke, Anna Sztyber-Betley, Daniel Tan, Xuchan Bao, Martín Soto, Megha Srivastava, Nathan Labenz, and Owain Evans. Training large language models on narrow tasks can lead to broad misalignment.Nature, 649(8097):584–589, 2026. doi: 10.1038/s41586-025-09937-5. Collin Burns, Haotian Ye, Dan Klein, and ...

work page doi:10.1038/s41586-025-09937-5 2026
[2]

%s", var) User-controlledformat string CWE-89 Python SQL injectionf

URL https://transformer-circuits.pub/2024/scaling-monosemanticity/ index.html. Scott Thornton. Can adversarial code comments fool ai security reviewers – large-scale empirical study of comment-based attacks and defenses against llm code analysis, 2026. URL https: //arxiv.org/abs/2602.16741. Celestin Tony, Matheus Mutas, Nicolas Ferle, and Gul Calikli. Pro...

work page arXiv 2024
[3]

syntactic-without-semantic

show that imposing structural constraints degrades LLM reasoning performance by 10–15%, with degradation scaling with constraint strictness, consistent with format directives imposing a computational cost that crowds out security reasoning. In the same model family, Sandoval [2025] traced a decimal comparison failure in Llama-3.1-8B-Instruct to Layer 25 a...

work page 2025
[4]

misalignment direction

developed improved model organisms for emergent misalignment, showing that the effect is robust across model families (including models as small as 0.5B parameters), can be induced with a single rank-1 LoRA adapter on MLP down-projections, and exhibits a sharp phase transition during training. Soligo et al. [2025] found that different emergently misaligne...

work page 2025
