
arXiv: 2606.20502 · v1 · pith:IRMKD3UOnew · submitted 2026-06-18 · 💻 cs.CR · cs.AI · cs.SE

Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

Pith reviewed 2026-06-26 16:57 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.SE
keywords LLM vulnerability detection · fine-tuning limits · systems software security · data contamination · CWE classification · directional failure index · security reasoning · temporal split evaluation

The pith

Fine-tuning LLMs for vulnerability detection only calibrates output thresholds without instilling security reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs that score well on vulnerability benchmarks genuinely reason about security or merely match patterns from contaminated data. It introduces the CWE-Trace framework built on 834 manually curated Linux kernel samples across 74 CWEs, using a strict temporal split between historical and post-cutoff data plus two new metrics to track failure directions. Evaluation of eight base models and fifteen LoRA variants shows data contamination gives no advantage, while fine-tuning consistently fails to alter stable, systematic failure modes that persist across time periods. The result is that detection and comprehension remain decoupled, with peak binary detection at 52.1 percent and exact CWE ranking below 1.3 percent top-1 accuracy.

Core claim

Backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction. Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent. The weakest backbone at binary detection gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities.
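
The threshold-shift claim can be illustrated with a toy sketch. The scores, labels, and thresholds below are invented for illustration and are not taken from the paper. The sketch only shows that moving a decision threshold over a fixed score ranking changes which kind of error dominates, while a threshold-free measure of how well vulnerable and patched samples are separated stays the same.

```python
# Toy illustration with made-up scores; none of these numbers come from the paper.
# Each entry is (model score for "vulnerable", true label: 1 = vulnerable, 0 = patched).
samples = [(0.62, 1), (0.58, 0), (0.55, 1), (0.51, 0), (0.47, 1), (0.44, 0)]


def error_rates(samples, threshold):
    """Return (false_positive_rate, false_negative_rate) at a given threshold."""
    fp = sum(1 for s, y in samples if s >= threshold and y == 0)
    fn = sum(1 for s, y in samples if s < threshold and y == 1)
    negatives = sum(1 for _, y in samples if y == 0)
    positives = sum(1 for _, y in samples if y == 1)
    return fp / negatives, fn / positives


def pairwise_auc(samples):
    """Fraction of (vulnerable, patched) pairs ranked correctly; ignores any threshold."""
    pos = [s for s, y in samples if y == 1]
    neg = [s for s, y in samples if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Moving the threshold trades false positives for false negatives...
for threshold in (0.40, 0.50, 0.60):
    fpr, fnr = error_rates(samples, threshold)
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  FNR={fnr:.2f}")

# ...but the ranking quality is a property of the scores alone.
print(f"ranking quality (AUC) = {pairwise_auc(samples):.2f}")
```

In this toy setting, a model whose fine-tuning only moved the threshold would show a different error balance but the same ranking quality, which is one way to read "calibration without comprehension".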

What carries the argument

The Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD) metrics, which quantify persistent directional failure modes that fine-tuning cannot correct.
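
This page does not reproduce the paper's formula for DFI. As a purely hypothetical reading, the sketch below treats a directional index as the false-positive rate minus the false-negative rate, expressed in percentage points. The function name, the sign convention, and the definition itself are assumptions made for illustration; the paper's actual DFI may be defined differently. The reported range of -85.5 to +94.8 pp would fit inside the [-100, +100] range of this reading, which is compatible with it but does not confirm it.

```python
def directional_failure_index(y_true, y_pred):
    """Hypothetical directional index, NOT the paper's definition.

    Returns FPR - FNR in percentage points:
      +100 -> every patched sample is flagged (over-flagging),
      -100 -> every vulnerable sample is missed (under-flagging),
         0 -> errors are balanced between the two directions (or absent).
    Labels: 1 = vulnerable, 0 = patched.
    """
    y_true = list(y_true)
    y_pred = list(y_pred)
    negatives = y_true.count(0)
    positives = y_true.count(1)
    if negatives == 0 or positives == 0:
        raise ValueError("need at least one sample of each class")
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 100.0 * (fp / negatives - fn / positives)


# Invented labels for a balanced set of two vulnerable/patched pairs.
labels = [1, 0, 1, 0]
print(directional_failure_index(labels, [1, 1, 1, 1]))  # always says "vulnerable"
print(directional_failure_index(labels, [0, 0, 0, 0]))  # always says "safe"
print(directional_failure_index(labels, [1, 0, 1, 0]))  # no errors
```

Under this reading, accuracy and direction are separate quantities: the two constant predictors both score 50 percent accuracy on a balanced set yet sit at opposite ends of the index.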

If this is right

  • Data contamination provides no measurable advantage for vulnerability detection tasks.
  • Detection and understanding of vulnerabilities remain decoupled capabilities in current LLMs.
  • The best detection score reaches only 52.1 percent, just 2.1 pp above chance.
  • Exact CWE ranking stays below 1.3 percent Top-1 accuracy across all tested models and strategies.
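
The ranking numbers above rest on standard quantities. The sketch below shows how Top-1 accuracy and mean reciprocal rank (MRR) are conventionally computed from ranked CWE predictions, and how the "+2.1 pp above chance" margin follows from a 50 percent baseline. The prediction lists and ground-truth labels are invented for illustration; only the 52.1 percent figure comes from the abstract, and the paper's exact evaluation code may differ.

```python
def top1_accuracy(ranked_predictions, truths):
    """Share of samples whose first-ranked CWE equals the ground truth."""
    hits = sum(
        1 for ranks, truth in zip(ranked_predictions, truths)
        if ranks and ranks[0] == truth
    )
    return hits / len(truths)


def mean_reciprocal_rank(ranked_predictions, truths):
    """Average of 1/rank of the ground truth; a miss contributes 0."""
    total = 0.0
    for ranks, truth in zip(ranked_predictions, truths):
        if truth in ranks:
            total += 1.0 / (ranks.index(truth) + 1)
    return total / len(truths)


# Invented ranked outputs for three samples (hypothetical, not paper data).
ranked = [
    ["CWE-416", "CWE-476"],  # truth at rank 1
    ["CWE-787", "CWE-125"],  # truth at rank 2
    ["CWE-476", "CWE-416"],  # truth not in the list
]
truths = ["CWE-416", "CWE-125", "CWE-190"]

print(f"Top-1 = {top1_accuracy(ranked, truths):.3f}")
print(f"MRR   = {mean_reciprocal_rank(ranked, truths):.3f}")

# Margin over chance for the best detection score reported in the abstract,
# assuming the 50% baseline implied by "+2.1 pp above chance".
margin_pp = round(52.1 - 50.0, 1)
print(f"margin over chance = {margin_pp} pp")
```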

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the result generalizes, security applications may require new model architectures capable of acquiring domain reasoning rather than relying on existing priors.
  • The same calibration-without-comprehension pattern could limit LLM performance on related tasks such as automated code repair or formal verification.
  • Applying the DFI metric to non-security code tasks could reveal whether persistent directional failures are specific to vulnerability detection or more general.

Load-bearing premise

The manually curated 834-sample dataset with its temporal split and the DFI/HDD metrics provide a valid, leakage-free test that distinguishes absence of security reasoning from other model or data limitations.
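
A minimal sketch of a pair-preserving temporal split, under stated assumptions: the record fields (`pair_id`, `fix_date`), the sample records, and the exact cutoff date are hypothetical. The abstract only says the historical set is "pre-2025"; reading that as "strictly before 2025-01-01" is an assumption. Splitting at the level of vulnerable/patched pairs keeps both halves of a pair on the same side of the cutoff.

```python
from datetime import date

# Assumption: "pre-2025" is read as strictly before this date.
CUTOFF = date(2025, 1, 1)


def temporal_split(pairs, cutoff=CUTOFF):
    """Split vulnerable/patched pairs into (historical, post_cutoff) by date.

    Each pair is kept whole, so a vulnerable function and its patched
    counterpart never end up on opposite sides of the cutoff.
    """
    historical, post_cutoff = [], []
    for pair in pairs:
        if pair["fix_date"] < cutoff:
            historical.append(pair)
        else:
            post_cutoff.append(pair)
    return historical, post_cutoff


# Hypothetical records; the identifiers and dates are made up.
pairs = [
    {"pair_id": "p1", "fix_date": date(2023, 5, 2)},
    {"pair_id": "p2", "fix_date": date(2025, 3, 14)},
    {"pair_id": "p3", "fix_date": date(2024, 12, 31)},
]

historical, post_cutoff = temporal_split(pairs)
print([p["pair_id"] for p in historical])
print([p["pair_id"] for p in post_cutoff])
```

Whether such a split is leakage-free in practice also depends on each model's training cutoff, which this sketch does not model.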

What would settle it

A fine-tuning run on post-cutoff data that produces DFI values near zero or reverses the sign of existing failure directions on held-out samples would falsify the claim that priors dominate and remain uncorrectable.

Figures

Figures reproduced from arXiv: 2606.20502 by Arastoo Zibaeirad, Marco Vieira.

Figure 1: A unified diff of CVE-2024-41010. The vulnerability (Use-After …
Figure 2: Overview of the CWE-Trace framework. The pipeline begins with the extraction and manual pairing of 834 Linux kernel samples (417 pairs) (§III-A), …
Figure 3: RQ1 mechanism analysis. Panel (a): fine-tuning vs. backbone detection …
Figure 4: Fine-tuning ∆ Root-Micro@1 for coarse CWE-1000 / root-taxonomy classification (relative to vanilla backbone). Each lollipop represents the change for a single fine-tuning dataset; filled circles = PBD, open circles = LFD. Red zone: fine-tuning hurts; green zone: fine-tuning helps.
Figure 5: RQ3 semantic-depth analysis. Panel (a): per-model MRR ranking on PBD and LFD. Panel (b): MRR …
Original abstract

Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable--patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD). We evaluate eight vanilla LLMs and 15 LoRA fine-tuned variants across non-targeted detection, targeted detection, and CWE classification. Our analysis yields two key results. First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification. Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction. Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent. The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.
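
The function-level contamination check described in the abstract can be pictured with a small sketch. The paper's actual matching criteria are not given on this page, so everything here is an assumption: the whitespace normalization, the use of Python's `difflib.SequenceMatcher`, the 0.9 similarity threshold, and the helper name `has_memorization_signal` are all hypothetical stand-ins.

```python
import difflib


def normalize(code):
    """Collapse all whitespace runs so formatting differences do not matter."""
    return " ".join(code.split())


def has_memorization_signal(benchmark_function, training_functions, threshold=0.9):
    """Return True if any training function is near-identical to the benchmark one.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    target = normalize(benchmark_function)
    for candidate in training_functions:
        matcher = difflib.SequenceMatcher(
            None, target, normalize(candidate), autojunk=False
        )
        if matcher.ratio() >= threshold:
            return True
    return False


# Invented snippets for illustration.
benchmark = "int add(int a, int b) {\n    return a + b;\n}"
same_but_reformatted = "int add(int a, int b) { return a + b; }"
unrelated = "void reset(struct dev *d) { d->state = 0; }"

print(has_memorization_signal(benchmark, [same_but_reformatted]))
print(has_memorization_signal(benchmark, [unrelated]))
```

A sample that is nominally "contaminated" at the CVE level but whose vulnerable function fails such a check would be one candidate for the "no usable memorization signal" category.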

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces CWE-Trace, a framework built on 834 manually curated Linux kernel samples spanning 74 CWEs with a strict temporal split (pre-2025 historical vs. post-cutoff leakage-free sets) that preserves vulnerable-patched pairs. It evaluates eight vanilla LLMs and 15 LoRA variants on non-targeted detection, targeted detection, and CWE classification, introducing the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD) metrics. The key claims are that contamination yields no measurable advantage (84% of contaminated samples lack a usable memorization signal), that backbone directional priors dominate (DFI values ranging from -85.5 to +94.8 pp persist across splits and resist fine-tuning), and that fine-tuning only shifts output thresholds without altering the decision policy, yielding at best 52.1% detection (+2.1 pp above chance) and <1.3% Top-1 CWE accuracy.

Significance. If the DFI/HDD metrics are shown to isolate decision policy independent of task hardness, the temporal-split results and contamination controls would provide strong evidence that current LLMs lack reliable security reasoning for systems software. The manual curation, leakage-free split, and direct contamination measurement are concrete strengths supporting the empirical component.

major comments (3)
  1. [framework construction and metric introduction] The section introducing DFI and HDD (framework construction paragraph) defines these as new diagnostic metrics but reports no validation on control cases where security reasoning is known to exist (e.g., static analyzers or explicit CoT security traces). This is load-bearing for the central claim that |DFI| values indicate absent reasoning rather than general task difficulty, label imbalance, or curation artifacts.
  2. [results on fine-tuning effects] The abstract claim that 'fine-tuning shifts the output threshold without changing the decision policy' (supported by persistent DFI across historical and post-cutoff data) requires explicit before/after DFI comparisons with statistical tests in the results section; without these, it remains unclear whether observed stability reflects true policy invariance or metric sensitivity.
  3. [contamination analysis] Table or section reporting the 84% 'no usable memorization signal' finding for contaminated samples must detail the exact cross-mapping criteria and CWE misclassification detection method used at function level; these choices directly affect the conclusion that contamination provides no advantage.
minor comments (1)
  1. Provide explicit formulas for DFI and HDD (including any normalization or hierarchical weighting) in a dedicated subsection or appendix to support reproducibility.
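
To make the request for explicit formulas concrete, here is one hypothetical shape an HDD-style measure could take: the number of tree edges between the predicted and true CWE, plus a label saying whether the prediction was too general, too specific, or in a sibling branch. This is not the paper's definition. The parent map is a small hand-written, single-parent excerpt that should be checked against the MITRE CWE-1000 view (real CWE entries can have several parents), and the function and label names are invented for illustration.

```python
# Hand-written single-parent excerpt; verify against the MITRE CWE-1000 view.
PARENT = {
    "CWE-416": "CWE-825",
    "CWE-825": "CWE-672",
    "CWE-672": "CWE-666",
    "CWE-666": "CWE-664",
    "CWE-787": "CWE-119",
    "CWE-125": "CWE-119",
    "CWE-119": "CWE-118",
    "CWE-118": "CWE-664",
}


def path_to_root(node):
    """List of nodes from `node` up to the root of the excerpt."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def hierarchical_distance_and_direction(predicted, truth):
    """Hypothetical HDD-style result: (edge distance, direction label)."""
    if predicted == truth:
        return 0, "exact"
    pred_path, true_path = path_to_root(predicted), path_to_root(truth)
    if predicted in true_path:  # prediction is an ancestor of the truth
        return true_path.index(predicted), "too general"
    if truth in pred_path:  # prediction is a descendant of the truth
        return pred_path.index(truth), "too specific"
    common = next((n for n in pred_path if n in true_path), None)
    if common is None:
        return None, "unrelated"
    return pred_path.index(common) + true_path.index(common), "lateral"


print(hierarchical_distance_and_direction("CWE-672", "CWE-416"))
print(hierarchical_distance_and_direction("CWE-787", "CWE-125"))
```

A published definition would also need to state how multiple parents, normalization, and any depth weighting are handled, which is what the minor comment asks for.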

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating planned revisions where appropriate to strengthen the manuscript.

  1. Referee: [framework construction and metric introduction] The section introducing DFI and HDD (framework construction paragraph) defines these as new diagnostic metrics but reports no validation on control cases where security reasoning is known to exist (e.g., static analyzers or explicit CoT security traces). This is load-bearing for the central claim that |DFI| values indicate absent reasoning rather than general task difficulty, label imbalance, or curation artifacts.

    Authors: We agree that direct validation of DFI and HDD on control cases with established security reasoning (such as static analyzers) would provide stronger support for interpreting high |DFI| as evidence of absent reasoning rather than task artifacts. The metrics were designed to isolate directional output bias from accuracy, using the temporal split as an indirect control. In revision we will add a dedicated paragraph in the framework section discussing this limitation, including a brief qualitative comparison to expected DFI behavior under perfect reasoning, and note it explicitly as a direction for future work. revision: partial

  2. Referee: [results on fine-tuning effects] The abstract claim that 'fine-tuning shifts the output threshold without changing the decision policy' (supported by persistent DFI across historical and post-cutoff data) requires explicit before/after DFI comparisons with statistical tests in the results section; without these, it remains unclear whether observed stability reflects true policy invariance or metric sensitivity.

    Authors: The results already report DFI for both vanilla and LoRA-tuned models and note persistence across splits, but we acknowledge the absence of paired before/after comparisons and formal statistical tests. We will add a new table in the results section listing per-model DFI values pre- and post-fine-tuning, together with paired statistical tests (Wilcoxon signed-rank) to quantify stability. This will directly support the threshold-shift claim. revision: yes

  3. Referee: [contamination analysis] Table or section reporting the 84% 'no usable memorization signal' finding for contaminated samples must detail the exact cross-mapping criteria and CWE misclassification detection method used at function level; these choices directly affect the conclusion that contamination provides no advantage.

    Authors: We will expand the contamination analysis subsection to include the precise function-level cross-mapping rules (string similarity thresholds, AST-based matching, and handling of renamed functions) and the exact procedure for detecting CWE misclassifications (manual review protocol plus keyword/CWE-ID mismatch checks). A supplementary table will list the decision criteria and example mappings to ensure full reproducibility. revision: yes
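
The paired before/after comparison promised in response 2 could look roughly like the sketch below: an exact two-sided Wilcoxon signed-rank test over per-model DFI values before and after fine-tuning. The DFI values are invented for illustration, and the pure-Python implementation (exact enumeration of sign patterns, average ranks for ties, zero differences dropped) is one common textbook variant, not necessarily what the authors will use.

```python
from itertools import product


def wilcoxon_signed_rank(before, after):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.

    Zero differences are dropped; tied |differences| receive average ranks.
    Returns (w_plus, w_minus, p_value). Enumerates 2**n sign patterns, so it
    is only meant for a small number of pairs.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    observed = min(w_plus, w_minus)
    total = sum(ranks)

    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= observed:
            extreme += 1
    return w_plus, w_minus, extreme / 2 ** n


# Invented per-model DFI values (pp) before and after fine-tuning.
dfi_before = [62.0, -40.0, 15.0, 80.0, -71.0, 33.0]
dfi_after = [60.0, -43.0, 19.0, 79.0, -66.0, 39.0]

w_plus, w_minus, p_value = wilcoxon_signed_rank(dfi_before, dfi_after)
print(f"W+ = {w_plus}, W- = {w_minus}, p = {p_value:.4f}")
```

A large p-value would be consistent with no systematic shift in DFI, though with only a handful of backbones such a test has little power, so failing to reject the null is weak evidence of true policy invariance.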

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on held-out data


The paper is a purely empirical evaluation that introduces a curated dataset, temporal split, and two new diagnostic metrics (DFI and HDD) then reports measured performance numbers on eight base models and fifteen fine-tuned variants. No equations, fitted parameters, or derivations exist that reduce any reported outcome to a quantity defined by the inputs. Central claims rest on the observed DFI ranges and accuracy figures themselves rather than on any self-referential construction or self-citation chain. The study is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on the validity of the new DFI and HDD metrics and the assumption that manual curation plus the temporal split produce a leakage-free test of comprehension; no numerical free parameters are introduced beyond standard model training.

axioms (2)
  • Domain assumption: the 834 samples are accurately labeled for vulnerabilities and CWEs by manual curation.
    The benchmark quality and all downstream metrics rest on the correctness of human labeling.
  • Domain assumption: the temporal split (pre-2025 historical set vs. post-cutoff set) ensures no training-data leakage.
    Used to create the leakage-free evaluation set that supports the contamination and persistence conclusions.
invented entities (2)
  • Directional Failure Index (DFI): no independent evidence.
    Purpose: quantify systematic directional biases in model failure modes across detection tasks.
    Newly defined diagnostic metric introduced to measure persistence of failure patterns.
  • Hierarchical Distance and Direction (HDD): no independent evidence.
    Purpose: measure hierarchical aspects of CWE classification performance.
    Newly introduced metric for analyzing exact CWE ranking and distance from ground truth.


