Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap
Pith reviewed 2026-05-22 11:33 UTC · model grok-4.3
The pith
Fine-tuned LLMs for vulnerability detection rely on superficial code differences rather than grasping root causes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that decoder-only LLMs after vanilla supervised fine-tuning exhibit the Semantic Trap, defined by pairing-sensitive performance that drops when vulnerable code is shown with its actual patch instead of unrelated normal code, by decisions that track the textual gap between samples, and by fragility when the code undergoes semantic-preserving perturbations. Adding explicit reasoning during fine-tuning lessens these symptoms but reduces recall, with the reduced gap dependency partly due to a floor effect; a taxonomy of reasoning failures further shows persistent errors in control-flow interpretation and API behavior hallucination.
What carries the argument
The Semantic Trap, a failure mode identified through TrapEval, an evaluation framework that uses paired datasets, gap analysis, and perturbation tests to measure whether fine-tuned models internalize vulnerability root causes or exploit surface textual patterns.
If this is right
- Vanilla fine-tuned models will produce high false-positive rates when vulnerable code is tested against its actual patched version rather than unrelated normal code.
- Adding explicit reasoning during training reduces sensitivity to textual gaps but lowers overall recall of real vulnerabilities.
- Models continue to misread control flow and invent incorrect API behaviors even after reasoning supervision.
- Scores on unpaired data overestimate true understanding of vulnerabilities under standard fine-tuning.
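The pairing-sensitivity implication can be made concrete with a small sketch. The code below is an illustration only, not the paper's evaluation code: the `Prediction` record, the toy predictions, and the function names are all assumptions introduced here. It compares the false-positive rate when the negative samples are patches of the vulnerable code (V2P-style) against the rate when they are unrelated normal code (V2N-style).

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical record: ground-truth label plus detector output."""
    is_vulnerable: bool  # ground truth
    flagged: bool        # True if the detector reported a vulnerability


def false_positive_rate(preds):
    """Share of non-vulnerable samples that the detector flags."""
    negatives = [p for p in preds if not p.is_vulnerable]
    if not negatives:
        raise ValueError("no non-vulnerable samples to score")
    return sum(p.flagged for p in negatives) / len(negatives)


def recall(preds):
    """Share of vulnerable samples that the detector flags."""
    positives = [p for p in preds if p.is_vulnerable]
    if not positives:
        raise ValueError("no vulnerable samples to score")
    return sum(p.flagged for p in positives) / len(positives)


# Toy data (invented): negatives in v2n are unrelated normal code,
# negatives in v2p are the patched versions of the vulnerable samples.
v2n = [Prediction(True, True), Prediction(True, True),
       Prediction(False, False), Prediction(False, False)]
v2p = [Prediction(True, True), Prediction(True, True),
       Prediction(False, True), Prediction(False, False)]

fpr_gap = false_positive_rate(v2p) - false_positive_rate(v2n)
print("recall V2N:", recall(v2n), "recall V2P:", recall(v2p))
print("FPR V2N:", false_positive_rate(v2n), "FPR V2P:", false_positive_rate(v2p))
print("FPR gap (V2P - V2N):", fpr_gap)
```

In this toy setup recall is identical on both pairings, so the unpaired score looks fine while the paired false-positive rate exposes the difference.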
Where Pith is reading between the lines
- Security tools built on these models may need extra checks for gap dependency to avoid deploying brittle detectors.
- Future training could target invariance to meaning-preserving code edits to reduce reliance on surface features.
- The same pattern of pairing sensitivity may affect other code-related tasks where LLMs are fine-tuned on before-and-after examples.
Load-bearing premise
That performance gaps between V2P and V2N datasets, combined with CodeBLEU gap analysis and responses to semantic-preserving changes, can separate genuine learning of vulnerability causes from surface-level text pattern matching.
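A gap analysis of this kind can be sketched as follows. This is not CodeBLEU: as a loudly labeled stand-in, the sketch uses the character-level similarity ratio from Python's standard `difflib`, and the threshold, the bucket names, and the toy pairs are assumptions made for illustration. The idea is only to show the shape of the test: if correctness on a vulnerable/patched pair rises with the textual gap between the two versions, decisions are tracking the gap.

```python
from difflib import SequenceMatcher


def textual_gap(vulnerable: str, patched: str) -> float:
    """1 - similarity ratio. A crude stand-in for a CodeBLEU-based gap."""
    return 1.0 - SequenceMatcher(None, vulnerable, patched).ratio()


def accuracy_by_gap(pairs, threshold):
    """Split pairs at a gap threshold and report pair-level accuracy per bucket.

    pairs: iterable of (vulnerable_code, patched_code, pair_correct), where
    pair_correct is True when the model flagged the vulnerable version and
    cleared the patched one.
    """
    buckets = {"small_gap": [], "large_gap": []}
    for vulnerable, patched, correct in pairs:
        key = "small_gap" if textual_gap(vulnerable, patched) < threshold else "large_gap"
        buckets[key].append(correct)
    # None marks an empty bucket rather than inventing a score for it.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


# Toy pairs (invented): a one-token fix and a larger rewrite.
toy_pairs = [
    ("if (i <= n) buf[i] = 0;", "if (i < n) buf[i] = 0;", False),
    ("strcpy(dst, src);",
     "size_t len = strnlen(src, cap - 1); memcpy(dst, src, len); dst[len] = 0;",
     True),
]
print(accuracy_by_gap(toy_pairs, threshold=0.5))
```

A real analysis would replace `textual_gap` with the CodeBLEU-based measure the paper describes and use many more pairs per bucket.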
What would settle it
Models that keep high accuracy and low false-positive rates on V2P data even after semantic-preserving code edits, and show no performance difference between V2P and V2N pairings, would indicate escape from the Semantic Trap.
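One way to read this criterion is as a simple decision rule over four measured quantities. The sketch below is an assumption-laden illustration: the function name and every threshold are hypothetical, since the source states the criterion only qualitatively ("high accuracy", "low false-positive rates", "no performance difference").

```python
def escapes_semantic_trap(
    v2p_acc: float,
    v2n_acc: float,
    v2p_fpr: float,
    perturbed_v2p_acc: float,
    *,
    min_accuracy: float = 0.80,           # hypothetical bar for "high accuracy"
    max_fpr: float = 0.10,                # hypothetical bar for "low FPR"
    max_pairing_gap: float = 0.02,        # tolerated |V2N - V2P| accuracy difference
    max_perturbation_drop: float = 0.02,  # tolerated accuracy loss after edits
) -> bool:
    """Return True only if all three symptoms are absent under the thresholds."""
    pairing_ok = abs(v2n_acc - v2p_acc) <= max_pairing_gap
    accuracy_ok = perturbed_v2p_acc >= min_accuracy and v2p_fpr <= max_fpr
    robust_ok = (v2p_acc - perturbed_v2p_acc) <= max_perturbation_drop
    return pairing_ok and accuracy_ok and robust_ok


# Invented numbers for two imaginary models.
print(escapes_semantic_trap(v2p_acc=0.90, v2n_acc=0.91, v2p_fpr=0.05, perturbed_v2p_acc=0.89))
print(escapes_semantic_trap(v2p_acc=0.60, v2n_acc=0.95, v2p_fpr=0.40, perturbed_v2p_acc=0.50))
```

Fixed thresholds are a simplification; in practice the tolerances would more plausibly come from confidence intervals on each metric.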
Original abstract
Large Language Models (LLMs) have shown promising performance in software vulnerability detection, particularly after domain-specific Supervised Fine-Tuning (SFT). However, it remains unclear whether these models genuinely internalize vulnerability root causes or merely exploit surface-level functional patterns. While prior work documented related failures on pre-trained or zero-shot models, the SFT process itself, and how explicit reasoning supervision modulates it, remains under-explored. We study fine-tuned decoder-only LLMs under vanilla SFT and SFT with reasoning supervision, identifying a failure mode we term the Semantic Trap, characterized by three symptoms: pairing-sensitive performance, gap-dictated decisions, and fragility to semantic-preserving changes. To probe this, we propose TrapEval, an evaluation framework comprising two real-world datasets, V2P (vulnerable paired with patched code) and V2N (vulnerable paired with unrelated normal code), alongside semantic perturbations, CodeBLEU-based gap analysis, and an LLM-assisted reasoning failure taxonomy. Evaluating five representative LLMs fine-tuned with and without explicit reasoning (Chain-of-Thought), our results show vanilla SFT yields deceptively high scores on unpaired data (V2N) while failing all three symptoms. Models suffer high false-positive rates on V2P, degrade under perturbations, and exhibit a systematic dependency on the textual gap between vulnerable and patched code. Finetuning with explicit reasoning reduces these symptoms but costs recall; its lack of measurable gap-dependency partly reflects a floor effect rather than escaping the trap. Furthermore, our taxonomy reveals these models still misinterpret control flow and hallucinate API behavior, indicating current fine-tuning mitigates but does not eliminate reliance on surface features.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that fine-tuned decoder-only LLMs for software vulnerability detection do not genuinely internalize root causes but instead fall into a 'Semantic Trap' characterized by three symptoms: pairing-sensitive performance, gap-dictated decisions, and fragility to semantic-preserving changes. It introduces the TrapEval framework with V2P (vulnerable-patched pairs) and V2N (vulnerable-unrelated normal pairs) datasets, CodeBLEU gap analysis, semantic perturbations, and an LLM-assisted reasoning failure taxonomy. Evaluations across five LLMs show vanilla SFT yields deceptively high scores on V2N while failing all symptoms, with high false positives on V2P and degradation under perturbations; explicit reasoning supervision (Chain-of-Thought) reduces gap dependency and some symptoms but at the cost of recall, and the taxonomy reveals persistent misinterpretations of control flow and API behavior.
Significance. If the central findings hold after addressing verification gaps, the work is significant for AI security research by showing that vanilla SFT on vulnerability detection tasks encourages surface-pattern exploitation rather than root-cause understanding, while reasoning augmentation offers partial mitigation. Strengths include the construction of paired real-world datasets, the multi-model empirical evaluation, the introduction of a diagnostic framework with CodeBLEU metrics, and the provision of a failure taxonomy that highlights specific reasoning deficits.
major comments (1)
- §4 (TrapEval Framework, Semantic Perturbations subsection): The fragility-to-perturbations symptom is load-bearing for the Semantic Trap diagnosis and the claim that models exploit surface features. The manuscript assumes perturbations (variable renaming, equivalent restructuring) preserve both the vulnerability label and root cause, yet reports no expert review, static-analysis verification, or label-consistency checks on the perturbed samples. Without this, observed performance drops could reflect correct sensitivity to altered semantics or exploitability rather than surface-cue dependence, directly weakening support for the three-symptom characterization.
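To illustrate what a variable-renaming perturbation and a minimal consistency check might look like, here is a hedged sketch. It is not the paper's perturbation tooling: the function names, the regex-based approach, and the C snippet are assumptions. The renaming is deliberately naive (it would also rewrite matches inside string literals and comments), and the round-trip check only confirms that the edit is a pure renaming. It does not establish that the vulnerability label or root cause is preserved, which is the kind of evidence the referee asks for.

```python
import re


def rename_identifiers(code: str, mapping: dict[str, str]) -> str:
    """Rename whole-word identifiers. Naive: also touches strings and comments."""
    if not mapping:
        return code
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], code)


def is_pure_renaming(original: str, perturbed: str, mapping: dict[str, str]) -> bool:
    """Round-trip check: undoing the renaming must restore the original text."""
    existing = set(re.findall(r"[A-Za-z_]\w*", original))
    new_names = list(mapping.values())
    # New names must be fresh and distinct, otherwise the renaming could
    # merge two variables and silently change the program.
    if any(name in existing for name in new_names) or len(set(new_names)) != len(new_names):
        return False
    inverse = {new: old for old, new in mapping.items()}
    return rename_identifiers(perturbed, inverse) == original


original = (
    "void copy(char *dst, const char *src) {\n"
    "    char buf[16];\n"
    "    strcpy(buf, src);\n"
    "}\n"
)
mapping = {"buf": "tmp0", "src": "in0"}  # hypothetical fresh names
perturbed = rename_identifiers(original, mapping)
print(perturbed)
print("pure renaming:", is_pure_renaming(original, perturbed, mapping))
```

A production pipeline would more plausibly rename through a parser or compiler front end so that scoping, macros, and string literals are handled correctly.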
minor comments (2)
- Abstract and §5 (Results): While the high-level outcomes are clear, the absence of reported dataset sizes, sample counts per split, exclusion criteria, or error bars makes it harder to gauge the scale and reliability of the V2P/V2N performance gaps and the floor-effect explanation for reasoning models.
- §6 (Reasoning Failure Taxonomy): The taxonomy is a useful contribution, but adding quantitative breakdowns (e.g., percentage of failures per category across vanilla vs. reasoning SFT) would strengthen the link between observed symptoms and specific misinterpretations such as control-flow errors.
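The quantitative breakdown requested in the last comment could be as simple as the sketch below. The category labels echo the two failure types named in the review (control-flow misinterpretation and API behavior hallucination); the `other` label, the counts, and the function name are invented for illustration and are not results from the paper.

```python
from collections import Counter


def category_shares(failure_labels):
    """Percentage of failures per taxonomy category."""
    counts = Counter(failure_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Invented labels for two imaginary training regimes.
vanilla_sft = ["control_flow"] * 2 + ["api_hallucination"] + ["other"]
reasoning_sft = ["control_flow"] + ["api_hallucination"] * 3

for name, labels in [("vanilla SFT", vanilla_sft), ("reasoning SFT", reasoning_sft)]:
    print(name, category_shares(labels))
```

Reporting the raw counts alongside the percentages would also address the earlier comment about missing sample sizes.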
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and commit to revisions that will strengthen the empirical support for the fragility-to-perturbations symptom.
- Referee: The fragility-to-perturbations symptom is load-bearing for the Semantic Trap diagnosis and the claim that models exploit surface features. The manuscript assumes perturbations (variable renaming, equivalent restructuring) preserve both the vulnerability label and root cause, yet reports no expert review, static-analysis verification, or label-consistency checks on the perturbed samples. Without this, observed performance drops could reflect correct sensitivity to altered semantics or exploitability rather than surface-cue dependence, directly weakening support for the three-symptom characterization.
  Authors: We agree that explicit verification is necessary to substantiate that performance drops under perturbation reflect surface-cue dependence rather than unintended semantic alteration. The perturbations were generated via standard label-preserving transformations (consistent variable renaming and equivalent restructuring that preserve control-flow and data-flow properties relevant to the vulnerability). However, the original manuscript did not detail verification procedures. In the revised version we will add a dedicated paragraph describing (1) static-analysis confirmation of label consistency on all perturbed samples and (2) manual expert review of a random 10% subset confirming that root-cause semantics remain unchanged. These additions will directly address the concern and reinforce the three-symptom characterization.
  Revision: yes
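The random 10% expert-review subset mentioned in the response can be drawn reproducibly, as in the sketch below. The 10% fraction comes from the rebuttal; the function name, the fixed seed, and the sample identifiers are assumptions added here so that the selection can be re-derived by a reader.

```python
import random


def review_subset(sample_ids, fraction=0.10, seed=0):
    """Draw a reproducible random subset of perturbed samples for manual review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(sample_ids)
    if not ids:
        return []
    k = max(1, round(fraction * len(ids)))  # review at least one sample
    rng = random.Random(seed)               # local RNG, global state untouched
    return sorted(rng.sample(ids, k))


# Hypothetical identifiers for 200 perturbed samples.
perturbed_ids = [f"sample_{i:03d}" for i in range(200)]
subset = review_subset(perturbed_ids)
print(len(subset), "samples selected for expert review")
```

Publishing the seed and the resulting identifier list lets others confirm that the reviewed subset was not chosen after seeing model outputs.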
Circularity Check
No significant circularity in empirical evaluation study
This is an empirical investigation that constructs new datasets (V2P, V2N), applies semantic perturbations, computes CodeBLEU gaps, and reports performance metrics plus an LLM-assisted taxonomy across five models. No equations, derivations, or fitted parameters are present that reduce to the inputs by construction. The three symptoms are diagnosed via direct measurement on held-out data rather than self-definition or self-citation chains. The framework is self-contained against external benchmarks (CodeBLEU, perturbation effects) and does not rely on load-bearing prior results from the same authors. Any concerns about label drift under perturbation are validity issues, not circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Performance on paired vulnerable-patched code and semantic perturbations distinguishes root-cause understanding from surface pattern exploitation.
invented entities (1)
- Semantic Trap: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We propose TrapEval... V2P dataset (Vulnerable-to-Patch pairs) and V2N dataset... semantic-preserving perturbations... CodeBLEU-based gap analysis
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: fragility to semantic-preserving changes... gap-dictated decisions
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.