arxiv: 2601.22655 · v3 · pith:Z2OXII2Qnew · submitted 2026-01-30 · 💻 cs.CR · cs.SE

Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap

Pith reviewed 2026-05-22 11:33 UTC · model grok-4.3

classification 💻 cs.CR cs.SE
keywords large language models · vulnerability detection · supervised fine-tuning · semantic understanding · code analysis · reasoning supervision · failure modes

The pith

Fine-tuned LLMs for vulnerability detection rely on superficial code differences rather than grasping root causes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether supervised fine-tuning makes large language models genuinely understand what creates software vulnerabilities or whether they mainly latch onto obvious textual cues in the code. It defines a Semantic Trap through three clear failure signs: models change their answers based on how vulnerable code is paired with other samples, they base decisions on how much the texts differ, and their accuracy drops when the code is altered without changing its meaning. This distinction matters because unreliable detectors could miss real security issues or flag safe code, especially in automated review tools. The authors compare standard fine-tuning against versions that add step-by-step reasoning, using datasets that pair vulnerable code with either its fix or unrelated normal code, plus targeted edits and gap measurements. Results indicate that plain fine-tuning produces misleadingly high scores on unpaired data while failing the three tests, and even reasoning supervision only partly eases the problem while lowering detection rates.

Core claim

The authors establish that decoder-only LLMs after vanilla supervised fine-tuning exhibit the Semantic Trap, defined by pairing-sensitive performance that drops when vulnerable code is shown with its actual patch instead of unrelated normal code, by decisions that track the textual gap between samples, and by fragility when the code undergoes semantic-preserving perturbations. Adding explicit reasoning during fine-tuning lessens these symptoms but reduces recall, with the reduced gap dependency partly due to a floor effect; a taxonomy of reasoning failures further shows persistent errors in control-flow interpretation and API behavior hallucination.
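The pairing-sensitivity symptom can be made concrete with a small calculation. The following is a minimal sketch, not the paper's evaluation code: it assumes each prediction is a (ground truth, model output) pair, and the names `Prediction`, `false_positive_rate`, and `pairing_gap` are hypothetical. The numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: int      # ground truth: 1 = vulnerable, 0 = not vulnerable
    predicted: int  # model output, same encoding


def false_positive_rate(preds):
    """Share of non-vulnerable samples that the model flags as vulnerable."""
    negatives = [p for p in preds if p.label == 0]
    if not negatives:
        return 0.0
    return sum(p.predicted == 1 for p in negatives) / len(negatives)


def pairing_gap(v2p_preds, v2n_preds):
    """FPR on patched negatives (V2P) minus FPR on unrelated negatives (V2N)."""
    return false_positive_rate(v2p_preds) - false_positive_rate(v2n_preds)


# Made-up predictions, for illustration only (not results from the paper).
v2p = [Prediction(1, 1), Prediction(0, 1), Prediction(1, 1),
       Prediction(0, 1), Prediction(0, 0), Prediction(0, 1)]
v2n = [Prediction(1, 1), Prediction(0, 0), Prediction(1, 1),
       Prediction(0, 0), Prediction(0, 0), Prediction(0, 1)]

print(false_positive_rate(v2p), false_positive_rate(v2n), pairing_gap(v2p, v2n))
```

A gap near zero would mean the model treats patched code and unrelated normal code alike; a large positive gap is the pattern the paper describes for vanilla fine-tuning.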

What carries the argument

The Semantic Trap, a failure mode identified through TrapEval that measures whether fine-tuned models internalize vulnerability root causes or exploit surface textual patterns via paired datasets, gap analysis, and perturbation tests.

If this is right

  • Vanilla fine-tuned models will produce high false-positive rates when vulnerable code is tested against its actual patched version rather than unrelated normal code.
  • Adding explicit reasoning during training reduces sensitivity to textual gaps but lowers overall recall of real vulnerabilities.
  • Models continue to misread control flow and invent incorrect API behaviors even after reasoning supervision.
  • Scores on unpaired data overestimate true understanding of vulnerabilities under standard fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Security tools built on these models may need extra checks for gap dependency to avoid deploying brittle detectors.
  • Future training could target invariance to meaning-preserving code edits to reduce reliance on surface features.
  • The same pattern of pairing sensitivity may affect other code-related tasks where LLMs are fine-tuned on before-and-after examples.

Load-bearing premise

That performance gaps between V2P and V2N datasets, combined with CodeBLEU gap analysis and responses to semantic-preserving changes, can separate genuine learning of vulnerability causes from surface-level text pattern matching.
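One way to picture the gap analysis named in this premise is to compare the textual gap of pairs where the patched code was wrongly flagged against pairs where it was correctly cleared. The sketch below is an assumption-laden illustration: it substitutes `difflib.SequenceMatcher` from the Python standard library for CodeBLEU, and the helper names and the two code pairs are invented.

```python
from difflib import SequenceMatcher
from statistics import mean


def textual_gap(vulnerable, patched):
    """1 - character-level similarity; a crude stand-in for a CodeBLEU gap."""
    return 1.0 - SequenceMatcher(None, vulnerable, patched).ratio()


def mean_gap_by_outcome(samples):
    """samples: (vulnerable_code, patched_code, patched_was_flagged) triples.

    Returns (mean gap of wrongly flagged pairs, mean gap of cleared pairs).
    """
    flagged = [textual_gap(v, p) for v, p, fp in samples if fp]
    cleared = [textual_gap(v, p) for v, p, fp in samples if not fp]
    return (mean(flagged) if flagged else None,
            mean(cleared) if cleared else None)


# Invented pairs and outcomes, for illustration only.
samples = [
    ("strcpy(dst, src);", "strncpy(dst, src, n);", True),
    ("memcpy(buf, in, len);",
     "if (len > sizeof(buf)) return -1;\nmemcpy(buf, in, len);", False),
]

flagged_gap, cleared_gap = mean_gap_by_outcome(samples)
print(flagged_gap, cleared_gap)
```

If the two means differ systematically across a real test set, the model's decisions depend on how much the text changed, which is what the gap-dictated symptom refers to. The toy data here is constructed to show such a difference and says nothing about the paper's actual measurements.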

What would settle it

Models that keep high accuracy and low false-positive rates on V2P data even after semantic-preserving code edits, and show no performance difference between V2P and V2N pairings, would indicate escape from the Semantic Trap.
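For readers unfamiliar with semantic-preserving edits, here is a minimal sketch of one such edit, consistent variable renaming. It is written for Python source using the standard `ast` module (with `ast.unparse`, available from Python 3.9) purely for illustration; the paper's own perturbation tooling and target languages are not described here, and `RenameIdentifiers`, `perturb`, and the toy function are hypothetical.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Consistently rename selected variables and parameters."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def perturb(source, mapping):
    """Return source with identifiers renamed; comments are dropped by ast."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


ORIGINAL = """
def read_item(buf, idx):
    # Toy flaw: idx has no upper-bound check.
    if idx < 0:
        return None
    return buf[idx]
"""

PERTURBED = perturb(ORIGINAL, {"buf": "a0", "idx": "a1"})


def run(source, *args):
    """Execute the source and call the toy function positionally."""
    namespace = {}
    exec(source, namespace)
    return namespace["read_item"](*args)


print(PERTURBED)
```

The renamed function behaves identically and keeps the same missing bounds check, so a detector that understood the flaw should give the same verdict on both versions.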

Figures

Figures reproduced from arXiv: 2601.22655 by Fan Zhang, Feiyang Huang, Han Liu, Yang Liu, Yuqiang Sun, Ziqi Yang.

Figure 1: Overview of TrapEval. Text extracted alongside the caption states RQ1: “How effective are fine-tuned LLMs in vulnerability detection compared to pre-trained models, and how does the training data composition impact their detection capabilities?”
Figure 2: Prompt template used during fine-tuning and evaluation.
Original abstract

Large Language Models (LLMs) have shown promising performance in software vulnerability detection, particularly after domain-specific Supervised Fine-Tuning (SFT). However, it remains unclear whether these models genuinely internalize vulnerability root causes or merely exploit surface-level functional patterns. While prior work documented related failures on pre-trained or zero-shot models, the SFT process itself, and how explicit reasoning supervision modulates it, remains under-explored. We study fine-tuned decoder-only LLMs under vanilla SFT and SFT with reasoning supervision, identifying a failure mode we term the Semantic Trap, characterized by three symptoms: pairing-sensitive performance, gap-dictated decisions, and fragility to semantic-preserving changes. To probe this, we propose TrapEval, an evaluation framework comprising two real-world datasets, V2P (vulnerable paired with patched code) and V2N (vulnerable paired with unrelated normal code), alongside semantic perturbations, CodeBLEU-based gap analysis, and an LLM-assisted reasoning failure taxonomy. Evaluating five representative LLMs fine-tuned with and without explicit reasoning (Chain-of-Thought), our results show vanilla SFT yields deceptively high scores on unpaired data (V2N) while failing all three symptoms. Models suffer high false-positive rates on V2P, degrade under perturbations, and exhibit a systematic dependency on the textual gap between vulnerable and patched code. Finetuning with explicit reasoning reduces these symptoms but costs recall; its lack of measurable gap-dependency partly reflects a floor effect rather than escaping the trap. Furthermore, our taxonomy reveals these models still misinterpret control flow and hallucinate API behavior, indicating current fine-tuning mitigates but does not eliminate reliance on surface features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that fine-tuned decoder-only LLMs for software vulnerability detection do not genuinely internalize root causes but instead fall into a 'Semantic Trap' characterized by three symptoms: pairing-sensitive performance, gap-dictated decisions, and fragility to semantic-preserving changes. It introduces the TrapEval framework with V2P (vulnerable-patched pairs) and V2N (vulnerable-unrelated normal pairs) datasets, CodeBLEU gap analysis, semantic perturbations, and an LLM-assisted reasoning failure taxonomy. Evaluations across five LLMs show vanilla SFT yields deceptively high scores on V2N while failing all symptoms, with high false positives on V2P and degradation under perturbations; explicit reasoning supervision (Chain-of-Thought) reduces gap dependency and some symptoms but at the cost of recall, and the taxonomy reveals persistent misinterpretations of control flow and API behavior.

Significance. If the central findings hold after addressing verification gaps, the work is significant for AI security research by showing that vanilla SFT on vulnerability detection tasks encourages surface-pattern exploitation rather than root-cause understanding, while reasoning augmentation offers partial mitigation. Strengths include the construction of paired real-world datasets, the multi-model empirical evaluation, the introduction of a diagnostic framework with CodeBLEU metrics, and the provision of a failure taxonomy that highlights specific reasoning deficits.

major comments (1)
  1. §4 (TrapEval Framework, Semantic Perturbations subsection): The fragility-to-perturbations symptom is load-bearing for the Semantic Trap diagnosis and the claim that models exploit surface features. The manuscript assumes perturbations (variable renaming, equivalent restructuring) preserve both the vulnerability label and root cause, yet reports no expert review, static-analysis verification, or label-consistency checks on the perturbed samples. Without this, observed performance drops could reflect correct sensitivity to altered semantics or exploitability rather than surface-cue dependence, directly weakening support for the three-symptom characterization.
minor comments (2)
  1. Abstract and §5 (Results): While the high-level outcomes are clear, the absence of reported dataset sizes, sample counts per split, exclusion criteria, or error bars makes it harder to gauge the scale and reliability of the V2P/V2N performance gaps and the floor-effect explanation for reasoning models.
  2. §6 (Reasoning Failure Taxonomy): The taxonomy is a useful contribution, but adding quantitative breakdowns (e.g., percentage of failures per category across vanilla vs. reasoning SFT) would strengthen the link between observed symptoms and specific misinterpretations such as control-flow errors.
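The quantitative breakdown requested in the second minor comment amounts to a per-category share of labelled failures. A minimal sketch follows, assuming each failure has already been assigned one category label; the function name, the category strings, and the counts are hypothetical and do not come from the paper.

```python
from collections import Counter


def category_shares(failure_labels):
    """Map each failure category to its percentage of all labelled failures."""
    counts = Counter(failure_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Invented labels, for illustration only.
vanilla_sft = ["control_flow"] * 3 + ["api_hallucination"] * 1
reasoning_sft = ["control_flow"] * 1 + ["api_hallucination"] * 1

print(category_shares(vanilla_sft))
print(category_shares(reasoning_sft))
```

Reporting these shares side by side for vanilla and reasoning-supervised models would show whether reasoning supervision shifts the mix of failure types or only their total count.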

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and commit to revisions that will strengthen the empirical support for the fragility-to-perturbations symptom.

Point-by-point responses
  1. Referee: The fragility-to-perturbations symptom is load-bearing for the Semantic Trap diagnosis and the claim that models exploit surface features. The manuscript assumes perturbations (variable renaming, equivalent restructuring) preserve both the vulnerability label and root cause, yet reports no expert review, static-analysis verification, or label-consistency checks on the perturbed samples. Without this, observed performance drops could reflect correct sensitivity to altered semantics or exploitability rather than surface-cue dependence, directly weakening support for the three-symptom characterization.

    Authors: We agree that explicit verification is necessary to substantiate that performance drops under perturbation reflect surface-cue dependence rather than unintended semantic alteration. The perturbations were generated via standard label-preserving transformations (consistent variable renaming and equivalent restructuring that preserve control-flow and data-flow properties relevant to the vulnerability). However, the original manuscript did not detail verification procedures. In the revised version we will add a dedicated paragraph describing (1) static-analysis confirmation of label consistency on all perturbed samples and (2) manual expert review of a random 10% subset confirming that root-cause semantics remain unchanged. These additions will directly address the concern and reinforce the three-symptom characterization.

    Revision: yes
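The two verification steps promised in the response could be organised roughly as follows. This is a sketch under stated assumptions: labels are keyed by a sample ID, the re-derived labels come from some external static-analysis step that is not modelled here, and `review_subset` and `label_mismatches` are hypothetical helper names.

```python
import random


def review_subset(sample_ids, percent=10, seed=0):
    """Draw a reproducible random subset (rounded up) for manual expert review."""
    k = (len(sample_ids) * percent + 99) // 100  # integer ceiling
    return sorted(random.Random(seed).sample(sample_ids, k))


def label_mismatches(original_labels, rechecked_labels):
    """IDs whose re-derived label differs from the label before perturbation."""
    return [sid for sid, label in original_labels.items()
            if rechecked_labels.get(sid) != label]


# Invented IDs and labels, for illustration only.
ids = list(range(50))
subset = review_subset(ids)

original = {1: 1, 2: 1, 3: 0}
rechecked = {1: 1, 2: 0, 3: 0}  # sample 2 lost its vulnerable label

print(subset)
print(label_mismatches(original, rechecked))
```

Fixing the seed makes the reviewed subset reproducible, and any ID returned by `label_mismatches` marks a perturbation that changed the label and should be excluded from the fragility analysis.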

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation study

This is an empirical investigation that constructs new datasets (V2P, V2N), applies semantic perturbations, computes CodeBLEU gaps, and reports performance metrics plus an LLM-assisted taxonomy across five models. No equations, derivations, or fitted parameters are present that reduce to the inputs by construction. The three symptoms are diagnosed via direct measurement on held-out data rather than self-definition or self-citation chains. The framework is self-contained against external benchmarks (CodeBLEU, perturbation effects) and does not rely on load-bearing prior results from the same authors. Any concerns about label drift under perturbation are validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the proposed diagnostic tests isolate semantic understanding; no free parameters are fitted, one new diagnostic concept is introduced without external falsifiable evidence, and standard ML evaluation assumptions are used.

axioms (1)
  • Domain assumption: Performance on paired vulnerable-patched code and semantic perturbations distinguishes root-cause understanding from surface pattern exploitation.
    Invoked when defining the three symptoms of the Semantic Trap and interpreting results on V2P versus V2N.
invented entities (1)
  • Semantic Trap (no independent evidence)
    purpose: Characterize the observed failure mode of fine-tuned LLMs relying on textual gaps and pairing sensitivity rather than vulnerability semantics.
    New term coined to group the three symptoms; no independent evidence outside the paper's own evaluations is provided.

pith-pipeline@v0.9.0 · 5850 in / 1422 out tokens · 66127 ms · 2026-05-22T11:33:46.521809+00:00 · methodology
