Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap

Fan Zhang; Feiyang Huang; Han Liu; Yang Liu; Yuqiang Sun; Ziqi Yang

arxiv: 2601.22655 · v3 · pith:Z2OXII2Qnew · submitted 2026-01-30 · 💻 cs.CR · cs.SE

Pith reviewed 2026-05-22 11:33 UTC · model grok-4.3

classification 💻 cs.CR cs.SE

keywords: large language models, vulnerability detection, supervised fine-tuning, semantic understanding, code analysis, reasoning supervision, failure modes

The pith

Fine-tuned LLMs for vulnerability detection rely on superficial code differences rather than grasping root causes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether supervised fine-tuning makes large language models genuinely understand what creates software vulnerabilities or whether they mainly latch onto obvious textual cues in the code. It defines a Semantic Trap through three clear failure signs: models change their answers based on how vulnerable code is paired with other samples, they base decisions on how much the texts differ, and their accuracy drops when the code is altered without changing its meaning. This distinction matters because unreliable detectors could miss real security issues or flag safe code, especially in automated review tools. The authors compare standard fine-tuning against versions that add step-by-step reasoning, using datasets that pair vulnerable code with either its fix or unrelated normal code, plus targeted edits and gap measurements. Results indicate that plain fine-tuning produces misleadingly high scores on unpaired data while failing the three tests, and even reasoning supervision only partly eases the problem while lowering detection rates.

Core claim

The authors establish that decoder-only LLMs after vanilla supervised fine-tuning exhibit the Semantic Trap, defined by pairing-sensitive performance that drops when vulnerable code is shown with its actual patch instead of unrelated normal code, by decisions that track the textual gap between samples, and by fragility when the code undergoes semantic-preserving perturbations. Adding explicit reasoning during fine-tuning lessens these symptoms but reduces recall, with the reduced gap dependency partly due to a floor effect; a taxonomy of reasoning failures further shows persistent errors in control-flow interpretation and API behavior hallucination.
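The pairing-sensitivity symptom can be made concrete as a difference in false-positive rate between the two pairings. The sketch below is a minimal illustration, not the authors' evaluation code: the `Prediction` record, the function names, and the toy predictions are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    label: int      # ground truth: 1 = vulnerable, 0 = not vulnerable
    predicted: int  # model output, same encoding


def false_positive_rate(preds):
    """Share of non-vulnerable samples that the model flags as vulnerable."""
    negatives = [p for p in preds if p.label == 0]
    if not negatives:
        return 0.0
    return sum(p.predicted == 1 for p in negatives) / len(negatives)


def recall(preds):
    """Share of vulnerable samples that the model flags as vulnerable."""
    positives = [p for p in preds if p.label == 1]
    if not positives:
        return 0.0
    return sum(p.predicted == 1 for p in positives) / len(positives)


def pairing_sensitivity(v2p_preds, v2n_preds):
    """FPR on V2P (patched negatives) minus FPR on V2N (unrelated negatives)."""
    return false_positive_rate(v2p_preds) - false_positive_rate(v2n_preds)


# Toy predictions, invented for illustration only (not results from the paper).
v2p = [Prediction(1, 1), Prediction(0, 1), Prediction(1, 1),
       Prediction(0, 1), Prediction(1, 1), Prediction(0, 0)]
v2n = [Prediction(1, 1), Prediction(0, 0), Prediction(1, 1),
       Prediction(0, 0), Prediction(1, 0), Prediction(0, 0)]

print(f"pairing sensitivity: {pairing_sensitivity(v2p, v2n):.3f}")
```

A value near zero would mean the choice of negative sample does not change how often safe code is flagged; a large positive value is the pattern the core claim attributes to vanilla SFT.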

What carries the argument

The Semantic Trap is a failure mode identified through TrapEval, an evaluation framework that uses paired datasets, gap analysis, and perturbation tests to measure whether fine-tuned models internalize vulnerability root causes or exploit surface textual patterns.

If this is right

Vanilla fine-tuned models will produce high false-positive rates when vulnerable code is tested against its actual patched version rather than unrelated normal code.
Adding explicit reasoning during training reduces sensitivity to textual gaps but lowers overall recall of real vulnerabilities.
Models continue to misread control flow and invent incorrect API behaviors even after reasoning supervision.
Scores on unpaired data overestimate true understanding of vulnerabilities under standard fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Security tools built on these models may need extra checks for gap dependency to avoid deploying brittle detectors.
Future training could target invariance to meaning-preserving code edits to reduce reliance on surface features.
The same pattern of pairing sensitivity may affect other code-related tasks where LLMs are fine-tuned on before-and-after examples.

Load-bearing premise

That performance gaps between V2P and V2N datasets, combined with CodeBLEU gap analysis and responses to semantic-preserving changes, can separate genuine learning of vulnerability causes from surface-level text pattern matching.
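One way to picture the gap analysis in this premise is to bucket vulnerable/patched pairs by how much their text differs and compare accuracy across buckets. The sketch below is an assumption-laden illustration: it substitutes the standard library's `difflib.SequenceMatcher` for CodeBLEU, and the `threshold` value, function names, and toy pairs are hypothetical rather than taken from the paper.

```python
import difflib
from statistics import mean


def textual_gap(vulnerable: str, patched: str) -> float:
    """1 - character-level similarity; a stand-in for a CodeBLEU-based gap."""
    return 1.0 - difflib.SequenceMatcher(None, vulnerable, patched).ratio()


def accuracy_by_gap(pairs, threshold):
    """Split pairs at `threshold` and return (small-gap, large-gap) accuracy.

    pairs: iterable of (vulnerable_code, patched_code, pair_correct), where
    pair_correct says whether the model labelled both members correctly.
    A bucket with no pairs yields None.
    """
    small, large = [], []
    for vulnerable, patched, pair_correct in pairs:
        bucket = small if textual_gap(vulnerable, patched) < threshold else large
        bucket.append(1.0 if pair_correct else 0.0)
    return (mean(small) if small else None, mean(large) if large else None)


# Toy pairs, invented for illustration only (not data from the paper).
toy_pairs = [
    ("if (i <= n) buf[i] = 0;", "if (i < n) buf[i] = 0;", False),
    ("strcpy(dst, src);",
     "if (strlen(src) < cap) { memcpy(dst, src, strlen(src) + 1); }", True),
]
small_acc, large_acc = accuracy_by_gap(toy_pairs, threshold=0.3)
print("small-gap accuracy:", small_acc, "| large-gap accuracy:", large_acc)
```

If accuracy rises with the gap, decisions track textual distance rather than the root cause, which is the gap-dictated symptom. A flat curve at low accuracy would be consistent with the floor effect the paper mentions for reasoning-supervised models.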

What would settle it

Models that keep high accuracy and low false-positive rates on V2P data even after semantic-preserving code edits, and show no performance difference between V2P and V2N pairings, would indicate escape from the Semantic Trap.

Figures

Figures reproduced from arXiv: 2601.22655 by Fan Zhang, Feiyang Huang, Han Liu, Yang Liu, Yuqiang Sun, Ziqi Yang.

**Figure 1.** Overview of TrapEval. Accompanying text (truncated in the source): "… how different data structures impact detection capabilities. Based on this, our RQ1 is “How effective are fine-tuned LLMs in vulnerability detection compared to pre-trained models, and how does the training data composition impact their detection capabilities?” Secondly, the reliability of a detector depends on its ability to capture the intrinsic logic of a vulnerability rather than …"

**Figure 2.** Prompt template used during fine-tuning and evaluation.

Original abstract

Large Language Models (LLMs) have shown promising performance in software vulnerability detection, particularly after domain-specific Supervised Fine-Tuning (SFT). However, it remains unclear whether these models genuinely internalize vulnerability root causes or merely exploit surface-level functional patterns. While prior work documented related failures on pre-trained or zero-shot models, the SFT process itself, and how explicit reasoning supervision modulates it, remains under-explored. We study fine-tuned decoder-only LLMs under vanilla SFT and SFT with reasoning supervision, identifying a failure mode we term the Semantic Trap, characterized by three symptoms: pairing-sensitive performance, gap-dictated decisions, and fragility to semantic-preserving changes. To probe this, we propose TrapEval, an evaluation framework comprising two real-world datasets, V2P (vulnerable paired with patched code) and V2N (vulnerable paired with unrelated normal code), alongside semantic perturbations, CodeBLEU-based gap analysis, and an LLM-assisted reasoning failure taxonomy. Evaluating five representative LLMs fine-tuned with and without explicit reasoning (Chain-of-Thought), our results show vanilla SFT yields deceptively high scores on unpaired data (V2N) while failing all three symptoms. Models suffer high false-positive rates on V2P, degrade under perturbations, and exhibit a systematic dependency on the textual gap between vulnerable and patched code. Finetuning with explicit reasoning reduces these symptoms but costs recall; its lack of measurable gap-dependency partly reflects a floor effect rather than escaping the trap. Furthermore, our taxonomy reveals these models still misinterpret control flow and hallucinate API behavior, indicating current fine-tuning mitigates but does not eliminate reliance on surface features.
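The difference between the two datasets comes down to which negative sample accompanies each vulnerable function. The following sketch shows that distinction under stated assumptions: the `(vulnerable_code, patched_code)` record layout, the `normal_pool` list, and the random choice of unrelated code are hypothetical, and the paper's actual construction pipeline is not described in this summary.

```python
import random
from typing import NamedTuple


class Sample(NamedTuple):
    code: str
    label: int  # 1 = vulnerable, 0 = non-vulnerable


def build_v2p(fix_records):
    """V2P: each vulnerable function is paired with its own patched version."""
    samples = []
    for vulnerable, patched in fix_records:
        samples.append(Sample(vulnerable, 1))
        samples.append(Sample(patched, 0))
    return samples


def build_v2n(fix_records, normal_pool, seed=0):
    """V2N: each vulnerable function is paired with unrelated normal code."""
    rng = random.Random(seed)  # fixed seed keeps the pairing reproducible
    samples = []
    for vulnerable, _patched in fix_records:
        samples.append(Sample(vulnerable, 1))
        samples.append(Sample(rng.choice(normal_pool), 0))
    return samples


# Placeholder strings stand in for real functions.
fix_records = [("vuln_a", "patch_a"), ("vuln_b", "patch_b")]
normal_pool = ["normal_x", "normal_y", "normal_z"]
v2p_samples = build_v2p(fix_records)
v2n_samples = build_v2n(fix_records, normal_pool)
```

Both sets share the same positives, so any performance difference between them can only come from the negatives, which is what makes the V2P versus V2N comparison diagnostic.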

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Fine-tuned LLMs for vulnerability detection often score high by latching onto textual gaps rather than root causes, and the new paired datasets plus perturbation tests make that pattern visible, though the tests themselves need label verification.

The main point from this work is that vanilla supervised fine-tuning on vulnerability detection tasks produces models that achieve high scores on unpaired vulnerable and normal code but struggle when the vulnerable code is paired with its actual patch or when minor semantic-preserving changes are made. What is new here is the focus on SFT models specifically, including a comparison to versions trained with explicit reasoning supervision like Chain-of-Thought. The TrapEval framework with the V2P and V2N datasets, along with the use of CodeBLEU for gap analysis and an LLM-assisted taxonomy of reasoning failures, extends earlier observations from zero-shot settings.

The paper does well in constructing these paired datasets from real-world sources and testing fragility through perturbations such as variable renaming and equivalent restructuring. This setup helps illustrate the three symptoms of the Semantic Trap: pairing sensitivity, gap dependency, and performance drops under changes.

That said, there are some soft spots. The description lacks details on dataset sizes, statistical significance, or exact exclusion criteria, which makes it hard to gauge how robust the results are. More importantly, the perturbations are not verified by experts or static analysis to confirm they preserve the vulnerability label and root cause. If some changes inadvertently fix or alter the vulnerability, then the observed fragility might reflect accurate adaptation rather than reliance on shallow features. The taxonomy being LLM-assisted also raises questions about its reliability compared to manual review.

This kind of paper is for researchers in AI for software security who want to improve evaluation methods beyond standard benchmarks. A reader looking for ways to diagnose when models are not truly understanding code vulnerabilities would get value from the framework ideas. It deserves a serious referee because the problem it identifies affects the trustworthiness of AI-assisted auditing tools. I would recommend sending it to peer review with requests for more methodological details and validation of the perturbed samples.

Referee Report

1 major / 2 minor

Summary. The paper claims that fine-tuned decoder-only LLMs for software vulnerability detection do not genuinely internalize root causes but instead fall into a 'Semantic Trap' characterized by three symptoms: pairing-sensitive performance, gap-dictated decisions, and fragility to semantic-preserving changes. It introduces the TrapEval framework with V2P (vulnerable-patched pairs) and V2N (vulnerable-unrelated normal pairs) datasets, CodeBLEU gap analysis, semantic perturbations, and an LLM-assisted reasoning failure taxonomy. Evaluations across five LLMs show vanilla SFT yields deceptively high scores on V2N while failing all symptoms, with high false positives on V2P and degradation under perturbations; explicit reasoning supervision (Chain-of-Thought) reduces gap dependency and some symptoms but at the cost of recall, and the taxonomy reveals persistent misinterpretations of control flow and API behavior.

Significance. If the central findings hold after addressing verification gaps, the work is significant for AI security research by showing that vanilla SFT on vulnerability detection tasks encourages surface-pattern exploitation rather than root-cause understanding, while reasoning augmentation offers partial mitigation. Strengths include the construction of paired real-world datasets, the multi-model empirical evaluation, the introduction of a diagnostic framework with CodeBLEU metrics, and the provision of a failure taxonomy that highlights specific reasoning deficits.

major comments (1)

[§4 (TrapEval Framework, Semantic Perturbations subsection)] The fragility-to-perturbations symptom is load-bearing for the Semantic Trap diagnosis and the claim that models exploit surface features. The manuscript assumes perturbations (variable renaming, equivalent restructuring) preserve both the vulnerability label and root cause, yet reports no expert review, static-analysis verification, or label-consistency checks on the perturbed samples. Without this, observed performance drops could reflect correct sensitivity to altered semantics or exploitability rather than surface-cue dependence, directly weakening support for the three-symptom characterization.
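To illustrate what a label-consistency check on a renaming perturbation might look like, here is a small sketch. It is an assumption throughout: it uses Python (3.9+ for `ast.unparse`) purely for illustration, the `rename`, `outcome`, and `label_preserved` helpers and the sample function are hypothetical, and comparing behaviour on a few inputs is a much weaker guarantee than the static analysis or expert review the referee asks for.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Consistently rename identifiers according to `mapping`."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def rename(source, mapping):
    """Return `source` with identifiers renamed; comments are dropped."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def outcome(source, func_name, args):
    """Run the function and summarise the result or the exception type."""
    namespace = {}
    exec(source, namespace)
    try:
        return ("ok", namespace[func_name](*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def label_preserved(original, perturbed, func_name, inputs):
    """True if both versions behave identically on every probe input."""
    return all(outcome(original, func_name, args) == outcome(perturbed, func_name, args)
               for args in inputs)


ORIGINAL = """
def read_item(buf, idx):
    if idx <= len(buf):  # off-by-one: idx == len(buf) is out of range
        return buf[idx]
    return None
"""

perturbed = rename(ORIGINAL, {"buf": "v0", "idx": "v1"})
probes = [([1, 2, 3], 0), ([1, 2, 3], 3), ([1, 2, 3], 5)]
print(perturbed)
print("label preserved:", label_preserved(ORIGINAL, perturbed, "read_item", probes))
```

The second probe triggers the off-by-one in both versions, so the flaw survives the rename. A perturbation that silently removed the flaw would fail this check and should be excluded before measuring fragility.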

minor comments (2)

[Abstract and §5 (Results)] While the high-level outcomes are clear, the absence of reported dataset sizes, sample counts per split, exclusion criteria, or error bars makes it harder to gauge the scale and reliability of the V2P/V2N performance gaps and the floor-effect explanation for reasoning models.
[§6 (Reasoning Failure Taxonomy)] The taxonomy is a useful contribution, but adding quantitative breakdowns (e.g., percentage of failures per category across vanilla vs. reasoning SFT) would strengthen the link between observed symptoms and specific misinterpretations such as control-flow errors.
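The quantitative breakdown requested in the last comment amounts to a per-regime percentage table. The sketch below shows one possible shape for it; the regime names, the category labels beyond the two the paper mentions (control flow and API behavior), and the toy annotations are all hypothetical and are not results from the paper.

```python
from collections import Counter


def failure_breakdown(records):
    """Return {regime: {category: percent of that regime's failures}}.

    records: iterable of (training_regime, failure_category) annotations.
    """
    counts = {}
    for regime, category in records:
        counts.setdefault(regime, Counter())[category] += 1
    return {
        regime: {cat: 100.0 * n / sum(counter.values()) for cat, n in counter.items()}
        for regime, counter in counts.items()
    }


# Toy annotations, invented for illustration only.
toy_annotations = [
    ("vanilla_sft", "control_flow"),
    ("vanilla_sft", "control_flow"),
    ("vanilla_sft", "control_flow"),
    ("vanilla_sft", "api_hallucination"),
    ("reasoning_sft", "control_flow"),
    ("reasoning_sft", "api_hallucination"),
]

for regime, shares in failure_breakdown(toy_annotations).items():
    print(regime, {cat: f"{pct:.0f}%" for cat, pct in shares.items()})
```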

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and commit to revisions that will strengthen the empirical support for the fragility-to-perturbations symptom.

Referee: The fragility-to-perturbations symptom is load-bearing for the Semantic Trap diagnosis and the claim that models exploit surface features. The manuscript assumes perturbations (variable renaming, equivalent restructuring) preserve both the vulnerability label and root cause, yet reports no expert review, static-analysis verification, or label-consistency checks on the perturbed samples. Without this, observed performance drops could reflect correct sensitivity to altered semantics or exploitability rather than surface-cue dependence, directly weakening support for the three-symptom characterization.

Authors: We agree that explicit verification is necessary to substantiate that performance drops under perturbation reflect surface-cue dependence rather than unintended semantic alteration. The perturbations were generated via standard label-preserving transformations (consistent variable renaming and equivalent restructuring that preserve control-flow and data-flow properties relevant to the vulnerability). However, the original manuscript did not detail verification procedures. In the revised version we will add a dedicated paragraph describing (1) static-analysis confirmation of label consistency on all perturbed samples and (2) manual expert review of a random 10% subset confirming that root-cause semantics remain unchanged. These additions will directly address the concern and reinforce the three-symptom characterization.

Revision: yes
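The random 10% subset promised for expert review is only auditable if the draw can be reproduced. A minimal sketch of such a draw follows; the function name, the seed, and the integer sample identifiers are assumptions made for illustration and do not describe the authors' procedure.

```python
import random


def expert_review_subset(sample_ids, percent=10, seed=0):
    """Reproducibly draw `percent`% of the perturbed samples (rounded up)."""
    ids = sorted(sample_ids)
    if not ids:
        return []
    k = max(1, (len(ids) * percent + 99) // 100)  # integer ceiling
    return sorted(random.Random(seed).sample(ids, k))


# Hypothetical identifiers for 200 perturbed samples.
subset = expert_review_subset(range(200))
print(len(subset), "samples selected for manual review")
```

Publishing the seed alongside the identifiers would let a reader confirm that the reviewed subset was not chosen after seeing model outputs.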

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation study

This is an empirical investigation that constructs new datasets (V2P, V2N), applies semantic perturbations, computes CodeBLEU gaps, and reports performance metrics plus an LLM-assisted taxonomy across five models. No equations, derivations, or fitted parameters are present that reduce to the inputs by construction. The three symptoms are diagnosed via direct measurement on held-out data rather than self-definition or self-citation chains. The framework is self-contained against external benchmarks (CodeBLEU, perturbation effects) and does not rely on load-bearing prior results from the same authors. Any concerns about label drift under perturbation are validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the proposed diagnostic tests isolate semantic understanding; no free parameters are fitted, one new diagnostic concept is introduced without external falsifiable evidence, and standard ML evaluation assumptions are used.

axioms (1)

Domain assumption: Performance on paired vulnerable-patched code and semantic perturbations distinguishes root-cause understanding from surface pattern exploitation.
Invoked when defining the three symptoms of the Semantic Trap and interpreting results on V2P versus V2N.

invented entities (1)

Semantic Trap (no independent evidence)
purpose: Characterize the observed failure mode of fine-tuned LLMs relying on textual gaps and pairing sensitivity rather than vulnerability semantics.
New term coined to group the three symptoms; no independent evidence outside the paper's own evaluations is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose TrapEval... V2P dataset (Vulnerable-to-Patch pairs) and V2N dataset... semantic-preserving perturbations... CodeBLEU-based gap analysis"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "fragility to semantic-preserving changes... gap-dictated decisions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[10] [10]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE] https://arxiv.org/abs/2401.14196

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al

work page

[12] [12]

Lora: Low-rank adaptation of large language models.ICLR1, 2 (2022), 3

work page 2022

[13] [13]

Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee

work page

[14] [14]

InProceedings of the 2023 conference on empirical methods in natural language processing

Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. InProceedings of the 2023 conference on empirical methods in natural language processing. 5254–5276

work page 2023

[15] [15]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. 2024. Qwen2. 5-Coder Technical Report.arXiv preprint arXiv:2409.12186(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Xuefeng Jiang, Lvhua Wu, Sheng Sun, Jia Li, Jingjing Xue, Yuwei Wang, Tingting Wu, and Min Liu. 2024. Investigating large language models for code vulnerability detection: An experimental study.arXiv preprint arXiv:2412.18260(2024)

work page arXiv 2024

[17] [17]

Mete Keltek, Rong Hu, Mohammadreza Fani Sani, and Ziyue Li. 2025. LSAST: Enhancing Cybersecurity Through LLM-Supported Static Application Security Testing. InIFIP International Conference on ICT Systems Security and Privacy J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018. 111:20 Huang et al. Protection. Springer, 166–179

work page 2025

[18] [18]

Avishree Khare, Saikat Dutta, Ziyang Li, Alaia Solko-Breslin, Rajeev Alur, and Mayur Naik. 2025. Understanding the effectiveness of large language models in detecting security vulnerabilities. In2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 103–114

work page 2025

[19] [19]

Vasileios Kouliaridis, Georgios Karopoulos, and Georgios Kambourakis. 2024. Assessing the effectiveness of llms in android application vulnerability analysis. InInternational Conference on Attacks and Defenses for Internet-of-Things. Springer, 139–154

work page 2024

[20] [20]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles. 611–626

work page 2023

[21] [21]

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning.arXiv preprint arXiv:2104.08691(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[22] [22]

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation.arXiv preprint arXiv:2101.00190(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[23] [23]

Yansong Li, Paula Branco, Alexander M Hoole, Manish Marwah, Hari Manassery Koduvely, Guy-Vincent Jourdan, and Stephan Jou. 2025. SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis. In2025 IEEE Symposium on Security and Privacy (SP). IEEE, 3014–3032

work page 2025

[24] [24]

Yue Li, Xiao Li, Hao Wu, Minghui Xu, Yue Zhang, Xiuzhen Cheng, Fengyuan Xu, and Sheng Zhong. 2025. Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask.arXiv preprint arXiv:2504.13474 (2025)

work page arXiv 2025

[25] [25]

Ziyang Li, Saikat Dutta, and Mayur Naik. 2024. IRIS: LLM-assisted static analysis for detecting security vulnerabilities. arXiv preprint arXiv:2405.17238(2024)

work page arXiv 2024

[26] [26]

Zhihong Liu, Zezhou Yang, and Qing Liao. 2024. Exploration on prompting LLM with code-specific information for vulnerability detection. In2024 IEEE International Conference on Software Services Engineering (SSE). IEEE, 273–281

work page 2024

[27] [27]

Guilong Lu, Xiaolin Ju, Xiang Chen, Wenlong Pei, and Zhilong Cai. 2024. GRACE: Empowering LLM-based software vulnerability detection with graph structure and in-context learning.Journal of Systems and Software212 (2024), 112031

work page 2024

[28] [28]

Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, and Yang Liu. 2024. Combining fine-tuning and llm-based agents for intuitive smart contract auditing with justifications.arXiv preprint arXiv:2403.16073(2024)

work page arXiv 2024

[29] [29]

Andrew A Mahyari. 2024. Harnessing the power of llms in source code vulnerability detection. InMILCOM 2024-2024 IEEE Military Communications Conference (MILCOM). IEEE, 251–256

work page 2024

[30] [30]

Qiheng Mao, Zhenhao Li, Xing Hu, Kui Liu, Xin Xia, and Jianling Sun. 2024. Towards effectively detecting and explaining vulnerabilities using large language models.arXiv e-prints(2024), arXiv–2406

work page 2024

[31] [31]

Noble Saji Mathews, Yelizaveta Brus, Yousra Aafer, Meiyappan Nagappan, and Shane McIntosh. 2024. Llbezpeky: Leveraging large language models for vulnerability detection.arXiv preprint arXiv:2401.01269(2024)

work page arXiv 2024

[32] [32]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human fee...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[33] [33]

Aleksandar Petrov, Philip HS Torr, and Adel Bibi. 2023. When do prompting and prefix-tuning work? a theory of capabilities and limitations.arXiv preprint arXiv:2310.19698(2023)

work page arXiv 2023

[34] [34]

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis.arXiv preprint arXiv:2009.10297 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020

[35] [35]

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nico...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

Ze Sheng, Zhicheng Chen, Shuning Gu, Heqing Huang, Guofei Gu, and Jeff Huang. 2025. Llms in software security: A survey of vulnerability detection techniques and insights.Comput. Surveys58, 5 (2025), 1–35

work page 2025

[37] [37]

Cheng Shi, Jiongchi Yu, Ziming Zhao, Jiongyi Chen, and Fan Zhang. 2025. CGIFuzz: Enabling Gray-Box Fuzzing for Web CGI of IoT Devices.IEEE Transactions on Information Forensics and Security(2025)

work page 2025

[38] [38]

Junhan Shi, Yijia Zhu, Zhenning Shi, Dan Zhao, Qing Li, and Yong Jiang. 2025. SpecCoT: Accelerating Chain-of-Thought Reasoning through Speculative Exploration. InFindings of the Association for Computational Linguistics: EMNLP 2025. 24405–24415. J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018. The Semantic Trap: Do Fine-tuned LLMs Learn...

work page 2025

[39] [39]

Parul V Sindhwad, Prateek Ranka, Siddhi Muni, and Faruk Kazi. 2025. VulnArmor: mitigating software vulnerabilities with code resolution and detection techniques.International Journal of Information Technology17, 9 (2025), 5393–5408

work page 2025

[40] [40]

Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Hengbo Tong, Swarna Das, Earl T Barr, and Wei Le. 2024. To err is machine: Vulnerability detection challenges llm reasoning.arXiv preprint arXiv:2403.17218(2024)

work page arXiv 2024

[41] [41]

Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Wei Ma, Lyuye Zhang, Yang Liu, and Yingjiu Li. 2024. Llm4vuln: A unified evaluation framework for decoupling and enhancing llms’ vulnerability reasoning.arXiv preprint arXiv:2401.16185 (2024)

work page arXiv 2024

[42] [42]

Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Haijun Wang, Zhengzi Xu, Xiaofei Xie, and Yang Liu. 2024. Gptscan: Detecting logic vulnerabilities in smart contracts by combining gpt with program analysis. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

work page 2024

[43] [43]

Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, and Gianluca Stringhini. 2024. Llms cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks. In2024 IEEE symposium on security and privacy (SP). IEEE, 862–880

work page 2024

[45] [45]

Jin Wang, Zishan Huang, Hengli Liu, Nianyi Yang, and Yinhao Xiao. 2023. Defecthunter: A novel llm-driven boosted- conformer-based code vulnerability detection mechanism.arXiv preprint arXiv:2309.15324(2023)

work page arXiv 2023

[46] [46]

Xin-Cheng Wen, Cuiyun Gao, Shuzheng Gao, Yang Xiao, and Michael R Lyu. 2024. Scale: Constructing structured natural language comment trees for software vulnerability detection. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 235–247

work page 2024

[47] [47]

Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. 2023. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment.arXiv preprint arXiv:2312.12148(2023)

work page arXiv 2023

[48] [48]

Aidan ZH Yang, Claire Le Goues, Ruben Martins, and Vincent Hellendoorn. 2024. Large language models for test-free fault localization. InProceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12

work page 2024

[49] [49]

Xin Yin, Chao Ni, and Shaohua Wang. 2024. Multitask-based evaluation of open-source llm on software vulnerability. IEEE Transactions on Software Engineering(2024)

work page 2024

[50] [50]

Jian Zhang, Chong Wang, Anran Li, Weisong Sun, Cen Zhang, Wei Ma, and Yang Liu. 2024. An empirical study of automated vulnerability localization with large language models.arXiv preprint arXiv:2404.00287(2024)

work page arXiv 2024

[51] [51]

Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. 2024. SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning. arXiv:2408.05517 [cs.CL] https://arxiv.org/abs/2408.05517

work page arXiv 2024

[52] [52]

Xin Zhou, Sicong Cao, Xiaobing Sun, and David Lo. 2025. Large language model for vulnerability detection and repair: Literature review and the road ahead.ACM Transactions on Software Engineering and Methodology34, 5 (2025), 1–31

work page 2025

[53] [53]

Xin Zhou, Duc-Manh Tran, Thanh Le-Cong, Ting Zhang, Ivana Clairine Irsan, Joshua Sumarlin, Bach Le, and David Lo. 2024. Comparison of static application security testing tools and large language models for repo-level vulnerability detection.arXiv preprint arXiv:2407.16235(2024). Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009 J. ACM...

work page arXiv 2024