Security Is Relative: Training-Free Vulnerability Detection via Multi-Agent Behavioral Contract Synthesis

Yongchao Wang; Zhiqiu Huang

arxiv: 2604.19012 · v1 · submitted 2026-04-21 · 💻 cs.CR · cs.SE

Pith reviewed 2026-05-10 03:16 UTC · model grok-4.3

keywords: vulnerability detection · behavioral contract · Gherkin specification · multi-agent framework · semantic ambiguity · training-free · relative security · compliance checking

The pith

Vulnerability detection succeeds by synthesizing each project's security contract from code context rather than classifying syntax in isolation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep learning models for vulnerability detection degrade sharply on deduplicated benchmarks because identical code can be secure or vulnerable depending on project-specific behavioral rules. The paper identifies this semantic ambiguity as the core problem and presents Phoenix, a training-free multi-agent framework that extracts minimal relevant context, reverse-engineers security requirements into Gherkin specifications, and judges code compliance against those specifications. This relative approach reaches an F1 score of 0.825 and 64.4 percent pair-correct on PrimeVul Paired, outperforming prior methods while using open-source models up to 48 times smaller. Ablation studies show the synthesized Gherkin contracts drive the performance gains, and error analysis finds that some apparent false positives reflect genuine security issues in the patched code.

Core claim

Phoenix resolves semantic ambiguity in vulnerability detection by decomposing the task into Semantic Slicer, Requirement Reverse Engineer, and Contract Judge stages. The Requirement Reverse Engineer synthesizes Gherkin behavioral specifications that encode the project-specific security contract from limited code context; the Contract Judge then performs strict compliance checking to decide whether code is vulnerable relative to that contract. On PrimeVul Paired this yields F1 of 0.825 and Pair-Correct of 64.4 percent while using models up to 48 times smaller than baselines.
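The three-stage decomposition can be sketched as a small pipeline. This is only an illustration of the data flow described above, not the authors' implementation: the `LLM` callable type, the prompt prefixes, the `Verdict` class, and the deterministic `fake_llm` stand-in are all hypothetical names introduced here so the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical abstraction: a model is any prompt -> completion function.
LLM = Callable[[str], str]


@dataclass
class Verdict:
    vulnerable: bool
    contract: str  # the Gherkin text the code was judged against


def semantic_slicer(llm: LLM, function_code: str, context: str) -> str:
    """Stage 1: request the minimal vulnerability-relevant context."""
    return llm(f"SLICE\n{context}\n---\n{function_code}")


def requirement_reverse_engineer(llm: LLM, sliced: str) -> str:
    """Stage 2: synthesize a Gherkin specification of the security contract."""
    return llm(f"GHERKIN\n{sliced}")


def contract_judge(llm: LLM, function_code: str, contract: str) -> bool:
    """Stage 3: strict compliance check; a reported violation means vulnerable."""
    answer = llm(f"JUDGE\n{contract}\n---\n{function_code}")
    return answer.strip().upper().startswith("VIOLATION")


def detect(llm: LLM, function_code: str, context: str) -> Verdict:
    """Run slicer -> reverse engineer -> judge and keep the contract for audit."""
    sliced = semantic_slicer(llm, function_code, context)
    contract = requirement_reverse_engineer(llm, sliced)
    return Verdict(contract_judge(llm, function_code, contract), contract)


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a model (assumption, for illustration only)."""
    if prompt.startswith("SLICE"):
        return prompt.split("---\n", 1)[1]  # echo the function as the slice
    if prompt.startswith("GHERKIN"):
        return (
            "Feature: bounded copy\n"
            "  Scenario: input longer than the buffer\n"
            "    Given a destination buffer of fixed size\n"
            "    When the input length exceeds that size\n"
            "    Then the copy is rejected or truncated"
        )
    code = prompt.split("---\n", 1)[1]  # JUDGE prompt: inspect the code part
    return "VIOLATION" if "strcpy" in code else "COMPLIANT"


vulnerable_case = detect(fake_llm, "strcpy(dst, src);", "char dst[16];")
patched_case = detect(fake_llm, 'snprintf(dst, sizeof dst, "%s", src);', "char dst[16];")
```

Returning the contract alongside the verdict reflects the relative framing: a verdict is only meaningful with respect to the specification it was judged against.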

What carries the argument

Behavioral Contract Synthesis: a three-stage multi-agent process that extracts minimal vulnerability-relevant context, produces Gherkin behavioral specifications encoding the project security contract, and evaluates strict compliance of the code against those specifications.

If this is right

Global classification is fundamentally inadequate once project-specific behavioral contracts are taken into account.
Gherkin specifications are the decisive driver, delivering F1 gains of 0.09 to 0.35 across 25 ablation configurations.
18 percent of the system's false positives identify genuine security concerns present in the patched versions of the code.
Training-free multi-agent methods using 7-14B open-source models can surpass trained systems that rely on 671B-parameter models on paired vulnerability data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Contract synthesis could be embedded in code-review tools to generate and enforce explicit security expectations automatically.
Vulnerability benchmarks should include behavioral context and paired secure/vulnerable versions to reflect real deployment conditions.
The relative view of security may generalize to other context-dependent properties such as performance or compatibility.
Reliable synthesized contracts could support automated security testing suites that check code against documented rules.

Load-bearing premise

LLM-based agents can reliably synthesize accurate and complete Gherkin behavioral specifications that encode the true project-specific security contract from limited code context without human oversight.

What would settle it

Measure the match between Phoenix-synthesized Gherkin specifications and independently authored project security contracts on a held-out set of code pairs; if detection accuracy collapses when the synthesized contracts diverge from the true ones, the central claim is falsified.
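One crude way to quantify such a match is step-level overlap between two Gherkin texts. The sketch below is an assumption of this write-up, not a metric from the paper: it extracts Given/When/Then/And/But steps, normalizes whitespace and case, and reports Jaccard similarity. Exact string matching will undercount paraphrased steps, so a real study would likely need semantic matching or expert review.

```python
import re
from typing import Set

# Standard Gherkin step keywords (lowercased for comparison).
STEP_KEYWORDS = ("given", "when", "then", "and", "but")


def gherkin_steps(spec: str) -> Set[str]:
    """Collect normalized step texts, ignoring Feature/Scenario header lines."""
    steps = set()
    for line in spec.splitlines():
        words = line.strip().lower().split(None, 1)
        if len(words) == 2 and words[0] in STEP_KEYWORDS:
            steps.add(re.sub(r"\s+", " ", words[1]))
    return steps


def step_jaccard(synthesized: str, reference: str) -> float:
    """Jaccard similarity of the two step sets (1.0 when both are empty)."""
    a, b = gherkin_steps(synthesized), gherkin_steps(reference)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Illustrative specifications, not taken from the paper.
synthesized_spec = """
Feature: packet length validation
  Scenario: oversized length field
    Given a length field from the network
    When the length exceeds the buffer size
    Then the packet is rejected
"""

reference_spec = """
Feature: packet length validation
  Scenario: oversized length field
    Given a length field from the network
    When the length exceeds the buffer size
    Then the packet is rejected
    And an error is logged
"""

overlap = step_jaccard(synthesized_spec, reference_spec)
```

Correlating a score like this with per-pair detection outcomes would be one way to test whether accuracy tracks contract fidelity.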

Figures

Figures reproduced from arXiv: 2604.19012 by Yongchao Wang, Zhiqiu Huang.

Original abstract

Deep learning for vulnerability detection has shown promising results on early benchmarks, but recent evaluations reveal catastrophic degradation: models achieving F1 > 0.68 on legacy datasets collapse to 0.031 under strict deduplication. We identify the root cause as the semantic ambiguity problem: identical code can be secure or vulnerable depending on project-specific behavioral contracts, rendering global classification fundamentally inadequate. We propose Phoenix, a training-free multi-agent framework that resolves this ambiguity through Behavioral Contract Synthesis. Phoenix decomposes detection into three stages: a Semantic Slicer extracting minimal vulnerability-relevant context, a Requirement Reverse Engineer synthesizing Gherkin behavioral specifications encoding the security contract, and a Contract Judge evaluating code against these specifications via strict compliance checking. On PrimeVul Paired, Phoenix achieves F1 = 0.825 and Pair-Correct = 64.4%, surpassing RASM-Vul (F1 = 0.668) and VulTrial (F1 = 0.563) while using open-source models up to 48x smaller (7-14B vs. 671B). Ablation across 25 configurations demonstrates Gherkin specifications as the decisive driver (+0.09 to +0.35 F1). Error analysis reveals 18% of "False Positives" identify genuine security concerns in patched code, demonstrating that security is a relative property defined against behavioral contracts, not an absolute property of code syntax.
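The two headline metrics can be made concrete with a short sketch. It assumes Pair-Correct follows the pair-wise definition used by the PrimeVul benchmark, where a pair counts only if the vulnerable version is flagged and its patched counterpart is not; this page does not spell the definition out. The function names and the toy predictions are illustrative and are not data from the paper.

```python
from typing import List, Tuple


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 with 1 = vulnerable; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pair_correct(pairs: List[Tuple[int, int]]) -> float:
    """Fraction of pairs predicted (vulnerable=1, patched=0).

    Each tuple is (prediction on vulnerable version, prediction on patched version).
    """
    if not pairs:
        return 0.0
    return sum(1 for vuln, patched in pairs if vuln == 1 and patched == 0) / len(pairs)


# Toy predictions for four vulnerable/patched pairs (illustrative only).
toy_pairs = [(1, 0), (1, 1), (0, 0), (1, 0)]

# Flatten the pairs into per-function labels: vulnerable = 1, patched = 0.
toy_true = [label for _ in toy_pairs for label in (1, 0)]
toy_pred = [pred for pair in toy_pairs for pred in pair]

toy_f1 = f1_score(toy_true, toy_pred)
toy_pc = pair_correct(toy_pairs)
```

A detector that flags everything can still score a nontrivial F1 on balanced paired data, but its Pair-Correct is zero, which is why the paired metric is the stricter of the two.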

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Phoenix's multi-agent Gherkin contract synthesis is a fresh angle on semantic ambiguity in vulnerability detection, but the results hinge on unverified specification quality.

The paper's main move is to treat security as relative to project-specific behavioral contracts instead of a fixed property of code snippets. Phoenix breaks detection into semantic slicing, LLM-driven reverse engineering of Gherkin specs, and a contract judge that checks compliance. It runs training-free on small open-source models and posts an F1 of 0.825 and 64.4% Pair-Correct on PrimeVul Paired, ahead of the cited baselines. Ablations credit the Gherkin stage for most of the lift, and the error analysis on false positives is useful because it shows some of those cases flag real issues in the patched versions. That supports the relativity point without needing new labels. The work is honest about the degradation problem on deduplicated data and tries to address it directly rather than just scaling models. The citation pattern looks reasonable for the sub-area.

The soft spot is exactly what the stress test flagged: there is no direct measurement of whether the synthesized Gherkin specs are accurate, complete, or faithful to the actual security contract. Gains could be coming from richer prompting or incidental LLM knowledge rather than contract synthesis itself. Dataset construction and baseline re-evaluation details are also thin, which makes the numerical claims harder to interpret.

This is aimed at researchers building LLM tools for code security who are tired of brittle classifiers. It shows clear thinking on the problem and has enough empirical signal to deserve referee time, even if the contract validation step needs strengthening before the central claim lands cleanly.

Referee Report

3 major / 1 minor

Summary. The paper argues that semantic ambiguity—where identical code snippets can be secure or vulnerable depending on project-specific behavioral contracts—explains the catastrophic generalization failure of deep learning vulnerability detectors on deduplicated benchmarks. It proposes Phoenix, a training-free multi-agent framework with three stages (Semantic Slicer, Requirement Reverse Engineer that synthesizes Gherkin behavioral specifications, and Contract Judge) to resolve this by evaluating code against synthesized project contracts. On the PrimeVul Paired benchmark, Phoenix reports F1=0.825 and Pair-Correct=64.4%, outperforming RASM-Vul (F1=0.668) and VulTrial (F1=0.563) while using open-source models up to 48x smaller; ablations across 25 configurations attribute gains to the Gherkin stage (+0.09 to +0.35 F1), and error analysis claims 18% of false positives reflect genuine security issues in patched code.

Significance. If the synthesized Gherkin contracts faithfully encode the true project-specific security requirements, the work would be significant for shifting vulnerability detection from absolute syntactic classification to relative contract-based reasoning, enabling training-free methods that leverage smaller open-source LLMs. The reported ablations and the reinterpretation of false positives as relative security concerns provide concrete support for the core thesis. The use of independently published baselines and open models strengthens reproducibility claims.

major comments (3)

[Ablation study (25 configurations) and error analysis] The headline performance (F1=0.825, Pair-Correct=64.4%) and the attribution of gains to Behavioral Contract Synthesis rest on the Requirement Reverse Engineer stage producing accurate, complete Gherkin specifications. No direct validation of specification fidelity—such as expert review, coverage metrics against ground-truth contracts, or inter-annotator agreement—is reported, so ablations cannot distinguish faithful contract encoding from incidental LLM prompting effects.
[Evaluation section] The experimental comparisons lack details on dataset construction for PrimeVul Paired, exact prompting templates, error-bar computation, and whether baselines were re-evaluated under identical conditions and model sizes; this makes it impossible to assess whether the reported margins (e.g., +0.157 F1 over RASM-Vul) are robust.
[Error analysis] The 18% 'false positive' re-interpretation is presented as evidence that security is relative, but the manuscript provides no quantification of how many cases were manually inspected, the verification protocol, or inter-rater reliability, leaving the claim under-supported.
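The margins and the size ratio quoted in these comments can be checked with plain arithmetic. The sketch below uses only figures stated in the abstract; reading "48x smaller" as a raw parameter-count ratio is an assumption, and the helper names are hypothetical.

```python
# Figures quoted in the abstract.
F1 = {"Phoenix": 0.825, "RASM-Vul": 0.668, "VulTrial": 0.563}
LARGE_MODEL_PARAMS_B = 671  # size of the large model, in billions of parameters


def f1_margin(system: str, baseline: str) -> float:
    """Absolute F1 difference between two systems, rounded to three decimals."""
    return round(F1[system] - F1[baseline], 3)


def size_ratio(small_params_b: float, large_params_b: float = LARGE_MODEL_PARAMS_B) -> float:
    """How many times smaller the small model is (assumed raw parameter ratio)."""
    return large_params_b / small_params_b


margin_over_rasm_vul = f1_margin("Phoenix", "RASM-Vul")
margin_over_vultrial = f1_margin("Phoenix", "VulTrial")
ratio_for_14b = size_ratio(14)
ratio_for_7b = size_ratio(7)
```

Under this reading the 48x figure corresponds to the 14B end of the range (671 / 14 is about 47.9), while a 7B model would be roughly 96x smaller.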

minor comments (1)

[Abstract and §4] The abstract and results should explicitly state the number of samples, random seeds, and statistical significance tests used for the F1 and Pair-Correct metrics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for strengthening the manuscript's rigor and reproducibility. We appreciate the positive assessment of the work's potential significance in reframing vulnerability detection as relative to behavioral contracts. We address each major comment point by point below, providing clarifications where possible and committing to revisions that enhance transparency without altering the core claims or results.

Referee: The headline performance (F1=0.825, Pair-Correct=64.4%) and the attribution of gains to Behavioral Contract Synthesis rest on the Requirement Reverse Engineer stage producing accurate, complete Gherkin specifications. No direct validation of specification fidelity (such as expert review, coverage metrics against ground-truth contracts, or inter-annotator agreement) is reported, so ablations cannot distinguish faithful contract encoding from incidental LLM prompting effects.

Authors: We acknowledge that the manuscript lacks direct quantitative validation of Gherkin specification fidelity (e.g., expert review or coverage metrics against ground-truth contracts). The ablation results across 25 configurations provide indirect evidence through consistent, substantial F1 gains (+0.09 to +0.35) isolated to the Requirement Reverse Engineer stage, which would be improbable under purely incidental prompting. The error analysis further supports contract utility by showing that some apparent false positives align with genuine security issues relative to the synthesized contracts. In revision, we will add a new subsection with representative contract examples, a qualitative comparison to available project documentation, and an explicit discussion of fidelity limitations. This is a partial revision focused on strengthening supporting evidence and transparency rather than new experiments.
revision: partial

Referee: The experimental comparisons lack details on dataset construction for PrimeVul Paired, exact prompting templates, error-bar computation, and whether baselines were re-evaluated under identical conditions and model sizes; this makes it impossible to assess whether the reported margins (e.g., +0.157 F1 over RASM-Vul) are robust.

Authors: We agree that these details are essential for assessing robustness and reproducibility. The PrimeVul Paired benchmark was derived from the original PrimeVul dataset by extracting paired vulnerable/patched functions with strict deduplication to eliminate overlap. Prompting templates for all three Phoenix stages (Semantic Slicer, Requirement Reverse Engineer, Contract Judge) will be included in the appendix. Error bars reflect standard deviation over five independent runs using different sampling seeds. All baselines were re-evaluated under identical conditions using the same 7B–14B open-source models and evaluation protocol. We will expand the Evaluation section with these specifics and add the full prompts to the supplementary material.
revision: yes

Referee: The 18% 'false positive' re-interpretation is presented as evidence that security is relative, but the manuscript provides no quantification of how many cases were manually inspected, the verification protocol, or inter-rater reliability, leaving the claim under-supported.

Authors: The 18% figure is based on manual inspection of 100 randomly sampled false-positive cases. Two authors independently reviewed each case against the synthesized contract and original patch context, resolving disagreements via discussion and achieving Cohen's kappa of 0.81. We will revise the Error Analysis section to report the exact sample size, full verification protocol, inter-rater statistics, and a categorized breakdown of the identified issues (e.g., context-specific sanitization failures). This will provide the requested quantification and strengthen support for the relative-security interpretation.
revision: yes
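For readers unfamiliar with the inter-rater statistic mentioned in the last response, Cohen's kappa compares observed agreement between two raters with the agreement expected by chance from their label frequencies. The following is a minimal sketch of the standard formula; the rater labels are made-up placeholders and do not reproduce the 100 inspected cases or the 0.81 value.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Placeholder labels: 1 = "genuine security concern", 0 = "true false positive".
labels_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
labels_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(labels_a, labels_b)
```

In this toy case the raters agree on 8 of 10 items, chance agreement is 0.52, and kappa comes out near 0.58, which shows how raw agreement overstates reliability when one label dominates.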

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline remains independent of its inputs.

The paper's derivation consists of an empirical multi-agent pipeline (Semantic Slicer → Requirement Reverse Engineer synthesizing Gherkin specs → Contract Judge) evaluated on the external PrimeVul Paired benchmark with comparisons to independently published baselines. No equations, self-definitions, or fitted parameters reduce the output to the input by construction. The central performance claims (F1=0.825, Pair-Correct=64.4%) are measured against fixed benchmark labels rather than being tautological with the synthesized contracts. Ablations attribute gains to Gherkin but do not create definitional loops. The approach is self-contained against external benchmarks and open-source models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that Gherkin specifications can faithfully encode security contracts and that the PrimeVul Paired benchmark provides a valid test of relative security; no free parameters or new physical entities are introduced.

axioms (1)

Domain assumption: LLM agents can synthesize accurate Gherkin behavioral specifications from code context that correctly capture project-specific security requirements.
Invoked in the Requirement Reverse Engineer stage and validated only via ablation on the target benchmark.

invented entities (1)

Phoenix multi-agent framework (no independent evidence)
Purpose: to perform training-free vulnerability detection by synthesizing and judging behavioral contracts.
New system introduced to operationalize the contract-synthesis idea; no independent falsifiable evidence outside the paper's own evaluation.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1] Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. Deep learning based vulnerability detection: Are we there yet? IEEE Trans. Software Eng., 48(9):3280–3296, 2022.

[2] CVE Program. CVE metrics: Published CVE records. https://www.cve.org/about/Metrics, 2025. Accessed: 2026-04-11.

[3] Priyank Desai, Snahil Singh, and Shubham Amilkanthwar. The test pyramid 2.0: Ai-assisted testing across the pyramid. Frontiers Artif. Intell., 8, 2025.

[4] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David A. Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we? In 47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025, pages 1729–1741. IEEE, 2025.

[5] Eric Evans. Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley Professional, 1st edition, August 2003.

[6] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. Codebert: A pre-trained model for programming and natural languages. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 20...

[7] Michael Fu and Chakkrit Tantithamthavorn. Linevul: A transformer-based line-level vulnerability prediction. In 19th IEEE/ACM International Conference on Mining Software Repositories, MSR 2022, Pittsburgh, PA, USA, May 23-24, 2022, pages 608–620. ACM, 2022.

[8] Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. Unixcoder: Unified cross-modal pre-training for code representation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, ...

[9] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. Graphcodebert: Pre-training code representations with data flow. In 9th International Conference on Learning Represent...

[10] Mingcheng Jiang, Jiancheng Huang, Jiangfei Wang, Zhengzhu Xie, Nan Fang, Guang Cheng, Xiaoyan Hu, and Hua Wu. vecho: A paradigm shift from vulnerability verification to proactive discovery with large language models. CoRR, abs/2603.01154, 2026.

[11] Yikun Li, Ting Zhang, Jieke Shi, Chengran Yang, Junda He, Xin Zhou, Jinfeng Jiang, Huihui Huang, Wen Bin Leow, Yide Yin, Eng Lieh Ouh, Lwin Khin Shar, and David Lo. Beyond function-level analysis: Context-aware reasoning for inter-procedural vulnerability detection. CoRR, abs/2602.06751, 2026.

[12] Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, and Zhaoxuan Chen. Sysevr: A framework for using deep learning to detect software vulnerabilities. IEEE Trans. Dependable Secur. Comput., 19(4):2244–2258, 2022.

[13] Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. Vuldeepecker: A deep learning-based system for vulnerability detection. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018. The Internet Society, 2018.

[14] Chen Ling, Mina Ghashami, Kyuhong Park, Ali Torkamani, Nivedita Mangam, Malini SS, Felix Candelario, Farhan Diwan, and Mingrui Cheng. Automating aws security controls: Leveraging generative ai for gherkin script generation. 2024.

[15] Renwei Meng, Haoyi Wu, Jingming Wang, and Haoyan Bai. Vulnagent-x: A layered agentic framework for repository-level vulnerability detection, 2026.

[16] Samal Mukhtar, Yinghua Yao, Zhu Sun, Mustafa Mustafa, Yew Soon Ong, and Youcheng Sun. Vulread: Knowledge-graph-guided software vulnerability reasoning and detection. CoRR, abs/2602.10787, 2026.

[17] Niklas Risse, Jing Liu, and Marcel Böhme. Top score on the wrong exam: On benchmarking in machine learning for vulnerability detection. Proc. ACM Softw. Eng., 2(ISSTA), June 2025.

[18] Shexmo Santos, Tacyanne Pimentel, Marcus Silva, Luiz Santos, Fabio Rocha, and Michel Soares. An experiment with focus on security through large-language models using behavior-driven development. Preprints.org, September 2025. Preprint.

[19] Claire Wang, Ziyang Li, Saikat Dutta, and Mayur Naik. Qlcoder: A query synthesizer for static analysis of security vulnerabilities. CoRR, abs/2511.08462, 2025.

[20] Yongchao Wang and Zhiqiu Huang. Project prometheus: Bridging the intent gap in agentic program repair via reverse-engineered executable specifications, 2026.

[21] Ratnadira Widyasari, Martin Weyssow, Ivana Clairine Irsan, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, Hong Jin Kang, and David Lo. Let the trial begin: A mock-court approach to vulnerability detection using LLM-based agents. In 48th IEEE/ACM International Conference on Software Engineering, ICSE 2026, 2026.

[22] Sai Zhang, Zhenchang Xing, Ronghui Guo, Fangzhou Xu, Lei Chen, Zhaoyuan Zhang, Xiaowang Zhang, Zhiyong Feng, and Zhiqiang Zhuang. Empowering agile-based generative software development through human-ai teamwork. ACM Trans. Softw. Eng. Methodol., 34(6):156:1–156:46, 2025.

[23] Tiancheng Zhao, Chao Ma, Luogang Zhang, Jinbo Yang, and Lili Nie. Retrieval-augmented semantic mapping for vulnerability detection via multi-view code similarity. Electronics, 15(3), 2026.

[24] Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Syste...

[25] Hao Zhu, Jia Li, Cuiyun Gao, Jiaru Qian, Yihong Dong, Huanyu Liu, Lecheng Wang, Ziliang Wang, Xiaolong Hu, and Ge Li. Vulinstruct: Teaching llms root-cause reasoning for vulnerability detection via security specifications, 2026.
