Security Is Relative: Training-Free Vulnerability Detection via Multi-Agent Behavioral Contract Synthesis
Pith reviewed 2026-05-10 03:16 UTC · model grok-4.3
The pith
Vulnerability detection succeeds by synthesizing each project's security contract from code context rather than classifying syntax in isolation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phoenix resolves semantic ambiguity in vulnerability detection by decomposing the task into Semantic Slicer, Requirement Reverse Engineer, and Contract Judge stages. The Requirement Reverse Engineer synthesizes Gherkin behavioral specifications that encode the project-specific security contract from limited code context; the Contract Judge then performs strict compliance checking to decide whether code is vulnerable relative to that contract. On PrimeVul Paired this yields an F1 of 0.825 and a Pair-Correct rate of 64.4 percent while using open-source models of 7-14B parameters, up to 48 times smaller than the 671B-parameter models behind the baselines.
What carries the argument
Behavioral Contract Synthesis: a three-stage multi-agent process that extracts minimal vulnerability-relevant context, produces Gherkin behavioral specifications encoding the project security contract, and evaluates strict compliance of the code against those specifications.
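The three stages can be pictured as a simple chain of model calls. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `LLM` callable type, the prompt prefixes (`SLICE`, `SPECIFY`, `JUDGE`), the `VIOLATION`/`COMPLIANT` reply convention, and the `fake_llm` stand-in are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a model reply.
LLM = Callable[[str], str]


@dataclass
class Verdict:
    vulnerable: bool
    contract: str   # the synthesized Gherkin text
    rationale: str  # the judge's raw reply


def semantic_slicer(llm: LLM, function_code: str, context: str) -> str:
    """Stage 1: ask for the minimal vulnerability-relevant context."""
    return llm(f"SLICE\n{context}\n---\n{function_code}")


def requirement_reverse_engineer(llm: LLM, sliced: str) -> str:
    """Stage 2: ask for a Gherkin specification of the security contract."""
    return llm(f"SPECIFY\n{sliced}")


def contract_judge(llm: LLM, function_code: str, contract: str) -> Verdict:
    """Stage 3: strict compliance check of the code against the contract."""
    reply = llm(f"JUDGE\n{contract}\n---\n{function_code}")
    # Assumed reply convention: the first line is VIOLATION or COMPLIANT.
    lines = reply.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    return Verdict(vulnerable=(first == "VIOLATION"), contract=contract, rationale=reply)


def detect(llm: LLM, function_code: str, context: str = "") -> Verdict:
    """Run the three stages in order and return the judge's verdict."""
    sliced = semantic_slicer(llm, function_code, context)
    contract = requirement_reverse_engineer(llm, sliced)
    return contract_judge(llm, function_code, contract)


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in so the sketch runs without a real model."""
    if prompt.startswith("SLICE"):
        return prompt.split("---\n", 1)[1]
    if prompt.startswith("SPECIFY"):
        return "Scenario: bounded copy\n  Then the copy length must not exceed the buffer size"
    code = prompt.split("---\n", 1)[1]
    return "COMPLIANT" if "if (n >" in code else "VIOLATION\nno bounds check"
```

Calling `detect(fake_llm, "memcpy(dst, src, n);")` exercises all three stages. With a real model, only `fake_llm` would be swapped out; the key design point is that the judge sees the synthesized contract, not a global notion of "vulnerable".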
If this is right
- Global classification is fundamentally inadequate once project-specific behavioral contracts are taken into account.
- Gherkin specifications are the decisive driver, delivering F1 gains of 0.09 to 0.35 across 25 ablation configurations.
- 18 percent of the system's false positives identify genuine security concerns present in the patched versions of the code.
- Training-free multi-agent methods using 7-14B open-source models can surpass trained systems that rely on 671B-parameter models on paired vulnerability data.
Where Pith is reading between the lines
- Contract synthesis could be embedded in code-review tools to generate and enforce explicit security expectations automatically.
- Vulnerability benchmarks should include behavioral context and paired secure/vulnerable versions to reflect real deployment conditions.
- The relative view of security may generalize to other context-dependent properties such as performance or compatibility.
- Reliable synthesized contracts could support automated security testing suites that check code against documented rules.
Load-bearing premise
LLM-based agents can reliably synthesize accurate and complete Gherkin behavioral specifications that encode the true project-specific security contract from limited code context without human oversight.
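To make the premise concrete, it helps to see what a synthesized contract might look like. The sketch below is an assumption-laden illustration: the `CONTRACT` text is invented here and does not come from the paper, and `parse_steps` and `obligations` are hypothetical helpers. Only the Gherkin step keywords (`Given`, `When`, `Then`, `And`, `But`) are standard.

```python
# Hypothetical contract text, written for illustration; not taken from the paper.
CONTRACT = """\
Feature: Packet length handling
  Scenario: Oversized payload is rejected
    Given a receive buffer of fixed capacity
    When a packet declares a payload longer than that capacity
    Then the parser must reject the packet
    And no bytes may be written past the end of the buffer
"""

STEP_KEYWORDS = ("Given", "When", "Then", "And", "But")


def parse_steps(gherkin: str) -> list[tuple[str, str]]:
    """Return (keyword, text) pairs for each step line, in document order."""
    steps = []
    for line in gherkin.splitlines():
        stripped = line.strip()
        for kw in STEP_KEYWORDS:
            if stripped.startswith(kw + " "):
                steps.append((kw, stripped[len(kw) + 1:]))
                break
    return steps


def obligations(gherkin: str) -> list[str]:
    """Collect Then steps plus any And/But steps that continue them."""
    out, in_then = [], False
    for kw, text in parse_steps(gherkin):
        if kw == "Then":
            in_then = True
        elif kw in ("Given", "When"):
            in_then = False
        if in_then:
            out.append(text)
    return out
```

The `Then` clauses are the part a compliance check would have to verify, which is why the accuracy and completeness of those clauses carries so much weight in the premise above.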
What would settle it
Measure the match between Phoenix-synthesized Gherkin specifications and independently authored project security contracts on a held-out set of code pairs; if detection accuracy collapses when the synthesized contracts diverge from the true ones, the central claim is falsified.
Original abstract
Deep learning for vulnerability detection has shown promising results on early benchmarks, but recent evaluations reveal catastrophic degradation: models achieving F1 > 0.68 on legacy datasets collapse to 0.031 under strict deduplication. We identify the root cause as the semantic ambiguity problem: identical code can be secure or vulnerable depending on project-specific behavioral contracts, rendering global classification fundamentally inadequate. We propose Phoenix, a training-free multi-agent framework that resolves this ambiguity through Behavioral Contract Synthesis. Phoenix decomposes detection into three stages: a Semantic Slicer extracting minimal vulnerability-relevant context, a Requirement Reverse Engineer synthesizing Gherkin behavioral specifications encoding the security contract, and a Contract Judge evaluating code against these specifications via strict compliance checking. On PrimeVul Paired, Phoenix achieves F1 = 0.825 and Pair-Correct = 64.4%, surpassing RASM-Vul (F1 = 0.668) and VulTrial (F1 = 0.563) while using open-source models up to 48x smaller (7-14B vs. 671B). Ablation across 25 configurations demonstrates Gherkin specifications as the decisive driver (+0.09 to +0.35 F1). Error analysis reveals 18% of "False Positives" identify genuine security concerns in patched code, demonstrating that security is a relative property defined against behavioral contracts, not an absolute property of code syntax.
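The two headline metrics can be computed from per-pair predictions. This is a minimal sketch, assuming that Pair-Correct counts a pair only when the vulnerable version is flagged and its patched counterpart is not; the summary here does not define the metric, so that reading is an assumption. The function names and the toy `pairs` data in the usage note are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Binary F1 with label 1 = vulnerable."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pair_correct(pairs):
    """pairs: list of (pred_on_vulnerable, pred_on_patched), 1 = flagged vulnerable."""
    if not pairs:
        return 0.0
    hits = sum(1 for pv, pp in pairs if pv == 1 and pp == 0)
    return hits / len(pairs)


def flatten(pairs):
    """Expand pairs into label and prediction lists suitable for f1_score."""
    y_true, y_pred = [], []
    for pv, pp in pairs:
        y_true += [1, 0]
        y_pred += [pv, pp]
    return y_true, y_pred
```

For example, with the made-up predictions `pairs = [(1, 0), (1, 1), (0, 0), (1, 0)]`, `pair_correct(pairs)` counts two of four pairs, while `f1_score(*flatten(pairs))` treats the eight functions independently. The gap between the two numbers is what makes the paired metric the stricter one.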
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that semantic ambiguity—where identical code snippets can be secure or vulnerable depending on project-specific behavioral contracts—explains the catastrophic generalization failure of deep learning vulnerability detectors on deduplicated benchmarks. It proposes Phoenix, a training-free multi-agent framework with three stages (Semantic Slicer, Requirement Reverse Engineer that synthesizes Gherkin behavioral specifications, and Contract Judge) to resolve this by evaluating code against synthesized project contracts. On the PrimeVul Paired benchmark, Phoenix reports F1=0.825 and Pair-Correct=64.4%, outperforming RASM-Vul (F1=0.668) and VulTrial (F1=0.563) while using open-source models up to 48x smaller; ablations across 25 configurations attribute gains to the Gherkin stage (+0.09 to +0.35 F1), and error analysis claims 18% of false positives reflect genuine security issues in patched code.
Significance. If the synthesized Gherkin contracts faithfully encode the true project-specific security requirements, the work would be significant for shifting vulnerability detection from absolute syntactic classification to relative contract-based reasoning, enabling training-free methods that leverage smaller open-source LLMs. The reported ablations and the reinterpretation of false positives as relative security concerns provide concrete support for the core thesis. The use of independently published baselines and open models strengthens reproducibility claims.
major comments (3)
- [Ablation study (25 configurations) and error analysis] The headline performance (F1=0.825, Pair-Correct=64.4%) and the attribution of gains to Behavioral Contract Synthesis rest on the Requirement Reverse Engineer stage producing accurate, complete Gherkin specifications. No direct validation of specification fidelity—such as expert review, coverage metrics against ground-truth contracts, or inter-annotator agreement—is reported, so ablations cannot distinguish faithful contract encoding from incidental LLM prompting effects.
- [Evaluation section] The experimental comparisons lack details on dataset construction for PrimeVul Paired, exact prompting templates, error-bar computation, and whether baselines were re-evaluated under identical conditions and model sizes; this makes it impossible to assess whether the reported margins (e.g., +0.157 F1 over RASM-Vul) are robust.
- [Error analysis] The 18% 'false positive' re-interpretation is presented as evidence that security is relative, but the manuscript provides no quantification of how many cases were manually inspected, the verification protocol, or inter-rater reliability, leaving the claim under-supported.
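The margins quoted in these comments can be checked directly from the reported figures. The snippet below is a small arithmetic sketch; the variable names are hypothetical, and reading the "48x" ratio as 671B against the largest 14B model is an assumption, since the summary does not spell out which model size the ratio refers to.

```python
# Figures as quoted in the abstract and the referee report.
PHOENIX_F1 = 0.825
BASELINE_F1 = {"RASM-Vul": 0.668, "VulTrial": 0.563}

# F1 margin of Phoenix over each baseline, rounded to three decimals.
margins = {name: round(PHOENIX_F1 - f1, 3) for name, f1 in BASELINE_F1.items()}

# Assumption: "up to 48x smaller" compares 671B parameters with the 14B model.
size_ratio_14b = 671 / 14
size_ratio_7b = 671 / 7
```

The RASM-Vul margin reproduces the +0.157 cited above, and 671 / 14 rounds to 48. None of this addresses the referee's actual concern, which is whether the margins hold under matched evaluation conditions.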
minor comments (1)
- [Abstract and §4] The abstract and results should explicitly state the number of samples, random seeds, and statistical significance tests used for the F1 and Pair-Correct metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important areas for strengthening the manuscript's rigor and reproducibility. We appreciate the positive assessment of the work's potential significance in reframing vulnerability detection as relative to behavioral contracts. We address each major comment point by point below, providing clarifications where possible and committing to revisions that enhance transparency without altering the core claims or results.
- Referee: The headline performance (F1=0.825, Pair-Correct=64.4%) and the attribution of gains to Behavioral Contract Synthesis rest on the Requirement Reverse Engineer stage producing accurate, complete Gherkin specifications. No direct validation of specification fidelity—such as expert review, coverage metrics against ground-truth contracts, or inter-annotator agreement—is reported, so ablations cannot distinguish faithful contract encoding from incidental LLM prompting effects.
  Authors: We acknowledge that the manuscript lacks direct quantitative validation of Gherkin specification fidelity (e.g., expert review or coverage metrics against ground-truth contracts). The ablation results across 25 configurations provide indirect evidence through consistent, substantial F1 gains (+0.09 to +0.35) isolated to the Requirement Reverse Engineer stage, which would be improbable under purely incidental prompting. The error analysis further supports contract utility by showing that many apparent false positives align with genuine security issues relative to the synthesized contracts. In revision, we will add a new subsection with representative contract examples, a qualitative comparison to available project documentation, and an explicit discussion of fidelity limitations. This is a partial revision focused on strengthening supporting evidence and transparency rather than new experiments.
  Revision: partial
- Referee: The experimental comparisons lack details on dataset construction for PrimeVul Paired, exact prompting templates, error-bar computation, and whether baselines were re-evaluated under identical conditions and model sizes; this makes it impossible to assess whether the reported margins (e.g., +0.157 F1 over RASM-Vul) are robust.
  Authors: We agree that these details are essential for assessing robustness and reproducibility. The PrimeVul Paired benchmark was derived from the original PrimeVul dataset by extracting paired vulnerable/patched functions with strict deduplication to eliminate overlap. Prompting templates for all three Phoenix stages (Semantic Slicer, Requirement Reverse Engineer, Contract Judge) will be included in the appendix. Error bars reflect standard deviation over five independent runs using different sampling seeds. All baselines were re-evaluated under identical conditions using the same 7B–14B open-source models and evaluation protocol. We will expand the Evaluation section with these specifics and add the full prompts to the supplementary material.
  Revision: yes
- Referee: The 18% 'false positive' re-interpretation is presented as evidence that security is relative, but the manuscript provides no quantification of how many cases were manually inspected, the verification protocol, or inter-rater reliability, leaving the claim under-supported.
  Authors: The 18% figure is based on manual inspection of 100 randomly sampled false-positive cases. Two authors independently reviewed each case against the synthesized contract and original patch context, resolving disagreements via discussion and achieving Cohen's kappa of 0.81. We will revise the Error Analysis section to report the exact sample size, full verification protocol, inter-rater statistics, and a categorized breakdown of the identified issues (e.g., context-specific sanitization failures). This will provide the requested quantification and strengthen support for the relative-security interpretation.
  Revision: yes
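For readers unfamiliar with the inter-rater statistic cited in the last response, Cohen's kappa corrects raw agreement between two raters for the agreement expected by chance. The following is a minimal sketch of the standard formula; the function name and the labels in the usage note are made up for illustration and are not the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

With the invented ratings `a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]` and `b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]`, the raters agree on 8 of 10 items, chance agreement is 0.5, and kappa is 0.6. A reported value of 0.81 would therefore indicate agreement well above chance, although it says nothing about whether both raters share the same bias.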
Circularity Check
No significant circularity; empirical pipeline remains independent of its inputs.
The paper's derivation consists of an empirical multi-agent pipeline (Semantic Slicer → Requirement Reverse Engineer synthesizing Gherkin specs → Contract Judge) evaluated on the external PrimeVul Paired benchmark with comparisons to independently published baselines. No equations, self-definitions, or fitted parameters reduce the output to the input by construction. The central performance claims (F1=0.825, Pair-Correct=64.4%) are measured against fixed benchmark labels rather than being tautological with the synthesized contracts. Ablations attribute gains to Gherkin but do not create definitional loops. The approach is self-contained against external benchmarks and open-source models.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM agents can synthesize accurate Gherkin behavioral specifications from code context that correctly capture project-specific security requirements.
invented entities (1)
- Phoenix multi-agent framework: no independent evidence