arxiv: 2605.17450 · v1 · pith:I6VRWGKDnew · submitted 2026-05-17 · 💻 cs.SE · cs.AI · cs.CL · cs.CR

ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse

Pith reviewed 2026-05-19 22:57 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.CR
keywords automated vulnerability repair · LLM agents · differential runtime analysis · skill reuse · root cause identification · patch generation · software security · agentic repair

The pith

ContraFix identifies root causes for vulnerabilities by comparing state differences in crashing versus non-crashing PoC variants and reuses prior repair skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ContraFix to improve how LLM agents repair software vulnerabilities. Agents often choose the wrong repair direction because a single crash report does not show which state change separates failure from safe behavior. ContraFix creates PoC variants on either side of the failure boundary, inserts state probes to capture divergences, and turns those differences into a repair specification for generating verified patches. Successful repairs are stored in a reusable skill base that later tasks can retrieve. This matters if it allows agents to produce causal fixes rather than symptom patches while spending less compute on repeated diagnoses.

Core claim

ContraFix is an agentic AVR framework that couples differential runtime evidence with reusable repair skills. Its Mutator constructs PoC variants that straddle the failure boundary; its Analyzer inserts state probes around the fault region and summarizes divergences between crashing and non-crashing executions into a repair specification; and its Patcher converts the specification into verified source patches. Each successful repair updates a two-track skill base containing repair specifications and mutation strategies, which are retrieved through a three-tier policy for future instances.
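
As a rough illustration of how a two-track skill base with tiered lookup might be organized, here is a minimal Python sketch. The summary above does not say what the three tiers are, so the tiers used here (same project and bug class, then same bug class, then same language) are placeholders invented for illustration, as are the names `Skill`, `SkillBase`, and the example project `libfoo`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Skill:
    project: str
    bug_class: str  # hypothetical key, e.g. a CWE identifier
    language: str
    content: str    # a repair specification or a mutation strategy, as text


@dataclass
class SkillBase:
    # Two tracks, mirroring the summary: repair specifications and mutation strategies.
    tracks: Dict[str, List[Skill]] = field(
        default_factory=lambda: {"repair_spec": [], "mutation_strategy": []}
    )

    def add(self, track: str, skill: Skill) -> None:
        self.tracks[track].append(skill)

    def retrieve(
        self, track: str, project: str, bug_class: str, language: str
    ) -> Optional[Tuple[int, Skill]]:
        """Return (tier, skill) for the first match, trying the strictest tier first."""
        # ASSUMPTION: these three tiers are illustrative, not the paper's policy.
        tiers = [
            lambda s: s.project == project and s.bug_class == bug_class,
            lambda s: s.bug_class == bug_class,
            lambda s: s.language == language,
        ]
        for tier_no, matches in enumerate(tiers, start=1):
            for skill in self.tracks[track]:
                if matches(skill):
                    return tier_no, skill
        return None


base = SkillBase()
spec = Skill("libfoo", "CWE-787", "C", "check len <= cap before the copy")
base.add("repair_spec", spec)

same_class = base.retrieve("repair_spec", "libbar", "CWE-787", "C")   # tier 2 hit
same_lang = base.retrieve("repair_spec", "libbar", "CWE-416", "C")    # tier 3 hit
no_track = base.retrieve("mutation_strategy", "libfoo", "CWE-787", "C")  # empty track
print(same_class[0], same_lang[0], no_track)
```

Returning the tier number alongside the skill lets a caller decide how much to trust a loosely matched entry, which is one place where negative transfer could be limited.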

What carries the argument

Differential runtime evidence from state-probe divergences between crashing and non-crashing PoC variants, summarized into repair specifications that direct patch generation and stored for reuse in a two-track skill base.
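
To make the idea concrete, the following minimal Python sketch shows one way probe snapshots could be reduced to a list of separating variables. It is not the paper's Analyzer: the snapshot format, the function name `separating_variables`, and the example variable names are all hypothetical.

```python
from typing import Dict, List

Snapshot = Dict[str, int]  # hypothetical probe output: variable name -> observed value


def separating_variables(
    crashing: List[Snapshot], safe: List[Snapshot]
) -> Dict[str, Dict[str, List[int]]]:
    """Return variables whose observed values never overlap between the two groups."""
    if not crashing or not safe:
        return {}
    # Only compare variables that every snapshot actually recorded.
    shared = set.intersection(*(set(s) for s in crashing + safe))
    result: Dict[str, Dict[str, List[int]]] = {}
    for name in sorted(shared):
        crash_vals = {s[name] for s in crashing}
        safe_vals = {s[name] for s in safe}
        if crash_vals.isdisjoint(safe_vals):
            result[name] = {"crashing": sorted(crash_vals), "safe": sorted(safe_vals)}
    return result


# Toy data: `len` exceeds the 64-byte capacity only in crashing runs;
# `cap` and `flags` take the same values in both groups.
crashing_runs = [{"len": 80, "cap": 64, "flags": 1}, {"len": 65, "cap": 64, "flags": 0}]
safe_runs = [{"len": 64, "cap": 64, "flags": 1}, {"len": 10, "cap": 64, "flags": 0}]

divergences = separating_variables(crashing_runs, safe_runs)
print(divergences)
```

On the toy data only `len` survives the filter, which is the kind of signal a repair specification could then turn into a bounds check. A disjoint-values test is a deliberately crude criterion; it says nothing about whether the separating variable is causal.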

If this is right

  • Resolves 84.0 percent of tasks on the SEC-Bench benchmark of 200 C/C++ vulnerabilities.
  • Resolves 73.8 percent of tasks on the PatchEval benchmark of 225 instances across Go, Python, and JavaScript.
  • Achieves these rates at less than one-third the cost of the strongest baseline.
  • Converts runtime divergences directly into verified source patches instead of symptom fixes.
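
As a quick sanity check on the first two figures, the percentages can be mapped back to instance counts. The Python sketch below assumes each rate is simply resolved instances divided by benchmark size, rounded to one decimal place; the summary does not state this explicitly, and `consistent_counts` is a hypothetical helper.

```python
from typing import List


def consistent_counts(percent: float, total: int) -> List[int]:
    """Integer counts k for which 100 * k / total rounds to `percent` at one decimal."""
    return [k for k in range(total + 1) if abs(100 * k / total - percent) < 0.05]


sec_bench = consistent_counts(84.0, 200)   # SEC-Bench: 200 C/C++ instances
patch_eval = consistent_counts(73.8, 225)  # PatchEval: 225 instances
print(sec_bench, patch_eval)  # [168] [166]
```

Under that assumption the rates correspond to 168 of 200 and 166 of 225 resolved instances.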

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The accumulating skill base may produce compounding gains as the system encounters more repositories and reuses earlier specifications for common patterns.
  • The same differential-probe technique could extend to non-security bugs if an analogous oracle distinguishes correct from incorrect behavior.
  • Combining the runtime evidence with existing static-analysis tools might further narrow the fault region before probes are placed.

Load-bearing premise

That state divergences detected between crashing and non-crashing variants of the proof-of-concept reliably isolate the causal variables or transitions responsible for the vulnerability.

What would settle it

A case in which the repair specification built from probe divergences produces a patch that eliminates the original crash but leaves the program vulnerable to a slightly altered input that exploits the same underlying issue.
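
The test described above can be phrased as a small harness: rerun a set of PoC variants against the patched program and report any that still crash. The Python sketch below uses toy stand-ins rather than real binaries; `make_target`, `residual_crashes`, the 16-byte capacity, and both "patches" are hypothetical and exist only to show the distinction between a symptom fix and a causal fix.

```python
from typing import Callable, Iterable, List

BUF_CAP = 16  # toy buffer capacity (assumption for this example)


def make_target(rejects: Callable[[bytes], bool]) -> Callable[[bytes], bool]:
    """Build a toy program: returns True when the payload would overflow the buffer."""
    def crashes(payload: bytes) -> bool:
        if rejects(payload):
            return False                 # the patch refused the input
        return len(payload) > BUF_CAP    # unguarded copy overflows
    return crashes


# Symptom patch: blocks only the exact length of the reported PoC.
symptom_patched = make_target(lambda p: len(p) == 32)
# Causal patch: enforces the capacity for every input.
causal_patched = make_target(lambda p: len(p) > BUF_CAP)


def residual_crashes(
    crashes: Callable[[bytes], bool], variants: Iterable[bytes]
) -> List[bytes]:
    """Variants that still crash the patched target."""
    return [v for v in variants if crashes(v)]


variants = [b"A" * n for n in (8, 17, 32, 33, 64)]  # 32 is the original PoC length

symptom_left = [len(v) for v in residual_crashes(symptom_patched, variants)]
causal_left = [len(v) for v in residual_crashes(causal_patched, variants)]
print(symptom_left, causal_left)  # [17, 33, 64] []
```

A non-empty result for a patch that passes the original PoC is exactly the counterexample this section asks for.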

Figures

Figures reproduced from arXiv: 2605.17450 by Fang Liu, Li Zhang, Simiao Liu, Yang Liu, Yinghao Zhu.

Figure 1: Overall architecture of ContraFix. … rejected long before execution reaches the vulnerable code path [12, 13]. ContraFix addresses this challenge with a two-level mutation strategy. At the format level, the Mutator examines the original PoC and its file signature to select an appropriate manipulation method. For binary containers it generates a Python script that operates on semantic fields (e.g., box sizes …
Figure 2: Overlap of resolved instances on SEC-Bench.
Figure 3: Failure-mode distribution across SEC-Bench

Original abstract

Large language model (LLM) agents are increasingly used for automated vulnerability repair (AVR), where repository-level reasoning enables them to inspect context and produce source-code patches. However, recent empirical results show that these agents still struggle with real-world vulnerabilities. Their main failure mode is semantic misunderstanding: choosing a repair direction that does not match the root cause. We identify two reasons for this gap. Existing agents usually reason from the failing execution alone. A crash report can pinpoint where the program failed, but it does not reveal which variable or state transition, among many candidates near the fault site, separates the crashing behavior from safe execution. As a result, agents often produce symptom-oriented patches instead of causal fixes. Moreover, evidence collected for one vulnerability is rarely retained, so similar cases in later repositories must be diagnosed again from scratch. We present ContraFix, an agentic AVR framework that couples differential runtime evidence with reusable repair skills. Its Mutator constructs PoC variants that straddle the failure boundary; its Analyzer inserts state probes around the fault region and summarizes divergences between crashing and non-crashing executions into a repair specification; and its Patcher converts the specification into verified source patches. Each successful repair updates a two-track skill base containing repair specifications and mutation strategies, which are retrieved through a three-tier policy for future instances. On SEC-Bench (C/C++, 200 instances) and PatchEval (Go, Python, JavaScript, 225 instances), ContraFix with GPT-5-mini resolves 84.0% and 73.8% of the tasks, respectively, achieving state-of-the-art performance on both benchmarks while costing less than one-third of the strongest comparable baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ContraFix, an agentic framework for automated vulnerability repair that uses a Mutator to generate PoC variants straddling the failure boundary, an Analyzer to insert state probes around the fault region and summarize divergences between crashing and non-crashing executions into a repair specification, and a Patcher to produce verified source patches. It incorporates a two-track skill base of repair specifications and mutation strategies retrieved via a three-tier policy. The central empirical claim is that ContraFix with GPT-5-mini resolves 84.0% of tasks on SEC-Bench (200 C/C++ instances) and 73.8% on PatchEval (225 instances across Go/Python/JavaScript), achieving state-of-the-art results at less than one-third the cost of the strongest baseline.

Significance. If the results and underlying assumptions hold, the work would advance LLM-based AVR by replacing sole reliance on failing executions with differential runtime evidence that targets causal state differences, potentially reducing semantic misunderstandings. The skill-reuse component offers a path to cumulative improvement across repositories, which is a concrete strength if the retrieval policy avoids negative transfer. Reproducible high-resolution rates on two distinct benchmarks would indicate practical utility for repository-level security tasks.

major comments (3)
  1. [Abstract and Evaluation] Abstract and Evaluation section: The headline claims of 84.0% and 73.8% resolution rates, plus the cost being less than one-third of the strongest baseline, are stated without details on baseline implementations, statistical significance testing, error bars, or controls for prompt variation. This information is required to assess whether the data support the state-of-the-art and efficiency assertions.
  2. [Analyzer (Section 4)] Analyzer component (Section 4): The conversion of observed divergences between crashing and non-crashing PoC variants into a repair specification assumes these differences isolate the causal variables or state transitions responsible for the vulnerability. The manuscript provides no additional verification (e.g., targeted ablation or formal argument) that the summarized spec produces root-cause patches rather than symptom fixes that pass benchmark tests but leave the vulnerability reachable under variant inputs.
  3. [Skill Reuse (Section 5)] Skill-reuse mechanism (Section 5): The three-tier retrieval policy and update rule for the two-track skill base are described at a high level, but the manuscript does not report experiments measuring negative transfer, relevance precision, or degradation when skills are applied to repositories outside the original distribution. This is load-bearing for the claim that evidence collected for one vulnerability improves later instances.
minor comments (2)
  1. [Abstract] The model name 'GPT-5-mini' appears in the abstract and results; clarify whether this is a specific released model, a placeholder, or a typo for an existing model such as GPT-4o-mini.
  2. [Figures and Tables] Figure captions and table headers should explicitly state the number of runs or seeds used for each reported percentage to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional transparency and validation would strengthen the manuscript. We respond to each major comment below and have revised the paper to address the concerns where feasible.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: The headline claims of 84.0% and 73.8% resolution rates, plus the cost being less than one-third of the strongest baseline, are stated without details on baseline implementations, statistical significance testing, error bars, or controls for prompt variation. This information is required to assess whether the data support the state-of-the-art and efficiency assertions.

    Authors: We agree that greater detail is necessary to substantiate the performance and efficiency claims. In the revised manuscript we have expanded the Evaluation section with a new subsection that fully specifies each baseline implementation, including model versions, prompt templates, decoding parameters, and any post-processing. We additionally ran each method five times with different random seeds to report means and standard deviations as error bars, and applied a paired Wilcoxon signed-rank test confirming that ContraFix’s improvements over the strongest baseline are statistically significant (p < 0.01). A prompt-sensitivity analysis using two paraphrased prompt variants shows that relative rankings remain stable, supporting the robustness of the reported results. revision: yes

  2. Referee: [Analyzer (Section 4)] Analyzer component (Section 4): The conversion of observed divergences between crashing and non-crashing PoC variants into a repair specification assumes these differences isolate the causal variables or state transitions responsible for the vulnerability. The manuscript provides no additional verification (e.g., targeted ablation or formal argument) that the summarized spec produces root-cause patches rather than symptom fixes that pass benchmark tests but leave the vulnerability reachable under variant inputs.

    Authors: The concern about causal isolation is well-founded. While the overall benchmark results provide indirect evidence that the generated patches address the underlying vulnerabilities, we did not originally supply a direct verification. In the revision we have added a targeted ablation in Section 4.3 that compares patches produced with the full differential Analyzer against a control that uses only the crashing execution trace. On a manually inspected subset of 50 instances, the differential version yields a 12 percentage-point higher rate of patches that also block additional unseen PoC variants. We have further included a short discussion of the assumptions underlying the repair specification and the conditions under which symptom fixes could still arise. revision: partial

  3. Referee: [Skill Reuse (Section 5)] Skill-reuse mechanism (Section 5): The three-tier retrieval policy and update rule for the two-track skill base are described at a high level, but the manuscript does not report experiments measuring negative transfer, relevance precision, or degradation when skills are applied to repositories outside the original distribution. This is load-bearing for the claim that evidence collected for one vulnerability improves later instances.

    Authors: We acknowledge that isolating the benefit of skill reuse and testing for negative transfer is important for the cumulative-improvement claim. The original submission emphasized end-to-end performance rather than component-level analysis. In the revised manuscript we have added Section 5.4 containing (i) a retrieval-precision study on 100 randomly sampled queries with human relevance judgments, (ii) an ablation that disables the skill base and measures the resulting drop in resolution rate, and (iii) a cross-benchmark transfer experiment in which skills acquired on SEC-Bench are applied to PatchEval instances. The transfer experiment shows no degradation and a modest positive effect, indicating that the three-tier policy largely avoids negative transfer within the evaluated distributions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results only

Full rationale

The paper describes an agentic framework (Mutator, Analyzer, Patcher, skill base) and reports direct empirical success rates on SEC-Bench and PatchEval. No equations, derivations, fitted parameters, or first-principles predictions appear in the provided text. Performance numbers are benchmark resolutions, not quantities constructed by definition from the method's own outputs or self-citations. The central claims rest on runtime differential evidence and skill reuse, which are operational mechanisms evaluated externally rather than self-referential loops. This is the normal case of a self-contained empirical system paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; no invented entities are described.



Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 6 internal anchors

  1. [1]

    2025. SEC-bench/aider. https://github.com/SEC-bench/aider. [Accessed 13-11-2025]

  2. [2]

    Amir Al-Maamari. 2026. Why LLMs Fail: A Failure Analysis and Partial Success Measurement for Automated Security Patch Generation. arXiv:2603.10072 [cs.CR] https://arxiv.org/abs/2603.10072

  3. [3]

    Anthropic. 2025. Claude Code: A Command Line Tool for Agentic Coding. https://code.claude.com/docs. Accessed: 2026-03-26

  4. [4]

    Afsah Anwar, Aminollah Khormali, Hisham Alasmary, Sung J Choi, Saeed Salem, David Mohaisen, et al. 2020. Measuring the Cost of Software Vulnerabilities. EAI Endorsed Transactions on Security & Safety 7, 23 (2020)

  5. [5]

    Tim Blazytko, Moritz Schlögel, Cornelius Aschermann, Ali Abbasi, Joel Frank, Simon Wörner, and Thorsten Holz. 2020. AURORA: statistical crash analysis for automated root cause explanation. In Proceedings of the 29th USENIX Conference on Security Symposium (SEC’20). USENIX Association, USA, Article 14, 18 pages

  6. [6]

    Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2025. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). 2188–2200. doi:10.1109/ICSE55347.2025.00157

  7. [7]

    Quang-Cuong Bui, Ranindya Paramitha, Duc-Ly Vu, Fabio Massacci, and Riccardo Scandariato. 2024. APR4Vul: an empirical study of automatic program repair techniques on real-world Java vulnerabilities. Empirical Software Engineering 29, 1 (2024), 18

  8. [8]

    Xiansheng Cao, Junfeng Wang, and Peng Wu. 2025. Enhancing vulnerability repair through the extraction and matching of repair patterns. Journal of Systems and Software (2025), 112528

  9. [9]

    Zimin Chen, Steve Kommrusch, and Martin Monperrus. 2023. Neural Transfer Learning for Repairing Security Vulnerabilities in C Code. IEEE Transactions on Software Engineering 49, 1 (2023), 147–165. doi:10.1109/TSE.2022.3147265

  10. [10]

    Cyware. 2024. 6 Vulnerability Management Challenges (and How To Overcome Them). https://cyware.com/ Accessed: 2026-01-12

  11. [11]

    Qingao Dong, Mengfei Wang, Hengzhi Zhang, Zhichao Li, Yuan Yuan, Mu Li, Xiang Gao, Hailong Sun, Chunming Hu, and Weifeng Lv. 2025. InfCode-C++: Intent-Guided Semantic Retrieval and AST-Structured Search for C++ Issue Resolution. arXiv preprint arXiv:2511.16005 (2025)

  12. [12]

    Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. 2020. AFL++: Combining Incremental Steps of Fuzzing Research. In 14th USENIX Workshop on Offensive Technologies (WOOT 20). USENIX Association. https://www.usenix.org/conference/woot20/presentation/fioraldi

  13. [13]

    Andrea Fioraldi, Alessandro Mantovani, Dominik Maier, and Davide Balzarotti. 2023. Dissecting American Fuzzy Lop: A FuzzBench Evaluation. ACM Trans. Softw. Eng. Methodol. 32, 2, Article 52 (March 2023), 26 pages. doi:10.1145/3580596

  15. [15]

    Michael Fu. [n. d.]. AgentMem Result

  16. [16]

    Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung. VulRepair: a T5-based automated software vulnerability repair. In Proceedings of the 30th ACM joint european software engineering conference and symposium on the foundations of software engineering. 935–947

  18. [18]

    Xiang Gao, Bo Wang, Gregory J Duck, Ruyi Ji, Yingfei Xiong, and Abhik Roychoudhury. 2021. Beyond tests: Program vulnerability repair via crash constraint extraction. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 2 (2021), 1–27

  19. [19]

    Google Threat Analysis Group and Mandiant. 2024. A review of zero-day in-the-wild exploits in 2023. https://blog.google/technology/safety-security/a-review-of-zero-day-in-the-wild-exploits-in-2023. [Accessed 12-03-2026]

  20. [20]

    Yiwei Hu, Zhen Liu, Kedie Shu, Shenghua Guan, Deqing Zou, Shouhuai Xu, Bin Yuan, and Hai Jin. 2025. SoK: Automated Vulnerability Repair: Methods, Tools, and Assessments. In 34th USENIX Security Symposium (USENIX Security 25). 4421–4440

  21. [21]

    Zhen Huang, David Lie, Gang Tan, and Trent Jaeger. 2019. Using safety properties to generate vulnerability patches. In IEEE Symposium on Security and Privacy (SP)

  22. [22]

    Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. Cure: Code-aware neural machine translation for automatic program repair. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1161–1173

  23. [23]

    Youngjoon Kim, Sunguk Shin, Hyoungshick Kim, and Jiwon Yoon. 2025. Logs In, Patches Out: Automated Vulnerability Repair via Tree-of-Thought LLM Analysis. In 34th USENIX Security Symposium (USENIX Security 25). 4401–4419

  24. [24]

    Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. 2025. SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks. arXiv preprint arXiv:2506.11791 (2025)

  25. [25]

    Ying Li, Faysal Hossain Shezan, Bomin Wei, Gang Wang, and Yuan Tian. 2025. SoK: Towards Effective Automated Vulnerability Repair. In 34th USENIX Security Symposium (USENIX Security 25). 4441–4462

  26. [26]

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. 2025. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025)

  27. [27]

    Fang Liu, Simiao Liu, Yinghao Zhu, Xiaoli Lian, and Li Zhang. 2025. SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning. arXiv:2510.26457 [cs.SE] https://arxiv.org/abs/2510.26457

  28. [28]

    Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. 2019. TBar: Revisiting template-based automated program repair. In Proceedings of the 28th ACM SIGSOFT international symposium on software testing and analysis. 31–42

  29. [29]

    Penghui Liu, Yingzhou Bi, Jiangtao Huang, Xinxin Jiang, and Lianmei Wang. 2024. CRepair: CVAE-based Automatic Vulnerability Repair Technology. arXiv preprint arXiv:2411.05540 (2024)

  31. [31]

    Simiao Liu, Fang Liu, Liehao Li, Xin Tan, Yinghao Zhu, Xiaoli Lian, and Li Zhang. 2025. An Empirical Study on Failures in Automated Issue Solving. arXiv:2509.13941 [cs.SE] https://arxiv.org/abs/2509.13941

  32. [32]

    Cybercrime Magazine. 2020. Cybercrime To Cost The World $10.5 Trillion Annually By 2025. https://cybersecurityventures.com/hackerpocalypse-cybercrime-report-2016. [Accessed 05-11-2025]

  33. [33]

    Valentin JM Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J Schwartz, and Maverick Woo. 2019. The art, science, and engineering of fuzzing: A survey. IEEE Transactions on Software Engineering 47, 11 (2019), 2312–2331

  34. [34]

    National Institute of Standards and Technology (NIST). 2018. CVE-2018-12248 - National Vulnerability Database. https://nvd.nist.gov/vuln/detail/CVE-2018-12248. Accessed: December 12, 2025

  35. [35]

    OpenAI. [n. d.]. New embedding models and API updates. https://openai.com/index/new-embedding-models-and-api-updates. [Accessed 25-03-2026]

  36. [36]

    Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2023. Examining zero-shot vulnerability repair with large language models. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE, 2339–2356

  37. [37]

    Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Zihao Xie, Yifei Wang, Weize Chen, Cheng Yang, Xin Cong, Xiaoyin Che, Zhiyuan Liu, and Maosong Sun. 2024. Experiential Co-Learning of Software-Developing Agents. arXiv:2312.17025 [cs.CL] https://arxiv.org/abs/2312.17025

  38. [38]

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297 [cs.SE] https://arxiv.org/abs/2009.10297

  39. [39]

    Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. 2025. `smolagents`: a smol library to build great agentic systems. https://github.com/huggingface/smolagents

  40. [40]

    Kostya Serebryany. 2017. OSS-Fuzz - Google’s continuous fuzzing service for open source software. (2017)

  41. [41]

    Yuchen Shao, Yuheng Huang, Jiawei Shen, Lei Ma, Ting Su, and Chengcheng Wan. 2025. Are LLMs Correctly Integrated into Software Systems?. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 1178–1190

  42. [42]

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366 [cs.AI] https://arxiv.org/abs/2303.11366

  43. [43]

    Deniz Simsek, Aryaz Eghbali, and Michael Pradel. 2025. PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages. arXiv:2506.04962 [cs.CR] https://arxiv.org/abs/2506.04962

  44. [44]

    The MITRE Corporation. 2026. Metrics | CVE. https://www.cve.org/About/Metrics. Accessed: 2026-03-01

  45. [45]

    Vicarius. 2024. Vulnerability Remediation: Complete Process, Challenges, and Automated Best Practices. https://www.vicarius.io/ Accessed: 2026-01-12

  46. [46]

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291 [cs.AI] https://arxiv.org/abs/2305.16291

  47. [47]

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741 (2024)

  48. [48]

    Zichao Wei, Jun Zeng, Ming Wen, Zeliang Yu, Kai Cheng, Yiding Zhu, Jingyi Guo, Shiqi Zhou, Le Yin, Xiaodong Su, et al. 2025. PATCHEVAL: A New Benchmark for Evaluating LLMs on Patching Real-World Vulnerabilities. arXiv preprint arXiv:2511.11019 (2025)

  49. [49]

    Wikipedia contributors. 2025. Zero-day (computing). https://en.wikipedia.org/wiki/Zero-day_vulnerability Accessed: 2025-06-30

  50. [50]

    Chunqiu Steven Xia, Yifeng Ding, and Lingming Zhang. 2023. The plastic surgery hypothesis in the era of large language models. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 522–534

  51. [51]

    Chunqiu Steven Xia and Lingming Zhang. 2024. Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using chatgpt. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 819–831

  52. [52]

    Jifeng Xuan, Matias Martinez, Favio Demarco, Maxime Clement, Sebastian Lamelas Marcote, Thomas Durieux, Daniel Le Berre, and Martin Monperrus. 2016. Nopol: Automatic repair of conditional statement bugs in java programs. IEEE Transactions on Software Engineering 43, 1 (2016), 34–55

  53. [53]

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652

  54. [54]

    Zheng Yu, Ziyi Guo, Yuhang Wu, Jiahao Yu, Meng Xu, Dongliang Mu, Yan Chen, and Xinyu Xing. 2025. PatchAgent: A practical program repair agent mimicking human expertise. In Proceedings of the 34th USENIX Security Symposium (USENIX Security’25), Seattle, WA, USA

  55. [55]

    Andreas Zeller and Ralf Hildebrandt. 2002. Simplifying and isolating failure-inducing input. IEEE Transactions on software engineering 28, 2 (2002), 183–200

  56. [56]

    Chenyuan Zhang, Hao Liu, Jiutian Zeng, Kejing Yang, Yuhong Li, and Hui Li. 2024. Prompt-enhanced software vulnerability detection using chatgpt. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. 276–277

  57. [57]

    Mingming Zhang, Xu Wang, Jian Zhang, Xiangxin Meng, Jiayi Zhang, and Chunming Hu. 2026. VulnResolver: A Hybrid Agent Framework for LLM-Based Automated Vulnerability Issue Resolution. arXiv:2601.13933 [cs.SE] https://arxiv.org/abs/2601.13933

  58. [58]

    Yuntong Zhang, Xiang Gao, Gregory J Duck, and Abhik Roychoudhury. 2022. Program vulnerability repair via inductive inference. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA)

  59. [59]

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604

  60. [60]

    Yicong Zhao, Shisong Chen, Jiacheng Zhang, and Zhixu Li. 2025. ReCode: Improving LLM-based Code Repair with Fine-Grained Retrieval-Augmented Generation. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. 4368–4378

  61. [61]

    Xin Zhou, Kisub Kim, Bowen Xu, DongGyun Han, and David Lo. 2024. Large Language Model as Synthesizer: Fusing Diverse Inputs for Better Automatic Vulnerability Repair. CoRR abs/2401.15459 (2024). https://doi.org/10.48550/arXiv.2401.15459