pith. machine review for the scientific record.

arxiv: 2604.11950 · v1 · submitted 2026-04-13 · 💻 cs.SE · cs.AI · cs.CL · cs.CR

Recognition: unknown

AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection


Pith reviewed 2026-05-10 15:55 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.CR
keywords proof-of-concept generation · LLM bug detection · multi-agent validation · software testing · bug report verification · automated test synthesis · regression test creation

The pith

AnyPoC generates executable proof-of-concept tests to automatically validate and confirm bugs reported by LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve the problem that LLM bug reports are only hypotheses requiring manual checks. It proposes a framework that turns those reports into concrete, runnable tests by fact-checking them, synthesizing PoCs through repeated execution and trace analysis, and then re-running the tests independently. A sympathetic reader would care because this removes the main barrier to scaling automated bug finding: without reliable validation, even good detectors stay impractical on real codebases. The approach is shown to work on twelve large systems spanning multiple languages and to outperform direct use of coding agents.

Core claim

AnyPoC is a multi-agent system that first fact-checks a candidate bug report, then iteratively builds and runs a PoC while logging execution traces, and finally re-executes the PoC under independent scrutiny to guard against hallucination or reward hacking. It also maintains an evolving knowledge base of successful PoCs. When applied to reports from a simple agentic reporter, the system produces 1.3 times more valid PoCs on true bugs and rejects 9.8 times more false reports than state-of-the-art coding agents, resulting in 122 newly discovered bugs of which 105 have been confirmed.

What carries the argument

The three-stage multi-agent pipeline of fact-checking, trace-guided iterative PoC synthesis, and independent re-execution, augmented by an accumulating PoC knowledge base.
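The control flow of that pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `validate_report`, `Verdict`, and the four callables (`fact_check`, `synthesize`, `execute`, `reexecute`) are hypothetical names, the agents are reduced to plain functions, and the knowledge base is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical executor signature: PoC text in, (bug_triggered, trace) out.
Executor = Callable[[str], Tuple[bool, str]]


@dataclass
class Verdict:
    accepted: bool
    stage: str                      # stage that produced the decision
    poc: Optional[str] = None
    traces: List[str] = field(default_factory=list)


def validate_report(
    report: str,
    fact_check: Callable[[str], bool],
    synthesize: Callable[[str, List[str]], str],
    execute: Executor,
    reexecute: Executor,
    max_rounds: int = 3,            # assumed budget, not a value from the paper
) -> Verdict:
    """Run one candidate bug report through the three stages."""
    # Stage 1: drop reports whose claims do not survive fact-checking.
    if not fact_check(report):
        return Verdict(False, "fact-check")

    # Stage 2: synthesize a PoC, run it, and feed the traces back in.
    traces: List[str] = []
    poc: Optional[str] = None
    for _ in range(max_rounds):
        candidate = synthesize(report, traces)
        triggered, trace = execute(candidate)
        traces.append(trace)
        if triggered:
            poc = candidate
            break
    if poc is None:
        return Verdict(False, "synthesis", traces=traces)

    # Stage 3: an independent executor re-runs the PoC; the generator's
    # own success claim is not trusted.
    triggered, trace = reexecute(poc)
    traces.append(trace)
    return Verdict(triggered, "re-execution", poc if triggered else None, traces)
```

Passing `execute` and `reexecute` as separate callables mirrors the idea that the final check should not share state with the generator; how the paper actually isolates the two is not specified on this page.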

If this is right

  • LLM-based bug detectors can be chained directly into confirmed findings without human triage.
  • Generated PoCs can serve as ready-made regression tests, as already happened for 45 cases.
  • The same validation layer can be attached to any existing or future bug reporter.
  • Continuous growth of the PoC knowledge base should improve success rates on new projects over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged validation pattern might transfer to domains outside software, such as hardware design or formal specification errors.
  • If the knowledge base is made project-agnostic, it could reduce the need for per-system tuning.
  • Pairing AnyPoC with stronger base models would likely increase the absolute number of confirmed bugs, but the relative gain over baselines should be measured separately.

Load-bearing premise

The combination of fact-checking, execution-trace feedback, and independent re-execution is sufficient to block LLM hallucination and reward-hacking on bug reports from many different languages and projects.

What would settle it

Apply AnyPoC to a fresh collection of known false-positive bug reports from one of the evaluated systems and measure whether it still rejects the large majority of them.
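The measurement itself is simple to state. A minimal sketch, assuming a hypothetical `accepts` callable that wraps the validator and returns `True` when it emits a PoC for a report:

```python
from typing import Callable, Iterable


def rejection_rate(reports: Iterable[str], accepts: Callable[[str], bool]) -> float:
    """Fraction of known false-positive reports that the validator rejects."""
    reports = list(reports)
    if not reports:
        raise ValueError("need at least one known false-positive report")
    rejected = sum(1 for report in reports if not accepts(report))
    return rejected / len(reports)
```

Because every input is a known false positive, any accepted report counts directly against the validator; the threshold for "large majority" is left to the reader.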

Figures

Figures reproduced from arXiv: 2604.11950 by Chenyuan Yang, Lingming Zhang, Weidong Wang, Yihan Yang, Zijie Zhao, Ziqi Zhang.

Figure 1: Computation scalability of bug finding.
Context (1 Introduction): Automated bug detection based on Large Language Models (LLMs) has been extensively studied in recent years [2, 10, 16, 44, 55, 59, 62, 66]. Different from traditional static analysis [20, 33, 49] and dynamic testing [6, 18, 21, 46], recent LLM agents can autonomously explore the codebase and detect bugs in large systems. Those LLM-based bug detection t…
Figure 2: Overview of the AnyPoC framework.
Context: … exploration, taking up a large portion of the limited context window. By separating the task, the subsequent generator can focus on the challenging task of PoC generation and benefit from a summary of the analysis results.
Figure 3: Example trajectory of the bug analysis agent.
Figure 4: Example trajectory of the PoC generator agent.
Figure 6: An example knowledge base snapshot.
Context: From our empirical experience during bug finding, we design a few high-level knowledge categories that are most useful for PoC generation. Namely, the categories include Command Line Tools, Build System, Internal Tools, Test Frameworks, Code, and PoC Format.
Figure 7: Bug type distribution. ICE: internal compiler error, TC: type confusion, NPD: null pointer dereference, SO: stack overflow, DL: deadlock, OOB: out-of-bounds, UBI: use-before-initialization, Int O/U: integer over/underflow, Assert.: assertion failure.
Context: This flexibility even allows AnyPoC to automatically find bugs in domain-specific languages (DSL). For example, LLVM uses a DSL called TableGen [34]. Within t…
Figure 8: Bug example in Firefox’s JavaScript JIT engine.
Figure 9: Knowledge usage counts and average ratings.
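The Figure 6 and Figure 9 captions together suggest a knowledge base organized by category, with per-entry usage counts and ratings. One way such a store could be modeled is sketched below. The six category names come from the Figure 6 text; everything else (`Entry`, `KnowledgeBase`, `record_use`, integer ratings) is a hypothetical illustration, since the paper's actual schema is not shown on this page.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Category(Enum):
    # Category names as listed in the Figure 6 text.
    COMMAND_LINE_TOOLS = "Command Line Tools"
    BUILD_SYSTEM = "Build System"
    INTERNAL_TOOLS = "Internal Tools"
    TEST_FRAMEWORKS = "Test Frameworks"
    CODE = "Code"
    POC_FORMAT = "PoC Format"


@dataclass
class Entry:
    category: Category
    note: str
    ratings: List[int] = field(default_factory=list)  # one rating per use (assumed)

    @property
    def usage_count(self) -> int:
        return len(self.ratings)

    @property
    def average_rating(self) -> Optional[float]:
        # None until the entry has been used at least once.
        return sum(self.ratings) / len(self.ratings) if self.ratings else None


class KnowledgeBase:
    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def add(self, category: Category, note: str) -> Entry:
        entry = Entry(category, note)
        self.entries.append(entry)
        return entry

    def record_use(self, entry: Entry, rating: int) -> None:
        entry.ratings.append(rating)

    def by_category(self, category: Category) -> List[Entry]:
        return [e for e in self.entries if e.category is category]
```

Deriving the usage count from the list of ratings keeps the two statistics consistent by construction, at the cost of assuming every use is rated.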
Original abstract

While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesizing an executable proof-of-concept test, or simply a PoC - such as a script, command sequence, or crafted input - to trigger the suspected defect. Automated PoC generation can act as a scalable validation oracle, enabling end-to-end autonomous bug detection by providing concrete execution evidence. However, naive LLM agents are unreliable validators: they are biased toward "success" and may reward-hack by producing plausible but non-functional PoCs or even hallucinated traces. To address this, we present AnyPoC, a general multi-agent framework that (1) analyzes and fact-checks a candidate bug report, (2) iteratively synthesizes and executes a PoC while collecting execution traces, and (3) independently re-executes and scrutinizes the PoC to mitigate hallucination and reward hacking. In addition, AnyPoC also continuously extracts and evolves a PoC knowledge base to handle heterogeneous tasks. AnyPoC operates on candidate bug reports regardless of their source and can be paired with different bug reporters. To demonstrate practicality and generality, we apply AnyPoC, with a simple agentic bug reporter, on 12 critical software systems across diverse languages/domains (many with millions of lines of code) including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis. Compared to the state-of-the-art coding agents, e.g., Claude Code and Codex, AnyPoC produces 1.3x more valid PoCs for true-positive bug reports and rejects 9.8x more false-positive bug reports. To date, AnyPoC has discovered 122 new bugs (105 confirmed, 86 already fixed), with 45 generated PoCs adopted as official regression tests.
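The closing counts in the abstract are easier to compare as shares of the 122 reported bugs. The snippet below is only arithmetic on the abstract's numbers; it assumes each adopted PoC corresponds to a distinct bug, which the abstract does not state.

```python
# Counts taken from the abstract.
NEW_BUGS = 122
CONFIRMED = 105
FIXED = 86
REGRESSION_TESTS = 45


def percent(part: int, whole: int) -> float:
    """Share of `whole` as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


shares = {
    "confirmed": percent(CONFIRMED, NEW_BUGS),
    "fixed": percent(FIXED, NEW_BUGS),
    # Assumes one adopted PoC per bug; labeled here as an assumption.
    "regression_tests": percent(REGRESSION_TESTS, NEW_BUGS),
}
print(shares)  # {'confirmed': 86.1, 'fixed': 70.5, 'regression_tests': 36.9}
```

So roughly 86% of the reported bugs are confirmed and about 70% are already fixed, which leaves 17 reports without confirmation at the time of writing.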

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AnyPoC, a multi-agent framework for synthesizing executable proof-of-concept (PoC) tests from LLM-generated candidate bug reports. It combines fact-checking of reports, iterative PoC synthesis using execution traces, and independent re-execution to reduce hallucination and reward-hacking, while maintaining an evolving PoC knowledge base. The framework is claimed to be general across bug report sources and is evaluated on 12 large real-world systems (e.g., Firefox, Chromium, LLVM) using a simple agentic bug reporter, outperforming baselines like Claude Code and Codex by producing 1.3x more valid PoCs on true positives and rejecting 9.8x more false positives. It reports discovering 122 new bugs (105 confirmed, 86 fixed) with 45 PoCs adopted as regression tests.

Significance. If the evaluation criteria prove reproducible and non-circular, AnyPoC could meaningfully advance scalable, end-to-end automated bug detection by turning static LLM reports into validated, executable evidence. The scale of the evaluation on production systems with millions of lines of code, the concrete bug-discovery count, and the adoption of generated PoCs as official tests are concrete strengths that would support practical impact in software engineering. The cross-language generality and knowledge-base evolution are also positive features that distinguish it from narrower test-generation approaches.

major comments (2)
  1. [Abstract and Evaluation] Abstract and Evaluation section: The headline claims of 1.3x more valid PoCs for true-positive reports and 9.8x more rejections of false-positive reports rest on unstated criteria for (a) what constitutes a 'valid PoC' and (b) how true/false-positive labels are assigned independently of AnyPoC's own fact-checker. Because the bug reporter is itself an LLM agent, any dependence on AnyPoC's acceptance for labeling would introduce circularity that inflates both multipliers; the manuscript provides no quantitative breakdown of ground-truth sources, inter-rater agreement, or cases where re-execution passed but developers later invalidated the PoC.
  2. [Framework and Evaluation] Framework (steps 1-3) and Evaluation: The description of fact-checking, iterative synthesis with traces, and independent re-execution is high-level and does not include failure-mode analysis or metrics demonstrating that the combination reliably prevents hallucination and reward-hacking across the 12 heterogeneous systems and languages. Without such evidence or ablation results, it is unclear whether new failure modes are introduced, which directly affects the soundness of the reported gains and the 122-bug discovery claim.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from an explicit statement of the exact baseline agents, the precise definition of 'valid PoC' used in the 1.3x metric, and the statistical significance tests applied to the reported multipliers.
  2. [Evaluation] A table or figure presenting the per-system breakdown of PoC validity rates, false-positive rejection rates, and bug-discovery counts would improve clarity and allow readers to assess consistency across domains.
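The inter-rater agreement requested in major comment 1 is commonly reported as Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. A minimal sketch for two raters follows; the choice of kappa and the function name `cohens_kappa` are assumptions of this illustration, not something the referee or the paper specifies.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

With toy labels `["TP", "TP", "FP", "FP"]` and `["TP", "FP", "FP", "FP"]`, the raters agree on three of four items (p_o = 0.75), chance agreement is 0.5, and kappa is 0.5. These labels are illustrative only and do not come from the paper.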

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below with clarifications and commit to revisions that will improve the transparency and rigor of the evaluation and framework description.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: The headline claims of 1.3x more valid PoCs for true-positive reports and 9.8x more rejections of false-positive reports rest on unstated criteria for (a) what constitutes a 'valid PoC' and (b) how true/false-positive labels are assigned independently of AnyPoC's own fact-checker. Because the bug reporter is itself an LLM agent, any dependence on AnyPoC's acceptance for labeling would introduce circularity that inflates both multipliers; the manuscript provides no quantitative breakdown of ground-truth sources, inter-rater agreement, or cases where re-execution passed but developers later invalidated the PoC.

    Authors: We appreciate the concern about potential circularity. Valid PoCs are defined as executable artifacts that reproduce the reported bug behavior via successful execution and trace verification. True/false-positive labels for input reports were assigned using external sources including project bug trackers and developer confirmations (105 of the 122 reported bugs were developer-confirmed). We will revise the Evaluation section to explicitly define 'valid PoC', provide a quantitative breakdown of ground-truth sources, and note any post-validation discrepancies. Since labeling relied on objective execution outcomes and external confirmations rather than subjective multi-rater assessment, inter-rater agreement statistics were not computed; we will add this as an explicit limitation. revision: yes

  2. Referee: [Framework and Evaluation] Framework (steps 1-3) and Evaluation: The description of fact-checking, iterative synthesis with traces, and independent re-execution is high-level and does not include failure-mode analysis or metrics demonstrating that the combination reliably prevents hallucination and reward-hacking across the 12 heterogeneous systems and languages. Without such evidence or ablation results, it is unclear whether new failure modes are introduced, which directly affects the soundness of the reported gains and the 122-bug discovery claim.

    Authors: We agree that more detailed evidence is needed. We will revise the Framework and Evaluation sections to include ablation studies quantifying the contribution of fact-checking, trace-guided iteration, and independent re-execution to valid PoC rates and false-positive rejection across all 12 systems. We will also add a failure-mode analysis subsection describing observed issues (e.g., incomplete traces in certain languages) and how the evolving knowledge base mitigates them. Of the 122 discovered bugs, 105 have received external developer confirmation, and the revised analysis will examine whether the component combination introduces new failure modes that would undermine the claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external projects with direct agent comparisons

Full rationale

The paper presents an empirical framework evaluated on 12 real-world open-source systems (Firefox, Chromium, LLVM, etc.) with direct quantitative comparisons to external baselines such as Claude Code and Codex. No equations, fitted parameters, or first-principles derivations are present in the provided text. Reported gains (1.3x valid PoCs, 9.8x false-positive rejections) and bug counts are framed as outcomes of execution-based validation on independent codebases rather than quantities defined in terms of the system's own outputs or prior self-citations. The evaluation chain does not reduce any claimed result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework relies on standard assumptions about LLM prompting and execution environments but introduces no new mathematical axioms or fitted parameters; the knowledge base is extracted rather than postulated as an independent entity.

axioms (1)
  • domain assumption: LLM agents can be reliably prompted to perform fact-checking, code synthesis, and trace analysis for bug reports
    This underpins the entire multi-agent workflow described in the abstract.

pith-pipeline@v0.9.0 · 5683 in / 1319 out tokens · 30124 ms · 2026-05-10T15:55:35.410089+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agentic Vulnerability Reasoning on Windows COM Binaries

    cs.CR 2026-05 accept novelty 7.0

    SLYP agentic pipeline discovers race condition vulnerabilities in Windows COM binaries and generates debugger-verified PoCs, scoring 0.973 F1 on a 40-case benchmark and finding 28 new confirmed vulnerabilities in prod...

Reference graph

Works this paper leans on

68 extracted references · 21 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Toufique Ahmed, Jatin Ganhotra, Rangeet Pan, Avraham Shinnar, Saurabh Sinha, and Martin Hirzel. 2025. Otter: Generating Tests from Issues to Validate SWE Patches. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=b0jYs6JOZu

  2. [2]

    Anthropic. 2025. Automated Security Reviews in Claude Code. Claude Help Center. https://support.claude.com/en/articles/11932705-automated-security-reviews-in-claude-code

  3. [3]

    Anthropic. 2025. Claude Code. https://github.com/anthropics/claude-code

  4. [4]

    Anthropic. 2026. Code Review - Claude Code Docs. https://code.claude.com/docs/en/code-review

  5. [5]

    Anthropic. 2026. Partnering with Mozilla to improve Firefox’s security. https://www.anthropic.com/news/mozilla-firefox-security

  6. [6]

    Abhishek Arya, Oliver Chang, Jonathan Metzman, Kostya Serebryany, and Dongge Liu. 2016. OSS-Fuzz. https://github.com/google/oss-fuzz

  7. [7]

    Bytecode Alliance. 2026. Wasmtime: A lightweight WebAssembly runtime that is fast, secure, and standards-compliant. https://github.com/bytecodealliance/wasmtime

  8. [8]

    Longfei Chen, Ruibin Yan, Taiyu Wong, Yiyang Chen, and Chao Zhang. 2025. SmartPoC: Generating Executable and Validated PoCs for Smart Contract Bug Reports. arXiv preprint arXiv:2511.12993 (2025)

  9. [9]

    Foundry Contributors. 2026. Foundry: A blazing fast, portable and modular toolkit for Ethereum application development written in Rust. https://github.com/foundry-rs/foundry

  10. [10]

    Cursor. 2026. Bugbot. https://cursor.com/bugbot

  11. [11]

    O. J. Dahl, E. W. Dijkstra, and C. A. R. Hoare (Eds.). 1972. Structured programming. Academic Press Ltd., GBR

  12. [12]

    Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: an efficient SMT solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (Budapest, Hungary) (TACAS’08/ETAPS’08). Springer-Verlag, Berlin, Heidelberg, 337–340

  13. [13]

    Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT international symposium on software testing and analysis. 423–435

  14. [14]

    Will Dietz, Peng Li, John Regehr, and Vikram Adve. 2012. Understanding Integer Overflow in C/C++. In Proceedings of the 2012 International Conference on Software Engineering (Zurich, Switzerland) (ICSE 2012). IEEE Press, Piscataway, NJ, USA, 760–770. http://dl.acm.org/citation.cfm?id=2337223.2337313

  15. [15]

    Xueying Du, Jiayi Feng, Yi Zou, Wei Xu, Jie Ma, Wei Zhang, Sisi Liu, Xin Peng, and Yiling Lou. 2026. Reducing False Positives in Static Bug Detection with LLMs: An Empirical Study in Industry. arXiv:2601.18844 [cs.SE] https://arxiv.org/abs/2601.18844

  16. [16]

    Xueying Du, Geng Zheng, Kaixin Wang, Jiayi Feng, Wentai Deng, Mingwei Liu, Bihuan Chen, Xin Peng, Tao Ma, and Yiling Lou. 2024. Vul-rag: Enhancing llm-based vulnerability detection via knowledge-level rag. arXiv preprint arXiv:2406.11147 (2024)

  17. [17]

    Mafalda Ferreira, Miguel Monteiro, Tiago Brito, Miguel E. Coimbra, Nuno Santos, Limin Jia, and José Fragoso Santos. 2024. Efficient Static Vulnerability Analysis for JavaScript with Multiversion Dependency Graphs. Proc. ACM Program. Lang. 8, PLDI, Article 164 (June 2024), 25 pages. doi:10.1145/3656394

  18. [18]

    Gordon Fraser and Andrea Arcuri. 2011. Evosuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. 416–419

  19. [19]

    GitHub. 2025. Copilot code review now generally available. GitHub Changelog. https://github.blog/changelog/2025-04-04-copilot-code-review-now-generally-available/

  20. [20]

    GitHub. 2026. CodeQL. https://codeql.github.com/

  21. [21]

    Google. 2026. Syzkaller. https://github.com/google/syzkaller/

  22. [22]

    Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, and Xiangyu Zhang. 2025. RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=TXcifVbFpG

  23. [23]

    Ahmad Hazimeh, Adrian Herrera, and Mathias Payer. 2020. Magma: A ground-truth fuzzing benchmark. Proceedings of the ACM on Measurement and Analysis of Computing Systems 4, 3 (2020), 1–29

  24. [24]

    Yuchen Ji, Ting Dai, Zhichao Zhou, Yutian Tang, and Jingzhu He. 2025. Artemis: Toward Accurate Detection of Server-Side Request Forgeries through LLM-Assisted Inter-Procedural Path-Sensitive Taint Analysis. Proceedings of the ACM on Programming Languages 9, OOPSLA1 (2025), 1349–1377

  25. [25]

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=VTF8yNQM66

  26. [26]

    René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: a database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (San Jose, CA, USA) (ISSTA 2014). Association for Computing Machinery, New York, NY, USA, 437–440. doi:10.1145/2610384.2628055

  27. [27]

    Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large Language Models are Few-Shot Testers: Exploring LLM-Based General Bug Reproduction. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 2312–2323. doi:10.1109/ICSE48619.2023.00194

  28. [28]

    Jon Kaplan. 2026. Building a better Bugbot. Cursor Blog. https://cursor.com/blog/building-bugbot

  29. [29]

    George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating fuzz testing. In Proceedings of the 2018 ACM SIGSAC conference on computer and communications security. 2123–2138

  30. [30]

    Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. 2025. SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=QQhQIqons0

  31. [31]

    Haonan Li, Yu Hao, Yizhuo Zhai, and Zhiyun Qian. 2024. Enhancing static analysis for practical bug detection: An llm-integrated approach. Proceedings of the ACM on Programming Languages 8, OOPSLA1 (2024), 474–499

  32. [32]

    Ziyang Li, Saikat Dutta, and Mayur Naik. 2024. IRIS: LLM-assisted static analysis for detecting security vulnerabilities. arXiv preprint arXiv:2405.17238 (2024)

  33. [33]

    LLVM Project. 2026. Clang Static Analyzer. https://clang-analyzer.llvm.org/

  34. [34]

    LLVM Project. 2026. TableGen Overview. https://llvm.org/docs/TableGen/

  35. [35]

    Filipe Marques, Mafalda Ferreira, André Nascimento, Miguel E. Coimbra, Nuno Santos, Limin Jia, and José Fragoso Santos. 2025. Automated Exploit Generation for Node.js Packages. Proc. ACM Program. Lang. 9, PLDI, Article 201 (June 2025), 26 pages. doi:10.1145/3729304

  36. [36]

    Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. 2024. Large Language Model guided Protocol Fuzzing. In NDSS

  37. [37]

    Jonathan Metzman, László Szekeres, Laurent Simon, Read Sprabery, and Abhishek Arya. 2021. Fuzzbench: an open fuzzer benchmarking platform and service. In Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 1393–1403

  38. [38]

    Mozilla Foundation. 2026. SpiderMonkey JavaScript/WebAssembly Engine. https://spidermonkey.dev/

  39. [39]

    Nicholas Nethercote and Julian Seward. 2007. Valgrind: a framework for heavyweight dynamic binary instrumentation. SIGPLAN Not. 42, 6 (June 2007), 89–100. doi:10.1145/1273442.1250746

  40. [40]

    Vikram Nitin, Baishakhi Ray, and Roshanak Zilouchian Moghaddam. 2025. FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents. arXiv:2507.15241 [cs.SE] https://arxiv.org/abs/2507.15241

  41. [41]

    Peter W. O’Hearn. 2019. Incorrectness logic. Proc. ACM Program. Lang. 4, POPL, Article 10 (Dec. 2019), 32 pages. doi:10.1145/3371078

  42. [42]

    OpenAI. 2025. Introducing Aardvark: OpenAI’s agentic security researcher. https://openai.com/index/introducing-aardvark/

  43. [43]

    OpenAI. 2025. OpenAI Codex CLI. https://github.com/openai/codex

  44. [44]

    OpenAI. 2026. Codex Security: now in research preview. https://openai.com/index/codex-security-now-in-research-preview/

  45. [45]

    Xianfei Ou, Cong Li, Yanyan Jiang, and Chang Xu. 2024. The mutators reloaded: Fuzzing compilers with large language model generated mutation operators. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4. 298–312

  46. [46]

    Carlos Pacheco and Michael D Ernst. 2007. Randoop: feedback-directed random testing for Java. In Companion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion. 815–816

  47. [47]

    Laura Plein, Wendkûuni C Ouédraogo, Jacques Klein, and Tegawendé F Bissyandé. 2024. Automatic generation of test cases based on bug reports: a feasibility study with large language models. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. 360–361

  49. [49]

    Irtaza Sajid Qureshi, Zhen Ming, et al. 2025. Test Case Generation from Bug Reports via Large Language Models: A Cognitive Layered Evaluation Framework. arXiv preprint arXiv:2510.05365 (2025)

  50. [50]

    Semgrep. 2026. Semgrep. https://github.com/semgrep/semgrep

  51. [51]

    Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. 2012. AddressSanitizer: A fast address sanity checker. In 2012 USENIX annual technical conference (USENIX ATC 12). 309–318

  52. [52]

    Deniz Simsek, Aryaz Eghbali, and Michael Pradel. 2025. PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages. arXiv:2506.04962 [cs.CR] https://arxiv.org/abs/2506.04962

  53. [53]

    Dokyung Song, Julian Lettner, Prabhu Rajasekaran, Yeoul Na, Stijn Volckaert, Per Larsen, and Michael Franz. 2019. SoK: Sanitizing for security. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 1275–1295

  54. [54]

    Richard Stallman, Roland Pesch, Stan Shebs, et al. 2025. Debugging with GDB: The GNU Source-Level Debugger. Free Software Foundation, Boston, MA. https://www.gnu.org/software/gdb/documentation/ GDB Version 17.1.

  55. [55]

    Evgeniy Stepanov and Konstantin Serebryany. 2015. MemorySanitizer: fast detector of uninitialized memory use in C++. In 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 46–55

  56. [56]

    Big Sleep team. 2024. From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code. https://googleprojectzero.blogspot.com/2024/10/from-naptime-to-big-sleep.html

  57. [57]

    Chris Thunes. 2020. javalang: Pure Python Java parser and tools. https://github.com/c2nes/javalang

  58. [58]

    Claire Wang, Ziyang Li, Saikat Dutta, and Mayur Naik. 2025. QLCoder: A Query Synthesizer For Static Analysis of Security Vulnerabilities. arXiv preprint arXiv:2511.08462 (2025)

  59. [59]

    Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song. 2025. CyberGym: Evaluating AI Agents’ Cybersecurity Capabilities with Real-World Vulnerabilities at Scale. arXiv:2506.02548 [cs.CR] https://arxiv.org/abs/2506.02548

  60. [60]

    Qiushi Wu, Yue Xiao, Dhilung Kirat, Kevin Eykholt, Jiyong Jang, and Douglas Lee Schales. 2025. One Bug, Hundreds Behind: LLMs for Large-Scale Bug Discovery. arXiv:2510.14036 [cs.SE] https://arxiv.org/abs/2510.14036

  61. [61]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489 (2024)

  62. [62]

    Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4all: Universal fuzzing with large language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

  63. [63]

    Chenyuan Yang, Zijie Zhao, Zichen Xie, Haoyu Li, and Lingming Zhang. 2025. KNighter: Transforming Static Analysis with LLM-Synthesized Checkers. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (Seoul, Republic of Korea) (SOSP ’25). Association for Computing Machinery, New York, NY, USA. doi:10.1145/3731569.3764827

  64. [64]

    Chenyuan Yang, Zijie Zhao, and Lingming Zhang. 2025. KernelGPT: Enhanced Kernel Fuzzing via Large Language Models (ASPLOS ’25). Association for Computing Machinery, New York, NY, USA, 560–573. doi:10.1145/3676641.3716022

  65. [65]

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://arxiv.org/abs/2405.15793

  66. [66]

    Andy K Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Julian Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Haoxiang Yang, Aolin Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Kenny O Oseleononmen, Dan Boneh, ... Ho, and Percy Liang

  67. [67]

    Xin Zhou, Ting Zhang, and David Lo. 2024. Large language model for vulnerability detection: Emerging results and future directions. In Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results. 47–51

  68. [68]

    Tarek Ziadé, Ian Cordasco, and Anthony Sottile. 2024. flake8: Your Tool for Style Guide Enforcement. https://github.com/PyCQA/flake8. Version 7.1.1