pith. machine review for the scientific record.

arxiv: 2604.20801 · v1 · submitted 2026-04-22 · 💻 cs.CR

Recognition: unknown

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:13 UTC · model grok-4.3

classification 💻 cs.CR
keywords multi-agent harnesses · LLM agents · vulnerability discovery · zero-day vulnerabilities · harness synthesis · typed graph DSL · runtime feedback · security automation

The pith

A typed graph DSL and runtime-feedback loop let LLM agents automatically design their own harnesses to find real zero-day vulnerabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the wiring of multiple LLM agents into a coordinated system, called a harness, determines success rates far more than the choice of model alone. Instead of hand-writing these harnesses, AgentFlow searches a joint space of roles, prompts, tools, topologies, and protocols using a typed graph representation, then uses live signals from the running target program to diagnose failures and rewrite the harness. This approach reaches the top score on a public agent benchmark and uncovers previously unknown vulnerabilities in a large, complex codebase. If the method holds, it lowers the barrier to applying agent systems to security tasks that have resisted both human auditors and traditional fuzzers for years.

Core claim

AgentFlow pairs a typed graph DSL covering agent roles, prompts, tools, communication topology, and coordination protocol with an outer feedback loop that reads runtime signals directly from the target program to identify which harness component caused a failure and to guide targeted rewrites. On TerminalBench-2 with Claude Opus 4.6 the system scores 84.3 percent, the highest in the evaluated public leaderboard snapshot; on Google Chrome with Kimi K2.5 it discovers ten previously unknown zero-day vulnerabilities, two of which are critical sandbox-escape flaws.
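The mechanism this claim describes, an outer loop that proposes a harness, executes it, scores the outcome, and diagnoses failures from runtime signals, can be sketched in a few lines. This is a hypothetical reconstruction of the loop as summarized above, not the paper's implementation; `execute`, `score`, `diagnose`, and `rewrite` are assumed interfaces standing in for the real system.

```python
# Hypothetical sketch of a propose-execute-score-diagnose loop,
# reconstructed from the paper's description (not its actual code).

def synthesize_harness(initial, execute, score, diagnose, rewrite, iters=10):
    """Iteratively improve a harness using runtime feedback.

    execute(harness)  -> runtime signals (logs, coverage, crash reports)
    score(signals)    -> float, higher is better
    diagnose(signals) -> name of the harness component blamed for the failure
    rewrite(harness, component) -> new candidate harness with a targeted edit
    """
    best = initial
    best_score = score(execute(initial))
    harness = initial
    for _ in range(iters):
        signals = execute(harness)
        s = score(signals)
        if s > best_score:
            best, best_score = harness, s
        faulty = diagnose(signals)          # e.g. "verifier_prompt"
        harness = rewrite(harness, faulty)  # component-level, not blind, edit
    return best, best_score
```

The key contrast with pass/fail optimizers is the `diagnose` step: the rewrite targets a specific component rather than resampling the whole design.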

What carries the argument

The typed graph DSL that represents the full harness design space together with the feedback-driven outer loop that extracts diagnostic runtime signals from the target to drive iterative rewrites.
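A minimal sketch of what such a typed harness representation might look like. All names here are invented for illustration; the paper's grammar jointly types five components (A, G, Σ, Φ, Ψ: roles, prompts, tools, topology, protocol) and its well-formedness rules are far richer than the single toy check below.

```python
from dataclasses import dataclass, field

# Hypothetical data model for a harness design space covering roles,
# prompts, tools, topology, and protocol (all names assumed).

@dataclass
class Agent:
    name: str
    prompt: str                   # prompt template for this role
    tools: tuple[str, ...] = ()   # tool names this role may call

@dataclass
class Harness:
    agents: dict[str, Agent]
    edges: set[tuple[str, str]]   # (src, dst): src's output is in scope for dst
    protocol: dict[str, int] = field(default_factory=dict)  # e.g. retries per role

    def well_formed(self) -> bool:
        """Toy typing rule: every edge must connect two declared agents."""
        return all(a in self.agents and b in self.agents
                   for a, b in self.edges)
```

Because every proposed edit (adding an agent, rewiring an edge, revoking a tool) produces another value of the same type, the search space stays closed under rewrites, which is what makes the outer loop's edits composable.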

If this is right

  • Harness design can be treated as an optimizable search problem rather than a manual engineering task, raising success rates several-fold on the same underlying model.
  • Agent-based vulnerability discovery becomes practical for large, instrumentable codebases where manual harness construction has been a bottleneck.
  • The same synthesis technique can be applied to other multi-agent coordination tasks that currently rely on hand-crafted wiring.
  • Runtime feedback from the target supplies a richer training signal than coarse pass/fail outcomes, enabling more precise harness adjustments.
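The last bullet, runtime feedback being richer than pass/fail, can be illustrated with a toy rule table mapping signal patterns to the component a rewrite should target. The signal names and rules are invented for illustration; they are not the paper's actual diagnostics.

```python
# Illustrative only: a toy mapping from runtime-signal patterns to the
# harness component a rewrite should target (signal names invented).

DIAGNOSIS_RULES = [
    (lambda s: s.get("crash") == "asan:heap-buffer-overflow", "verifier"),
    (lambda s: s.get("coverage", 1.0) < 0.1, "explorer_prompt"),
    (lambda s: "tool_error" in s.get("log", ""), "tool_bindings"),
]

def diagnose(signals: dict) -> str:
    """Return the component blamed by the first matching rule.

    A bare pass/fail signal would collapse all of these cases into
    'failed', leaving the rewriter no target to act on.
    """
    for matches, component in DIAGNOSIS_RULES:
        if matches(signals):
            return component
    return "unknown"
```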

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the approach scales to even larger targets, it could shift vulnerability research from primarily manual or fuzzer-driven workflows toward automated agent synthesis.
  • The method may transfer to neighboring domains such as automated testing of distributed systems or protocol implementations where coordination topology is critical.
  • One testable extension is to measure how much the discovered vulnerabilities depend on the specific feedback signals versus the breadth of the graph search space alone.

Load-bearing premise

Runtime signals extracted from the target program contain enough diagnostic detail to identify which part of the harness caused a failure and to suggest useful rewrites.

What would settle it

A target program in which different harness errors produce indistinguishable runtime signals, so that the outer loop cannot tell which component to rewrite and therefore cannot improve performance.
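This falsifier can be stated as a small computation: if two distinct harness faults always emit the same signal, no diagnoser can beat chance at attributing them. A hypothetical sketch of that confusability cap:

```python
# Sketch of the confusability argument: when distinct harness faults emit
# identical runtime signals, attribution accuracy is capped at chance.

def attribution_accuracy(faults, emit, diagnose):
    """Fraction of faults whose emitted signal is diagnosed back to them."""
    hits = sum(1 for f in faults if diagnose(emit(f)) == f)
    return hits / len(faults)

# Degenerate target: every fault produces the same opaque signal.
emit = lambda fault: "SIGSEGV"          # indistinguishable by construction
diagnose = lambda signal: "bad_prompt"  # best any fixed rule can do: guess

acc = attribution_accuracy(["bad_prompt", "bad_topology"], emit, diagnose)
# acc == 0.5: with two equally likely faults, a fixed guess is right half the time.
```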

Figures

Figures reproduced from arXiv: 2604.20801 by Chaofan Shou, Hanzhi Liu, Hongbo Wen, Ryan Jingyang Fang, Xiaonan Liu, Yanju Chen, Yu Feng.

Figure 1
Figure 1: AgentFlow run on libheif (CVE-2020-23109). Each row is one iteration: propose a harness, execute it, score the outcome, and diagnose failures. Iteration 1 uses a single agent that produces a malformed input; iteration 2 adds an analyzer with coverage feedback; iteration 3 adds a verifier with AddressSanitizer and a retry loop, triggering the heap-buffer-overflow.
Figure 2
Figure 2: The AgentFlow language: abstract syntax (a) and well-formedness rules (b). The runtime that executes a well-formed program is described in Algorithm 1 and Section 6.
Figure 3
Figure 3: An AgentFlow program (left) and its compiled topology (right). Blue names match blue nodes; teal channels label the structural signal each agent reads. The validator consumes all eight explorer outputs through {{probes.out}} and decides which (if any) reproduces the crash; no separate join operator is needed.
Figure 4
Figure 4: High-level overview of the AgentFlow optimization loop (Algorithm 1). Inputs: target software (e.g. Chrome, TerminalBench-2 tasks, C/C++ libraries) and the Harness DSL (A, G, Σ, Φ, Ψ). Output: discovered vulnerabilities with proof-of-concept (PoC) exploit inputs. Each iteration proposes a new harness (a DSL program), executes it, scores the result, and diagnoses failures to guide the next proposal.
Figure 5
Figure 5: Synthesis trajectory (left) and leaderboard comparison (right) on TerminalBench-2 (89 tasks, Claude Opus 4.6, snapshot …).
Figure 6
Figure 6: All experiments run on a public cloud GPU/CPU pool …
Figure 6
Figure 6: Final synthesized AgentFlow harness for TerminalBench-2: nine specialised agent roles across five phases, with three parallel workspaces merged by an evaluator. Dashed teal arrows are structural-feedback channels.
Figure 7
Figure 7: Synthesized AgentFlow harness for the Chrome campaign (RQ3): 18 agent roles with 192 parallel explorers across seven subsystems. Six feedback loops drive iterative PoC generation. This harness produced the ten zero-days in …
read the original abstract

LLM agents have begun to find real security vulnerabilities that human auditors and automated fuzzers missed for decades, in source-available targets where the analyst can build and instrument the code. In practice the work is split among several agents, wired together by a harness: the program that fixes which roles exist, how they pass information, which tools each may call, and how retries are coordinated. When the language model is held fixed, changing only the harness can still change success rates by several-fold on public agent benchmarks, yet most harnesses are written by hand; recent harness optimizers each search only a narrow slice of the design space and rely on coarse pass/fail feedback that gives no diagnostic signal about why a trial failed. AgentFlow addresses both limitations with a typed graph DSL whose search space jointly covers agent roles, prompts, tools, communication topology, and coordination protocol, paired with a feedback-driven outer loop that reads runtime signals from the target program itself to diagnose which part of the harness caused the failure and rewrite it accordingly. We evaluate AgentFlow on TerminalBench-2 with Claude Opus 4.6 and on Google Chrome with Kimi K2.5. AgentFlow reaches 84.3% on TerminalBench-2, the highest score in the public leaderboard snapshot we evaluate against, and discovers ten previously unknown zero-day vulnerabilities in Google Chrome, including two Critical sandbox-escape vulnerabilities (CVE-2026-5280 and CVE-2026-6297).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgentFlow, a system for synthesizing multi-agent harnesses for LLM-driven vulnerability discovery. It defines a typed graph DSL that jointly encodes agent roles, prompts, tools, communication topology, and coordination protocols, then optimizes harnesses via a feedback-driven outer loop that extracts runtime signals from the target program to diagnose failures and perform targeted rewrites. On TerminalBench-2 with Claude Opus 4.6 the system reaches 84.3% (highest in the evaluated leaderboard snapshot); on Google Chrome with Kimi K2.5 it reports discovery of ten previously unknown zero-day vulnerabilities, including two Critical sandbox-escape issues assigned CVE-2026-5280 and CVE-2026-6297.

Significance. If the empirical results are substantiated, the work would be significant for automated security analysis: it shows that a broad, structured search space over harness designs plus diagnostic runtime feedback can outperform hand-crafted or narrowly optimized harnesses, producing both benchmark leadership and concrete, high-severity vulnerabilities in a large, hardened target. The real-world CVE discoveries, if verified, constitute a strong external validation of the approach.

major comments (2)
  1. [Evaluation / Results] The abstract and available description report concrete benchmark and CVE results, yet supply no experimental protocol, baseline comparisons, statistical details, or verification steps for the claimed zero-days. Full methods and data sections are required to judge whether the evidence supports the claims.
  2. [Feedback-driven outer loop / §3] The central mechanism rests on the assumption that runtime signals extracted from the target program supply sufficient diagnostic information to identify which harness component caused failure and to guide effective rewrites. Without details on signal extraction, rewrite heuristics, or ablation studies isolating the contribution of the feedback loop, it is impossible to verify that the outer loop drives targeted improvement rather than undirected search.
minor comments (2)
  1. [Abstract] The CVE identifiers CVE-2026-5280 and CVE-2026-6297 use future year prefixes; confirm whether these are assigned numbers or placeholders and provide public disclosure references if available.
  2. [DSL definition] Clarify the precise coverage of the typed graph DSL: does the representation explicitly include all five elements listed (roles, prompts, tools, topology, protocol) and are there any restrictions on prompt or tool expressiveness?

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important areas where the manuscript can be strengthened to better support the claims. We address each major comment below and will revise the manuscript to incorporate the requested details on experimental protocols and the feedback mechanism.

read point-by-point responses
  1. Referee: [Evaluation / Results] The abstract and available description report concrete benchmark and CVE results, yet supply no experimental protocol, baseline comparisons, statistical details, or verification steps for the claimed zero-days. Full methods and data sections are required to judge whether the evidence supports the claims.

    Authors: We agree that the current presentation of results would benefit from expanded documentation. Section 4 of the manuscript describes the evaluation on TerminalBench-2 (using Claude Opus 4.6) and Google Chrome (using Kimi K2.5), including the reported 84.3% score and the ten zero-day discoveries with CVE assignments. To fully address this, we will revise Section 4 to include a complete experimental protocol (e.g., harness generation parameters, number of optimization iterations, and termination criteria), explicit baseline comparisons against other leaderboard entries, statistical details such as run-to-run variance where multiple trials were performed, and verification steps for the zero-days. An appendix will be added with high-level descriptions of the vulnerabilities (including the two Critical sandbox escapes CVE-2026-5280 and CVE-2026-6297), confirmation methods, and public CVE references to allow independent assessment. revision: yes

  2. Referee: [Feedback-driven outer loop / §3] The central mechanism rests on the assumption that runtime signals extracted from the target program supply sufficient diagnostic information to identify which harness component caused failure and to guide effective rewrites. Without details on signal extraction, rewrite heuristics, or ablation studies isolating the contribution of the feedback loop, it is impossible to verify that the outer loop drives targeted improvement rather than undirected search.

    Authors: We concur that additional technical detail is needed to substantiate the role of the feedback-driven outer loop. Section 3 introduces the typed graph DSL and the outer loop that uses runtime signals from the target to diagnose failures and trigger targeted rewrites. We will expand this section with explicit descriptions of signal extraction (e.g., parsing of execution logs, error codes, coverage metrics, and crash reports), the rewrite heuristics (mapping specific signal patterns to component-level changes such as prompt edits, tool additions, or topology adjustments), and new ablation studies. These ablations will compare the full system against variants that use only binary pass/fail feedback or disable the diagnostic loop entirely, quantifying the contribution to the observed performance gains on both benchmarks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical architecture and evaluation

full rationale

The paper describes an empirical system (typed-graph DSL + runtime-signal feedback loop) evaluated on external benchmarks and a live target (Google Chrome). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All load-bearing claims are experimental outcomes (84.3% on TerminalBench-2, ten zero-days) rather than reductions to inputs by construction. The architecture is presented as a design choice with stated assumptions, not as a theorem forced by prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the newly introduced typed graph DSL and the diagnostic power of runtime signals; the abstract provides no upstream derivation or independent evidence for either.

axioms (1)
  • domain assumption Runtime signals from the target program supply diagnostic information sufficient to identify and rewrite failing harness components
    Invoked in the description of the feedback-driven outer loop that rewrites the harness.

pith-pipeline@v0.9.0 · 5579 in / 1325 out tokens · 71677 ms · 2026-05-10T00:13:23.364251+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 26 canonical work pages · 13 internal anchors

  1. [1]

    AfterQuery. 2026. Terminus 2. Terminal-Bench 2.0 leaderboard. Claude Opus 4.6 leaderboard entry dated 2026-02-06. https://www.tbench.ai/leaderboard/terminal-bench/2.0

  2. [2]

    Anthropic. 2026. Claude Code. Terminal-Bench 2.0 leaderboard. Claude Opus 4.6 leaderboard entry dated 2026-02-07. https://www.tbench.ai/leaderboard/terminal-bench/2.0

  3. [3]

    Anthropic. 2026. Claude Platform Release Notes. Claude API Docs. February 5, 2026 entry launching Claude Opus 4.6. https://platform.claude.com/docs/en/release-notes/overview

  4. [4]

    berabuddies. 2026. agentflow. GitHub repository. https://github.com/berabuddies/agentflow

  5. [5]

    Big Sleep team. 2024. From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code. Project Zero. Posted 2024-11-01. https://projectzero.google/2024/10/from-naptime-to-big-sleep.html

  6. [6]

    BIGAI. 2026. TongAgents. Terminal-Bench 2.0 leaderboard. Claude Opus 4.6 leaderboard entry dated 2026-02-22. https://www.tbench.ai/leaderboard/terminal-bench/2.0

  7. [7]

    Capy. 2026. Capy. Terminal-Bench 2.0 leaderboard. Claude Opus 4.6 leaderboard entry dated 2026-03-12. https://www.tbench.ai/leaderboard/terminal-bench/2.0

  8. [8]

    Nicholas Carlini, Newton Cheng, Keane Lucas, Michael Moore, Milad Nasr, Vinay Prabhushankar, Winnie Xiao, Hakeem Angulu, Evyatar Ben Asher, Jackie Bow, Keir Bradwell, Ben Buchanan, David Forsythe, Daniel Freeman, Alex Gaynor, Xinyang Ge, Logan Graham, Kyla Guru, Hasnain Lakhani, Matt McNiece, Mojtaba Mehrara, Renee Nichol, Adnan Pirzada, Sophia Porter, And...

  9. [9]

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. In Proceedings of the 41st Internat...

  10. [10]

    Chrome Releases. 2026. Stable Channel Update for Desktop. Chrome Releases blog. 2026-04-15. https://chromereleases.googleblog.com/2026/04/stable-channel-update-for-desktop_15.html

  11. [11]

    Chrome Releases. 2026. Stable Channel Update for Desktop. Chrome Releases blog. 2026-03-31. https://chromereleases.googleblog.com/2026/03/stable-channel-update-for-desktop_31.html

  12. [12]

    Chrome Releases. 2026. Stable Channel Update for Desktop. Chrome Releases blog. 2026-03-18. https://chromereleases.googleblog.com/2026/03/stable-channel-update-for-desktop_18.html

  13. [13]

    Coder. 2026. Mux. Terminal-Bench 2.0 leaderboard. Claude Opus 4.6 leaderboard entry dated 2026-02-13. https://www.tbench.ai/leaderboard/terminal-bench/2.0

  14. [14]

    Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. In International Symposium on Software Testing and Analysis (ISSTA). doi:10.1145/3597926.3598067

  15. [15]

    Factory. 2026. Droid. Terminal-Bench 2.0 leaderboard. Claude Opus 4.6 leaderboard entry dated 2026-02-05. https://www.tbench.ai/leaderboard/terminal-bench/2.0

  16. [16]

    Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang. 2024. LLM Agents can Autonomously Hack Websites. arXiv preprint arXiv:2402.06664 (2024). doi:10.48550/arXiv.2402.06664

  17. [17]

    Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. 2020. AFL++: Combining Incremental Steps of Fuzzing Research. In 14th USENIX Workshop on Offensive Technologies (WOOT 20). https://www.usenix.org/conference/woot20/presentation/fioraldi

  18. [18]

    ForgeCode. 2026. ForgeCode. Terminal-Bench 2.0 leaderboard. Claude Opus 4.6 leaderboard entry dated 2026-03-12. https://www.tbench.ai/leaderboard/terminal-bench/2.0

  19. [19]

    ForgeCode. 2026. World’s #1 Coding Harness. ForgeCode. Accessed 2026-04-18; the public site reported more than 4.6K GitHub stars. https://forgecode.dev/

  20. [20]

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In International Conference on Learning Representations (ICLR). doi:1...

  21. [21]

    Shengran Hu, Cong Lu, and Jeff Clune. 2025. Automated Design of Agentic Systems. In International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2408.08435

  22. [22]

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2024. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. In International Conference on Learning Representations ...

  23. [23]

    Kimi Team. 2026. Kimi K2.5: Visual Agentic Intelligence. arXiv preprint arXiv:2602.02276 (2026). doi:10.48550/arXiv.2602.02276

  24. [24]

    KRAFTON AI. 2026. Terminus-KIRA. Terminal-Bench 2.0 leaderboard. Claude Opus 4.6 leaderboard entry dated 2026-02-22. https://www.tbench.ai/leaderboard/terminal-bench/2.0

  25. [25]

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. 2026. Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv preprint arXiv:2603.28052 (2026). doi:10.48550/arXiv.2603.28052

  26. [26]

    Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36. doi:10.52202/075280-2264

  27. [27]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173. doi:10.1162/tacl_a_00638

  28. [28]

    LMSYS. 2026. Chatbot Arena Leaderboard. LMSYS Chatbot Arena. https://lmarena.ai/leaderboard. Accessed 2026-04-18

  29. [29]

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. In Advances in Neural Information Processi...

  30. [30]

    Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. 2024. Large Language Model guided Protocol Fuzzing. In Network and Distributed System Security Symposium (NDSS). doi:10.14722/ndss.2024.24556

  31. [31]

    Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. 2024. Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 9340–9366. doi:10.18653/v1/2024.emnlp-main.525

  32. [32]

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ChatDev: Communicative Agents for Software Development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 15174–15186. do...

  33. [33]

    Roam. 2026. Crux. Terminal-Bench 2.0 leaderboard. Claude Opus 4.6 leaderboard entry dated 2026-02-23. https://www.tbench.ai/leaderboard/terminal-bench/2.0

  34. [34]

    Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. 2012. AddressSanitizer: A Fast Address Sanity Checker. In 2012 USENIX Annual Technical Conference (USENIX ATC 12). 309–318. https://www.usenix.org/conference/atc12/technical-sessions/presentation/serebryany

  35. [35]

    Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. 2024. NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security. In Advances in Neural Information Proc...

  36. [36]

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36. doi:10.48550/arXiv.2303.11366

  37. [37]

    Terminal-Bench. 2026. terminal-bench@2.0 Leaderboard. Terminal-Bench. Official benchmark leaderboard, accessed 2026-04-17. https://www.tbench.ai/leaderboard/terminal-bench/2.0

  38. [38]

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2024. Voyager: An Open-Ended Embodied Agent with Large Language Models. Transactions on Machine Learning Research (TMLR) (2024). doi:10.48550/arXiv.2305.16291

  39. [39]

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, and Chi Wang. 2024. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation ...

  40. [40]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying LLM-based Software Engineering Agents. arXiv preprint arXiv:2407.01489 (2024). doi:10.48550/arXiv.2407.01489

  41. [41]

    Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4All: Universal Fuzzing with Large Language Models. In IEEE/ACM International Conference on Software Engineering (ICSE). doi:10.1145/3597503.3639121

  42. [42]

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. 2024. Large Language Models as Optimizers. In International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2309.03409

  43. [43]

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36. doi:10.48550/arXiv.2305.10601

  44. [44]

    Mert Yüksekgönül, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. 2025. Optimizing generative AI by backpropagating language model feedback. Nature (2025). doi:10.1038/s41586-025-08661-4

  45. [45]

    Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, and Xiang Wang. 2025. Multi-Agent Architecture Search via Agentic Supernet. arXiv preprint arXiv:2502.04180 (2025)

  46. [46]

    Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, and Xiang Wang. 2025. Multi-Agent Architecture Search via Agentic Supernet. In International Conference on Machine Learning (ICML). doi:10.48550/arXiv.2502.04180

  47. [47]

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. 2025. AFlow: Automating Agentic Workflow Generation. In International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2410.10762

  48. [48]

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. In International Symposium on Software Testing and Analysis (ISSTA). 1592–1604. doi:10.1145/3650212.3680384

  49. [49]

    Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. 2026. Teams of LLM Agents can Exploit Zero-Day Vulnerabilities. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 23–35. doi:10.18653/v1/2026.eacl-long.2