CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly
Pith reviewed 2026-06-29 21:17 UTC · model grok-4.3
The pith
CyberEvolver lets LLM cybersecurity agents revise their own scaffolds from failed execution traces, lifting success rates by 13.6 percent on average.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CyberEvolver decomposes scaffold optimization into a four-layer evolvable agent architecture, converts noisy execution logs into actionable revision signals via a trace-to-diagnosis mechanism, and preserves diversity through population-based beam search, allowing the agent to improve its own structure over iterations and achieve a 13.6 percent average gain in success rate over the seed agent while beating six human-designed agents and two adapted self-improvement baselines on CTF, exploitation, and penetration-testing tasks with four open-source LLMs.
What carries the argument
The four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, paired with the trace-to-diagnosis mechanism and population-based beam search.
If this is right
- Agents built this way adapt across diverse targets and failure modes without fixed human scaffolds.
- The same framework works with multiple open-source LLMs on CTF challenges, vulnerability exploitation, and penetration testing.
- It produces higher success rates than six existing human-designed cybersecurity agents.
- It also exceeds two self-improvement methods taken from other domains.
Where Pith is reading between the lines
- The same decomposition and diversity-preserving search could be tested in agent settings outside cybersecurity where feedback is also sparse.
- If the diagnosis mapping proves stable, teams might shift effort from hand-crafting scaffolds to curating better log-to-signal translators.
- Maintaining a population of variants may prevent the kind of premature convergence seen in single-lineage self-improvement loops.
Load-bearing premise
The trace-to-diagnosis step can turn noisy or obscured execution logs into reliable revision signals without errors compounding across iterations.
What would settle it
Run repeated self-evolution cycles on a set of tasks with deliberately obscured logs and check whether success rates stop rising or begin to fall after the first few iterations.
Figures
read the original abstract
LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce \textsc{CyberEvolver}, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts. Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often obscured by the environment, and low-diversity updates can cause errors to compound over repeated iterations. \textsc{CyberEvolver} addresses these challenges with a four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, a trace-to-diagnosis mechanism that converts noisy execution logs into actionable revision signals, and a population-based beam search strategy that preserves diverse agent variants during evolution. We evaluate \textsc{CyberEvolver} on CTF challenges, vulnerability exploitation, and penetration-testing tasks using four open-source LLMs. Across these settings, \textsc{CyberEvolver} improves the seed agent's success rate by $13.6$\,\% on average, and outperforms six human-designed cybersecurity agents as well as two self-improvement methods adapted from other domains. These results suggest that scaffold self-evolution is a promising direction for building adaptive LLM agents for security testing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CyberEvolver, a self-evolving LLM agent framework for cybersecurity that iteratively revises its scaffold using a four-layer evolvable agent architecture, a trace-to-diagnosis mechanism to convert noisy execution logs into revision signals, and a population-based beam search to maintain diversity. It evaluates the approach on CTF challenges, vulnerability exploitation, and penetration-testing tasks with four open-source LLMs, claiming a 13.6% average success-rate improvement over the seed agent while outperforming six human-designed cybersecurity agents and two adapted self-improvement baselines.
Significance. If the performance claims hold under rigorous controls, the work would offer a concrete demonstration that structured self-evolution can mitigate the limitations of fixed scaffolds in domains with sparse and obscured feedback, advancing adaptive agent design for security applications.
major comments (2)
- [Abstract] Abstract: the central claim of a 13.6% average success-rate lift and outperformance of six human-designed agents plus two baselines is stated without any mention of task counts, run variance, statistical tests, or failure-mode breakdowns, rendering the empirical contribution unverifiable from the given evidence.
- [Abstract] Abstract: the trace-to-diagnosis mechanism is presented as the key step that converts obscured logs into non-compounding revision signals, yet the manuscript supplies no ablation, error analysis, or validation that this conversion step succeeds reliably; because the four-layer architecture and beam search both depend on its output, any systematic mis-diagnosis would invalidate attribution of the reported gains to self-evolution.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive suggestions focused on the abstract. We agree that the abstract can be strengthened for clarity and verifiability while preserving its concise nature. Below we respond point-by-point and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of a 13.6% average success-rate lift and outperformance of six human-designed agents plus two baselines is stated without any mention of task counts, run variance, statistical tests, or failure-mode breakdowns, rendering the empirical contribution unverifiable from the given evidence.
Authors: We agree that the abstract should supply sufficient context for the performance claims. In the revised manuscript we will expand the final sentence of the abstract to state the evaluation scope (CTF, exploitation, and penetration-testing tasks across four LLMs), note that the 13.6% figure is an average over multiple independent runs with reported standard deviation, and indicate that statistical significance was assessed with appropriate tests (e.g., paired t-test or Wilcoxon signed-rank). Failure-mode breakdowns and per-task counts remain in the main experimental section (Section 4) but will be briefly referenced in the abstract for completeness. revision: yes
-
Referee: [Abstract] Abstract: the trace-to-diagnosis mechanism is presented as the key step that converts obscured logs into non-compounding revision signals, yet the manuscript supplies no ablation, error analysis, or validation that this conversion step succeeds reliably; because the four-layer architecture and beam search both depend on its output, any systematic mis-diagnosis would invalidate attribution of the reported gains to self-evolution.
Authors: We acknowledge the referee's concern that the abstract presents the trace-to-diagnosis component without accompanying validation. The current manuscript contains component ablations in Section 5 that quantify the contribution of trace-to-diagnosis relative to the other layers and the beam search; however, a dedicated error analysis of diagnosis accuracy (e.g., manual inspection of a sample of traces and mis-diagnosis rate) is not present. We will therefore add a short paragraph and accompanying table in the revised manuscript that reports diagnosis reliability on a held-out set of traces and shows that mis-diagnosis events remain below a threshold that would explain the observed gains. This addition will be summarized concisely in the abstract as well. revision: yes
Circularity Check
No significant circularity in derivation or claims
full rationale
The paper's central claims consist of empirical performance measurements (13.6% average success-rate lift, outperformance of six human-designed agents and two adapted baselines) obtained on external CTF, vulnerability exploitation, and penetration-testing tasks. These results are evaluated against independent benchmarks rather than quantities fitted inside the same experiments or reduced by construction to the paper's own equations or self-citations. The four-layer architecture, trace-to-diagnosis mechanism, and population beam search are presented as design choices whose effectiveness is tested externally; no self-definitional loops, fitted-input predictions, or load-bearing self-citation chains appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Execution feedback in cybersecurity is sparse and often obscured by the environment.
invented entities (3)
-
four-layer evolvable agent architecture
no independent evidence
-
trace-to-diagnosis mechanism
no independent evidence
-
population-based beam search strategy
no independent evidence
Reference graph
Works this paper leans on
-
[1]
T. Abramovich, M. Udeshi, M. Shao, K. Lieret, H. Xi, K. Milner, S. Jancheska, J. Yang, C. E. Jimenez, F. Khorrami, P. Krishnamurthy, B. Dolan-Gavitt, M. Shafique, K. Narasimhan, R. Karri, and O. Press. EnIGMA: Interactive tools substantially assist LM agents in finding security vulnerabilities, 2024. URLhttps://arxiv.org/abs/2409.16165
-
[2]
Agent skills
Anthropic. Agent skills. https://platform.claude.com/docs/en/agents-and-tools/ agent-skills/overview, 2025. Accessed: 2026-05-07
2025
-
[3]
Applis, Y
L. Applis, Y . Zhang, S. Liang, N. Jiang, L. Tan, and A. Roychoudhury. Unified software engineering agent as AI software engineer, 2025. URL https://arxiv.org/abs/2506. 14683
2025
-
[4]
Brown, B
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei...
-
[5]
URL https://proceedings.neurips.cc/paper_ files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_ files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
2020
-
[6]
CSAW CTF Qualification Round 2023
CTFtime. CSAW CTF Qualification Round 2023. https://ctftime.org/event/2087/,
2023
-
[7]
Scoreboard reports 1,096 teams total
-
[8]
DeepSeek-V3 technical report, 2024
DeepSeek-AI. DeepSeek-V3 technical report, 2024. URL https://arxiv.org/abs/2412. 19437
2024
-
[9]
The temperature parameter, n.d
DeepSeek-AI. The temperature parameter, n.d.. URL https://api-docs.deepseek.com/ quick_start/parameter_settings. DeepSeek API Docs; accessed 2026-05-07
2026
-
[10]
Create chat completion, n.d
DeepSeek-AI. Create chat completion, n.d.. URL https://api-docs.deepseek.com/api/ create-chat-completion. DeepSeek API Docs; accessed 2026-05-07
2026
- [11]
-
[12]
X. Deng, J. Da, E. Pan, Y . Y . He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, K. Sampath, M. Krishnan, S. Kundurthy, S. Hendryx, Z. Wang, V . Bharadwaj, J. Holm, R. Aluri, C. B. C. Zhang, N. Jacobson, B. Liu, and B. Kenstler. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025. URL https://scale.com/research/ ...
2025
-
[13]
J. Dodge, M. Sap, A. Marasovi ´c, W. Agnew, G. Ilharco, D. Groeneveld, M. Mitchell, and M. Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1286–1305, Onl...
-
[14]
Fernando, D
C. Fernando, D. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. InInternational Conference on Machine Learning, 2024. 10
2024
-
[15]
Fernando, D
C. Fernando, D. S. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. InProceedings of the 41st Interna- tional Conference on Machine Learning, 2024. URL https://openreview.net/forum? id=HKkiX32Zw1
2024
-
[16]
L. Gioacchini, M. Mellia, I. Drago, A. Delsanto, G. Siracusano, and R. Bifulco. Autopenbench: Benchmarking generative agents for penetration testing, 2024. URL https://arxiv.org/ abs/2410.03225
-
[17]
Welcome to the eternal september of open source
GitHub. Welcome to the eternal september of open source. here’s what we plan to do for maintainers, 2026. URL https://github.blog/open-source/maintainers/ welcome-to-the-eternal-september-of-open-source-heres-what-we-plan-to-do-for-maintainers/ . GitHub Blog
2026
-
[18]
Bug bounty policy, n.d
Global. Bug bounty policy, n.d. URL https://global.com/bug-bounty-policy/. Exam- ple bug bounty policy requiring actionable proof-of-concept evidence
-
[19]
S. Hu, C. Lu, and J. Clune. Automated design of agentic systems. InInternational Conference on Learning Representations, 2025
2025
- [20]
-
[21]
AI-generated pull requests overwhelming, hard to review carefully, 2026
ITK Discourse Contributors. AI-generated pull requests overwhelming, hard to review carefully, 2026. URL https://discourse.itk.org/t/ ai-generated-pull-requests-overwhelming-hard-to-review-carefully/7728 . Community discussion on maintainer burden from AI-generated pull requests
2026
-
[22]
Z. Ji, D. Wu, W. Jiang, P. Ma, Z. Li, and S. Wang. Measuring and augmenting large language models for solving capture-the-flag challenges, 2025. URL https://arxiv.org/abs/2506. 17644
2025
-
[23]
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. Swe- bench: Can language models resolve real-world github issues? InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=VTF8yNQM66
2024
-
[24]
Kimi K2.5: Visual Agentic Intelligence
Kimi Team. Kimi K2.5: Visual agentic intelligence, 2026. URL https://arxiv.org/abs/ 2602.02276
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[25]
moonshotai/Kimi-K2.5
Kimi Team. moonshotai/Kimi-K2.5. https://huggingface.co/moonshotai/Kimi-K2.5,
-
[26]
Hugging Face model card; accessed 2026-05-07
2026
- [28]
- [29]
-
[30]
D. Lee, G. eun Bae, and I. yun. CTFusion : A CTF-based benchmark for LLM agent evaluation,
-
[31]
URLhttps://openreview.net/forum?id=2zQJHLbyqM
-
[32]
H. Lee, Z. Zhang, H. Lu, and L. Zhang. SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks. InAdvances in Neural Information Processing Systems,
-
[33]
URLhttps://openreview.net/forum?id=QQhQIqons0
- [34]
-
[35]
J. W. Lin, E. K. Jones, D. J. Jasper, E. J.-s. Ho, A. Wu, A. T. Yang, N. Perry, A. Zou, M. Fredrik- son, J. Z. Kolter, P. Liang, D. Boneh, and D. E. Ho. Comparing ai agents to cybersecurity profes- sionals in real-world penetration testing, 2025. URL https://arxiv.org/abs/2512.09882. 11
-
[36]
A. Z. Liu, J. Choi, S. Sohn, Y . Fu, J. Kim, D.-K. Kim, X. Wang, J. Yoo, and H. Lee. Skillact: Using skill abstractions improves LLM agents. InICML 2024 Workshop on LLMs and Cognition,
2024
-
[37]
URLhttps://openreview.net/forum?id=6LG3cIRrF4
-
[38]
Madaan, N
A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y . Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark. Self-refine: Iterative refinement with self-feedback. InAdvances in Neural Information Processing Systems, 2023
2023
- [39]
-
[40]
MiniMax M2.5: Built for real-world productivity, 2026
MiniMax. MiniMax M2.5: Built for real-world productivity, 2026. URL https://www. minimax.io/news/minimax-m25. MiniMax official blog
2026
-
[41]
MiniMaxAI/MiniMax-M2.5
MiniMaxAI. MiniMaxAI/MiniMax-M2.5. https://huggingface.co/MiniMaxAI/ MiniMax-M2.5, 2026. Hugging Face model card; accessed 2026-05-07
2026
-
[42]
AlphaEvolve: A coding agent for scientific and algorithmic discovery
A. Novikov, N. V˜u, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. R. Ruiz, A. Mehrabian, M. P. Kumar, A. See, S. Chaudhuri, G. Holland, A. Davies, S. Nowozin, P. Kohli, and M. Balog. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
CSAW Quals 2023: web/cookie-injection challenge meta- data
NYU CTF Bench. CSAW Quals 2023: web/cookie-injection challenge meta- data. https://github.com/NYU-LLM-CTF/NYU_CTF_Bench/blob/main/test/2023/ CSAW-Quals/web/cookie-injection/challenge.json, 2024. Challenge metadata lists the dynamic scoring parameters and final point value of 488
2023
-
[44]
Gpt-5.3-codex system card
OpenAI. Gpt-5.3-codex system card. OpenAI, Feb. 2026. URL https://openai.com/ index/gpt-5-3-codex-system-card/
2026
-
[45]
Ouyang, J
S. Ouyang, J. Yan, I.-H. Hsu, Y . Chen, K. Jiang, Z. Wang, R. Han, L. Le, S. Daruki, X. Tang, V . Tirumalashetty, G. Lee, M. Rofouei, H. Lin, J. Han, C.-Y . Lee, and T. Pfister. Reasoning- Bank: Scaling agent self-evolving with reasoning memory. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum? ...
2026
-
[46]
Vulnerability disclosure cheat sheet, 2024
OWASP Foundation. Vulnerability disclosure cheat sheet, 2024. URL https: //cheatsheetseries.owasp.org/cheatsheets/Vulnerability_Disclosure_ Cheat_Sheet.html. OW ASP Cheat Sheet Series
2024
-
[47]
Qwen Team. Qwen3 technical report, 2025. URLhttps://arxiv.org/abs/2505.09388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Qwen/Qwen3-235B-A22B-Instruct-2507
Qwen Team. Qwen/Qwen3-235B-A22B-Instruct-2507. https://huggingface.co/Qwen/ Qwen3-235B-A22B-Instruct-2507 , 2025. Hugging Face model card; accessed 2026-05-07
2025
-
[49]
Robeyns, M
M. Robeyns, M. Szummer, and L. Aitchison. A self-improving coding agent. InScaling Self- Improving Foundation Models without Human Supervision, 2025. URLhttps://openreview. net/forum?id=rShJCyLsOr
2025
-
[50]
AXE: Grey-Box Exploitability Confirmation for Localized Vulnerability Reports
A. Sajadi, T. Nguyen, K. Damevski, and P. Chatterjee. Axe: An agentic exploit engine for confirming zero-day vulnerability reports.arXiv preprint arXiv:2602.14345, 2026. URL https://arxiv.org/abs/2602.14345
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[51]
A. L. Samuel. Some studies in machine learning using the game of checkers.IBM Journal of Research and Development, 3(3):210–229, 1959. doi: 10.1147/rd.33.0210
-
[52]
J. Schmidhuber. G"odel machines: Fully self-referential optimal universal self-improvers. InAr- tificial General Intelligence, pages 199–226. Springer, 2007. doi: 10.1007/978-3-540-68677-4_ 7
-
[53]
Shang, Y
Y . Shang, Y . Li, K. Zhao, L. Ma, J. Liu, F. Xu, and Y . Li. Agentsquare: Automatic llm agent search in modular design space. InInternational Conference on Learning Representations, 2025. 12
2025
-
[54]
M. Shao, S. Jancheska, M. Udeshi, B. Dolan-Gavitt, H. Xi, K. Milner, B. Chen, M. Yin, S. Garg, P. Krishnamurthy, F. Khorrami, R. Karri, and M. Shafique. Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security, 2024. URL https://arxiv.org/abs/2406.05590
- [55]
-
[56]
Shinn, F
N. Shinn, F. Cassano, A. Gopinath, K. R. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. InAdvances in Neural Information Processing Systems,
-
[57]
URLhttps://openreview.net/forum?id=vAElhFcKW6
- [58]
-
[59]
R. S. Sutton and A. G. Barto.Reinforcement Learning: An Introduction. MIT Press, 2 edition,
- [60]
-
[61]
Udeshi, M
M. Udeshi, M. Shao, H. Xi, N. Rani, K. Milner, V . S. C. Putrevu, B. Dolan-Gavitt, S. K. Shukla, P. Krishnamurthy, F. Khorrami, R. Karri, and M. Shafique. D-cipher: Dynamic collaborative intelligent multi-agent system with planner and heterogeneous executors for offensive security,
- [62]
-
[63]
G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar. V oyager: An open-ended embodied agent with large language models.Transactions on Machine Learning Research, 2024. URLhttps://openreview.net/forum?id=ehfRiF0R3a
2024
- [64]
- [65]
-
[66]
Z. Z. Wang, J. Mao, D. Fried, and G. Neubig. Agent workflow memory, 2024. URL https: //arxiv.org/abs/2409.07429
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[67]
C. J. C. H. Watkins and P. Dayan. Q-learning.Machine Learning, 8:279–292, 1992. doi: 10.1007/BF00992698
-
[68]
C. Yang, X. Wang, Y . Lu, H. Liu, Q. V . Le, D. Zhou, and X. Chen. Large language models as optimizers. InInternational Conference on Learning Representations, 2024
2024
-
[69]
J. Yang, A. Prabhakar, K. R. Narasimhan, and S. Yao. Intercode: Standardizing and benchmark- ing interactive coding with execution feedback. InAdvances in Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id=fvKaLF1ns8
2023
-
[70]
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE- agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id= mXpq6ut8J3
2024
-
[71]
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y . Cao. ReAct: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=WE_vluYUL-X
2023
-
[72]
A. K. Zhang, J. Ji, C. Menders, R. Dulepet, T. Qin, R. Y . Wang, J. Wu, K. Liao, J. Li, J. Hu, et al. Bountybench: Dollar impact of ai agent attackers and defenders on real-world cybersecurity systems. InAdvances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=pIsP4lMlFd. 13
2025
-
[73]
A. K. Zhang, N. Perry, R. Dulepet, J. Ji, C. Menders, J. W. Lin, E. Jones, G. Hussein, S. Liu, D. J. Jasper, et al. Cybench: A framework for evaluating cybersecurity capabilities and risks of language models. InThe Thirteenth International Conference on Learning Representations,
-
[74]
URLhttps://openreview.net/forum?id=tc90LV0yRL
-
[75]
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
J. Zhang, S. Hu, C. Lu, R. Lange, and J. Clune. Darwin Godel Machine: Open-ended evolution of self-improving agents, 2025. URLhttps://arxiv.org/abs/2505.22954
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[76]
Zhang, J
J. Zhang, J. Xiang, Z. Yu, F. Teng, X. Chen, J. Chen, M. Zhuge, X. Cheng, S. Hong, J. Wang, B. Zheng, B. Liu, Y . Luo, and C. Wu. AFlow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, 2025. URL https: //openreview.net/forum?id=z5uVAKwmjf
2025
-
[77]
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Q. Zhang, C. Hu, S. Upasani, B. Ma, F. Hong, V . Kamanuru, J. Rainton, C. Wu, M. Ji, H. Li, U. Thakker, J. Zou, and K. Olukotun. Agentic context engineering: Evolving contexts for self-improving language models, 2025. URLhttps://arxiv.org/abs/2510.04618
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[78]
A. Zhao, D. Huang, Q. Xu, M. Lin, Y .-J. Liu, and G. Huang. ExpeL: Llm agents are experiential learners. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024. doi: 10.1609/aaai.v38i17.29936. URL https://doi.org/10.1609/ aaai.v38i17.29936
-
[79]
S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y . Bisk, D. Fried, U. Alon, and G. Neubig. WebArena: A realistic web environment for building autonomous agents, 2023. URLhttps://arxiv.org/abs/2307.13854
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[80]
Y . Zhu, T. Jin, Y . Pruksachatkun, A. Zhang, S. Liu, S. Cui, S. Kapoor, S. Longpre, K. Meng, R. Weiss, F. Barez, R. Gupta, J. Dhamala, J. Merizian, M. Giulianelli, H. Coppock, C. Ududec, J. Sekhon, J. Steinhardt, A. Kellermann, S. Schwettmann, M. Zaharia, I. Stoica, P. Liang, and D. Kang. Establishing best practices for building rigorous agentic benchmar...
-
[81]
Y . Zhu, A. Kellermann, D. Bowman, P. Li, A. Gupta, A. Danda, R. Fang, C. Jensen, E. Ihli, J. Benn, J. Geronimo, A. Dhir, S. Rao, K. Yu, T. Stone, and D. Kang. Cve-bench: A benchmark for ai agents’ ability to exploit real-world web application vulnerabilities. InProceedings of the 42nd International Conference on Machine Learning, pages 79850–79867, 2025....
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.