FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Adrian Taylor; Chung-Horng Lung; Igor Bogdanov; Jie Gao; Marzia Zaman; Thomas Kunz

arxiv: 2605.16233 · v1 · pith:NUY3N7AVnew · submitted 2026-05-15 · cs.AI · cs.CL · cs.LG · cs.MA · cs.SY · eess.SY

Pith reviewed 2026-05-20 18:43 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.LG · cs.MA · cs.SY · eess.SY

keywords LLM agents · self-evolving memory · population broadcast · reflective learning · prompt injection · no weight updates · ReAct agents · CybORG CAGE-2

The pith

A population of LLM agents can evolve effective natural-language memory from their own failures and broadcast the best versions to each other, lifting performance 1.7 to 7.7 times over zero-shot baselines with no model weight updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that LLM agents can improve long-horizon decision making in uncertain environments by turning their own failed attempts into reusable textual knowledge and then sharing the strongest versions across a population. A reader would care because this route avoids both gradient-based retraining and the need for stronger teacher models. The method wraps a reflection step that converts trajectories into rules or examples, then uses an outer loop to spread the best memory while freezing agents that have stabilized. Evidence comes from consistent gains across four different LLM families on a network-defense task where zero-shot performance is poor and heavy-tailed.

Core claim

FORGE runs an inner Reflexion-style loop in which a reflection agent using the same base model converts failed trajectories into one of three knowledge artifacts (Rules, Examples, or Mixed). An outer population loop then broadcasts the highest-performing instance's memory to the rest of the population between stages and applies a graduation test to freeze converged instances. On the CybORG CAGE-2 stochastic POMDP against the B-line attacker, this protocol raises average evaluation return by 1.7-7.7 times over zero-shot and by 29-72 percent over isolated Reflexion across all twelve model-representation pairs, while dropping major-failure rates (returns below -100) to as low as roughly one percent.
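The inner loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `Artifact`, `inject`, `inner_loop`, the toy environment, the toy reflector, and the `failure_threshold` value are all hypothetical names and numbers introduced here, and the real system calls an LLM where the stand-ins return fixed strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Artifact:
    kind: str  # "rules", "examples", or "mixed"
    text: str


def inject(base_prompt: str, memory: List[Artifact]) -> str:
    """Return the system prompt with memory appended; model weights are never touched."""
    if not memory:
        return base_prompt
    blocks = [f"[{a.kind.upper()}]\n{a.text}" for a in memory]
    return base_prompt + "\n\n# Learned memory\n" + "\n\n".join(blocks)


def inner_loop(
    base_prompt: str,
    run_episode: Callable[[str], Tuple[str, float]],
    reflect: Callable[[str, str], Artifact],
    kind: str,
    episodes: int,
    failure_threshold: float,
) -> Tuple[List[Artifact], List[float]]:
    """Run episodes; after each failed one, ask the reflector for a new artifact."""
    memory: List[Artifact] = []
    returns: List[float] = []
    for _ in range(episodes):
        trajectory, episode_return = run_episode(inject(base_prompt, memory))
        returns.append(episode_return)
        if episode_return < failure_threshold:  # hypothetical failure trigger
            memory.append(reflect(trajectory, kind))
    return memory, returns


# Toy stand-ins for the environment and the reflection agent (not real LLM calls).
def toy_episode(prompt: str) -> Tuple[str, float]:
    learned = prompt.count("[RULES]")
    return f"trajectory with {learned} rules", -120.0 + 50.0 * learned


def toy_reflect(trajectory: str, kind: str) -> Artifact:
    return Artifact(kind, "IF reconnaissance is observed THEN deploy a decoy.")


memory, returns = inner_loop("You defend a network.", toy_episode, toy_reflect, "rules", 4, -50.0)
print(returns)  # [-120.0, -70.0, -20.0, -20.0]
```

The point of the sketch is that the only state that changes between episodes is the text appended to the prompt; whether the real artifacts encode reusable strategy is exactly what the referee report asks the authors to demonstrate.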

What carries the argument

The population broadcast step that copies the single best memory artifact from the current stage to every active agent, paired with a graduation criterion that removes converged agents from further evolution.
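A hedged sketch of the broadcast step follows. The `Agent` class and `broadcast` function are hypothetical, and the choice to pick the champion over the whole population (graduated agents included) is an assumption made here; the page does not say which agents are eligible to be champion.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    name: str
    memory: List[str] = field(default_factory=list)
    stage_return: float = float("-inf")  # mean evaluation return in the current stage
    graduated: bool = False              # graduated agents keep their memory frozen


def broadcast(population: List[Agent]) -> Agent:
    """Copy the champion's memory to every non-graduated agent and return the champion."""
    champion = max(population, key=lambda a: a.stage_return)
    for agent in population:
        if agent is not champion and not agent.graduated:
            agent.memory = list(champion.memory)  # copy so later edits stay independent
    return champion


population = [
    Agent("a", ["r1"], stage_return=-80.0),
    Agent("b", ["r2", "r3"], stage_return=-20.0),
    Agent("c", ["old"], stage_return=-150.0, graduated=True),
]
champion = broadcast(population)
print(champion.name, population[0].memory, population[2].memory)
# b ['r2', 'r3'] ['old']
```

Copying the list rather than sharing it matters in this sketch: after the broadcast each active agent continues its own inner loop, so its memory must be free to diverge from the champion's in the next stage.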

If this is right

Population broadcast itself drives the measured gains; removing graduation changes little beyond compute cost.
Example-based artifacts deliver the highest returns for three of the four tested models.
Rule-based artifacts provide the best balance of performance and token cost, using about 40 percent fewer tokens.
Models that start weaker show the largest relative improvement, indicating the method narrows rather than widens capability gaps.
All reported gains remain confined to the CAGE-2 B-line setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same broadcast-and-graduation structure could be tested on other partially observable long-horizon tasks such as robotic planning or multi-agent games.
If the reflection step works across model families, the approach offers a route to agent improvement that scales with population size instead of model size.
A natural next measurement would be whether graduated memory remains useful when the underlying attacker strategy changes mid-episode.

Load-bearing premise

A reflection agent using the same base model can reliably produce knowledge artifacts from failures that, once injected into prompts and broadcast, produce stable performance gains in the downstream population.

What would settle it

Apply the same broadcast protocol on a different stochastic task or with a fresh set of models and observe that average returns remain at or below the isolated Reflexion baseline.

Figures

Figures reproduced from arXiv: 2605.16233 by Adrian Taylor, Chung-Horng Lung, Igor Bogdanov, Jie Gao, Marzia Zaman, Thomas Kunz.

**Figure 1.** System Overview. (Left) Hierarchical ReAct agent with dynamic memory injection. (Right) Reflexion learning

**Figure 2.** Protocol Details. (Left) The FORGE protocol involves parallel execution, champion selection, graduation and broadcast

**Figure 3.** Comparison of memory representations (Rules, Examples, and Mixed) across zero-shot, Reflexion, and FORGE conditions for all four model families. Bars represent mean return; error bars denote SEM. Improvement factors over zero-shot annotated above FORGE bars; checkmarks indicate the winning condition

**Figure 4.** Combined analysis for Gemini-2.5-Flash-Lite. (A) Performance: All representations consistently outperform Baseline.

**Figure 5.** Protocol comparison. (A) Mean evaluation return across four models under four conditions: FORGE, FORGE without

**Figure 6.** Graduation dynamics and no-graduation ablation (all models pooled). (A) Active instances and per-instance compute

**Figure 7.** Risk and variance analysis. (A) Cumulative distribution of evaluation scores (all models pooled): zero-shot shows a

**Figure 8.** CAGE-2 Environment Overview. (a) The defender protects a 13-host network segmented into subnets

**Figure 9.** Rules artifact generated by the Reflector after a failed episode. Each rule is a conditional heuristic injected into the agent’s system prompt. <example description='PlanMonitorAndDecoy AfterReconAnalysis'> Thought: Enterprise_Host shows signs of reconnaissance from 10.0.247.46. Per reflection knowledge, plan monitoring and decoy deployment. Tool: get_suggestion_for_next_action: {"target_host": "Enterprise…

**Figure 10.** Examples artifact generated by the Exemplifier after a failed episode (abbreviated). The demonstration mimics a full ReAct interaction cycle

**Figure 11.** Failure trigger threshold analysis. (Left) Per-step penalty distribution across zero-shot episodes (log scale). The red

Original abstract

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
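The headline numbers in the abstract combine three different metrics, which the following sketch makes concrete. Only the major-failure cutoff of -100 comes from the abstract. The helper names are hypothetical, the per-episode returns are invented for illustration, and reading "1.7-7.7×" as the ratio of two negative mean returns is one plausible interpretation, not a definition confirmed on this page.

```python
from statistics import mean
from typing import Sequence

MAJOR_FAILURE = -100.0  # abstract: a major failure is an episode return below -100


def major_failure_rate(returns: Sequence[float]) -> float:
    """Fraction of episodes whose return falls below the major-failure cutoff."""
    return sum(r < MAJOR_FAILURE for r in returns) / len(returns)


def improvement_factor(baseline_mean: float, method_mean: float) -> float:
    """Ratio baseline/method; assumed reading of an 'N times' gain when both means are negative."""
    if baseline_mean >= 0 or method_mean >= 0:
        raise ValueError("this reading only makes sense for negative mean returns")
    return baseline_mean / method_mean


def percent_gain(baseline_mean: float, method_mean: float) -> float:
    """Relative improvement of the method over the baseline, in percent."""
    return 100.0 * (method_mean - baseline_mean) / abs(baseline_mean)


# Invented returns, for illustration only (not data from the paper).
zero_shot = [-40, -60, -200, -100]
reflexion = [-30, -50, -60, -60]
forge = [-20, -30, -40, -30]

print(major_failure_rate(zero_shot))                            # 0.25
print(round(improvement_factor(mean(zero_shot), mean(forge)), 2))  # 3.33
print(percent_gain(mean(reflexion), mean(forge)))               # 40.0
```

Because the zero-shot distribution is described as heavy-tailed, a single episode such as the -200 above can dominate the mean, which is why the failure rate is worth reporting alongside the average.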

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FORGE gets clear lifts on CAGE-2 from population broadcast of evolved memory, though artifact quality needs checking.

The key point is that FORGE delivers consistent performance improvements on the CAGE-2 network defense POMDP by evolving and broadcasting natural language memory across a population of agents, all without touching the model weights. What stands out is the staged protocol that combines an inner Reflexion loop with an outer population broadcast and graduation step. They test this on four different LLMs and three memory types, showing gains of 1.7 to 7.7 times over zero-shot and 29 to 72 percent over plain Reflexion. The ablation that removes graduation but keeps broadcast still captures most of the benefit, which cleanly identifies the sharing mechanism as the main contributor. It's also interesting that weaker models close the gap more than stronger ones do.

The main limitation is the narrow scope. Everything is on one task against one attacker, so we don't know how this holds up elsewhere. More importantly, there's no qualitative look at the actual memory artifacts being generated. If the Rules and Examples are mostly generic or just longer versions of the original prompt, then the self-evolution story weakens even if the numbers improve. The abstract claims the gains are robust across conditions, but without error bars or statistical details visible here, it's hard to judge how reliable the differences are.

This work will interest people building LLM agents for real-world control problems where retraining is expensive, such as cybersecurity or robotics. Readers who want practical ways to improve few-shot agent performance through memory sharing will find the comparisons useful. The empirical setup with clear baselines and an ablation makes it worth a full referee process. I would send this to peer review. The idea is concrete enough that reviewers can check the details and suggest extensions to other domains.

Referee Report

3 major / 2 minor

Summary. The paper proposes FORGE, a staged population-based protocol for evolving prompt-injected natural-language memory in hierarchical ReAct LLM agents without weight updates. A reflection agent (same underlying LLM) converts failed trajectories into Rules, Examples, or Mixed artifacts; an outer loop broadcasts the best-performing memory across the population and freezes converged instances via graduation. On CybORG CAGE-2 (stochastic network-defense POMDP, 30-step horizon, B-line attacker), FORGE yields 1.7-7.7× higher average returns than zero-shot and 29-72% gains over Reflexion across 12 model-representation conditions with four LLMs (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B), reducing major-failure rates to as low as ~1%. Ablations identify population broadcast as the key driver while graduation mainly saves compute; Examples achieves the strongest returns for three of the four models, and Rules offers the better cost-reliability profile with ~40% fewer tokens.

Significance. If the central performance claims hold under more rigorous verification, the work would be significant for showing that LLM agents can achieve substantial self-improvement through natural-language memory evolution and population broadcast using only inference-time operations and no stronger teacher model. The directional finding that weaker baselines benefit disproportionately is a useful observation for closing capability gaps. The clean isolation of broadcast via the no-graduation ablation and the token-efficiency comparison between Rules and Examples are concrete contributions to agent memory design.

major comments (3)

[Experiments] Experiments section: the manuscript reports large gains from the generated artifacts but provides no qualitative analysis, examples, or content statistics of the Rules, Examples, or Mixed outputs. Without evidence that these artifacts encode reusable strategies (as opposed to generic heuristics, repeated failure summaries, or simple context-length effects), the interpretation of 'self-evolving memory' remains under-supported even if aggregate returns improve.
[Ablation studies] Ablation studies and results tables: while the no-graduation ablation isolates broadcast as the performance driver, the reported averages lack error bars, standard deviations, number of independent runs, or statistical significance tests. This weakens confidence in the 29-72% gains over Reflexion and the claim that broadcast consistently carries the improvements.
[Evaluation] Evaluation setup: all quantitative evidence is confined to a single environment (CAGE-2 with B-line attacker at fixed 30-step horizon). The central claim that FORGE enables reliable self-evolution via population broadcast would be more robust if supported by at least one additional task or attacker to reduce the risk of environment-specific artifacts.

minor comments (2)

[Method] Clarify the exact definition and computation of the graduation threshold and population size in the method section, as these are listed as free parameters yet their sensitivity is not fully explored.
[Results] Ensure consistent reporting of token costs and latency across all 12 conditions; the ~40% token reduction for Rules is noted but not shown with per-model breakdowns.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions will be incorporated to strengthen the manuscript.

Referee: [Experiments] Experiments section: the manuscript reports large gains from the generated artifacts but provides no qualitative analysis, examples, or content statistics of the Rules, Examples, or Mixed outputs. Without evidence that these artifacts encode reusable strategies (as opposed to generic heuristics, repeated failure summaries, or simple context-length effects), the interpretation of 'self-evolving memory' remains under-supported even if aggregate returns improve.

Authors: We agree that the absence of qualitative examples and content analysis leaves the interpretation of self-evolving memory under-supported. In the revised manuscript we will add a dedicated subsection presenting representative Rules, Examples, and Mixed artifacts drawn from the CAGE-2 trajectories, together with manual categorization of their content (e.g., frequency of specific defensive heuristics versus generic statements) and basic statistics such as token length distributions. These additions will directly address the concern that gains may stem from context-length effects alone.

Revision: yes

Referee: [Ablation studies] Ablation studies and results tables: while the no-graduation ablation isolates broadcast as the performance driver, the reported averages lack error bars, standard deviations, number of independent runs, or statistical significance tests. This weakens confidence in the 29-72% gains over Reflexion and the claim that broadcast consistently carries the improvements.

Authors: We acknowledge that the current presentation of averages without variability measures or statistical tests reduces confidence in the reported improvements. We will re-execute the primary conditions and the no-graduation ablation across five independent random seeds, add error bars and standard deviations to all tables, and include paired statistical significance tests (e.g., Wilcoxon signed-rank) comparing FORGE against Reflexion. These changes will be reflected in both the main results and ablation sections.

Revision: yes

Referee: [Evaluation] Evaluation setup: all quantitative evidence is confined to a single environment (CAGE-2 with B-line attacker at fixed 30-step horizon). The central claim that FORGE enables reliable self-evolution via population broadcast would be more robust if supported by at least one additional task or attacker to reduce the risk of environment-specific artifacts.

Authors: We recognize that reliance on a single environment and attacker constitutes a genuine limitation for broad claims of reliable self-evolution. CAGE-2 with the B-line attacker was selected because it is a standard stochastic POMDP benchmark in the cybersecurity literature and exhibits the heavy-tailed negative returns that make memory evolution particularly relevant. In the revision we will expand the Limitations section to explicitly discuss the risk of environment-specific artifacts and will outline concrete directions for future evaluation on additional attackers and tasks. Adding new environments and re-running the full experimental suite, however, exceeds the scope of the current revision.

Revision: partial
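The statistical additions promised in the second response (standard errors and a paired Wilcoxon signed-rank test over five seeds) can be sketched with the standard library alone. The function names and the per-seed returns below are hypothetical; the exact p-value is computed by enumerating all sign assignments, which is only practical for small samples and assumes no zero or tied differences.

```python
from itertools import product
from math import sqrt
from statistics import stdev
from typing import Sequence, Tuple


def sem(xs: Sequence[float]) -> float:
    """Standard error of the mean, using the sample standard deviation."""
    return stdev(xs) / sqrt(len(xs))


def wilcoxon_exact(x: Sequence[float], y: Sequence[float]) -> Tuple[int, float]:
    """Return (W+, exact two-sided p) for paired samples without zero or tied differences."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    if any(d == 0 for d in diffs) or len({abs(d) for d in diffs}) != n:
        raise ValueError("zero or tied differences are not handled in this sketch")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern over ranks 1..n is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= w_obs:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Invented per-seed mean returns, for illustration only.
forge = [-28, -35, -22, -30, -41]
reflexion = [-45, -50, -48, -39, -60]
print(wilcoxon_exact(forge, reflexion))  # (15, 0.0625)
```

One consequence worth noting: if the pairs are only the five seeds of a single condition, the smallest two-sided exact p-value this test can produce is 2/32 = 0.0625, so it cannot reach the conventional 0.05 level even when FORGE wins on every seed. Pooling pairs across conditions or running more seeds would be needed for the test to have any power.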

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmark

The paper presents a staged population-based protocol evaluated directly on the external CybORG CAGE-2 stochastic POMDP benchmark against explicit zero-shot and Reflexion baselines. All reported gains (1.7-7.7× over zero-shot, 29-72% over Reflexion) and failure-rate reductions are measured average evaluation returns at 30-step horizon, with ablations (no-graduation) isolating broadcast effects via controlled runs rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No mathematical derivation chain exists that reduces claims to inputs by construction; the protocol is described procedurally and validated against independent benchmark outcomes.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on a small number of experimental hyperparameters and one key domain assumption about reflection quality; no new physical entities are postulated.

free parameters (2)

population size and stage count
Hyperparameters controlling the outer evolutionary loop; values chosen to balance compute and performance gains.
graduation threshold
Criterion for freezing converged instances; directly affects compute savings versus continued broadcast.
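Since the page does not state how the graduation threshold is computed, the following is only one hypothetical form such a test could take: an instance graduates once its most recent evaluation returns are both good enough on average and stable. The function name, the `window` and `max_spread` parameters, and all numbers are assumptions introduced for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence


def has_graduated(
    returns: Sequence[float],
    threshold: float,
    window: int = 3,
    max_spread: float = 10.0,
) -> bool:
    """Hypothetical test: recent returns average at least `threshold` and vary little."""
    if len(returns) < window:
        return False  # not enough evidence yet
    recent = returns[-window:]
    return mean(recent) >= threshold and pstdev(recent) <= max_spread


print(has_graduated([-120, -30, -25, -28], threshold=-40))  # True
print(has_graduated([-10, -90, -10], threshold=-40))        # False
```

The second call shows why a spread condition might matter in a heavy-tailed setting: the mean clears the threshold, but one bad episode in the window signals that the instance has not stabilized. Sweeping `threshold` and `window` in a sketch like this is the kind of sensitivity analysis the referee's first minor comment asks for.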

axioms (1)

domain assumption: A reflection agent using the identical LLM can produce reusable textual heuristics or demonstrations from failed trajectories that improve subsequent agent performance when prompt-injected.
Invoked in the inner Reflexion-style loop and required for any memory evolution to occur.

pith-pipeline@v0.9.0 · 5922 in / 1381 out tokens · 45902 ms · 2026-05-20T18:43:13.305271+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: FORGE wraps a Reflexion-style inner loop... outer loop that propagates the best-performing instance's memory... graduation criterion

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

TTCP CAGE Challenge 2

2022. TTCP CAGE Challenge 2. https://github.com/cage-challenge/cage- challenge-2

work page 2022

[2] [2]

CardiffUni Team. 2022. CybORG CAGE-2 Winning Agent: PPO + Greedy Decoys. https://github.com/john-cardiff/-cyborg-cage-2

work page 2022

[3] [3]

Castro, Roberto Campbell, Nancy Lau, Octavio Villalobos, Jiaqi Duan, and Alvaro A

Sebastián R. Castro, Roberto Campbell, Nancy Lau, Octavio Villalobos, Jiaqi Duan, and Alvaro A. Cardenas. 2025. Large Language Models are Autonomous Cyber Defenders. arXiv:2505.04843 [cs.CR] https://arxiv.org/abs/2505.04843

work page arXiv 2025

[4] [4]

Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. 2024. PromptBreeder: Self-Referential Self- Improvement via Prompt Evolution. InThe Twelfth International Conference on Learning Representations. arXiv:2309.16797 [cs.CL] https://openreview.net/ forum?id=HKkiX32Zw1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, and Honglak Lee. 2024. AutoGuide: Automated Generation and Selection of Context-Aware Guidelines for Large Language Model Agents. InAdvances in Neural Information Processing Systems. arXiv:2403.08978 [cs.AI] https://openreview.net/forum?id=mRIQz8Zd6O

work page arXiv 2024

[6] [6]

Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. 2024. Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers. InThe Twelfth International Conference on Learning Representations. arXiv:2309.08532 [cs.CL] https://openreview.net/forum?id=ZG3RaNIsO8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Population Based Training of Neural Networks

Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Green Tim, Iain Dunning, Karen Simonyan, et al . 2017. Population Based Training of Neural Networks. arXiv:1711.09846 [cs.LG] https://arxiv.org/abs/1711.09846

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

Mitchell Kiely, David Bowman, Maxwell Standen, and Christopher Moir. 2023. On Autonomous Agents in a Cyber Defence Environment. arXiv:2309.07388 [cs.CR] https://arxiv.org/abs/2309.07388

work page arXiv 2023

[9]

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. In Advances in Neural Information Processi...

[10]

Bodhisattwa Prasad Majumder, Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, and Peter Clark. 2024. CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization. In The Twelfth International Conference on Learning Representations. arXiv:2310.10134 [cs.AI] https://openreview.net/forum?...

[11]

Hamoun Mohammadi, Jonathan J. Davis, and Mitchell Kiely. 2025. Leveraging Large Language Models for Autonomous Cyber Defense: Insights from CAGE-2 Simulations. IEEE Intelligent Systems 40, 4 (2025), 29–36. doi:10.1109/MIS.2025.3568209

[12]

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560 [cs.AI] https://arxiv.org/abs/2310.08560

[13]

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST). doi:10.1145/3586183.3606763

[14]

Vishnu Sarukkai, Zhiqiang Xie, and Kayvon Fatahalian. 2025. Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks. arXiv:2505.00234 [cs.LG] https://arxiv.org/abs/2505.00234

[15]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems. https://openreview.net/forum?id=vAElhFcKW6

[16]

Maxwell Standen, Martin Lucas, David Bowman, Toby J. Richer, Junae Kim, and Damian Marriott. 2021. CybORG: A Gym for the Development of Autonomous Cyber Agents. arXiv:2108.09118 [cs.CR] https://arxiv.org/abs/2108.09118

[17]

Mirac Suzgun, Mert Yüksekgönül, Federico Bianchi, Dan Jurafsky, and James Zou. 2025. Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory. arXiv:2504.07952 [cs.LG] https://arxiv.org/abs/2504.07952

[18]

Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D. Nguyen. 2025. Multi-Agent Collaboration Mechanisms: A Survey of LLMs. arXiv:2501.06322 [cs.AI] https://arxiv.org/abs/2501.06322

[19]

Xingchen Wan, Ruoxi Sun, Hootan Nakhost, and Sercan O. Arik. 2024. Teach Better or Show Smarter? On Instructions and Exemplars in Automatic Prompt Optimization. In Advances in Neural Information Processing Systems. arXiv:2406.15708 [cs.CL] https://openreview.net/forum?id=IdtoJVWVnX

[20]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291 [cs.AI] https://arxiv.org/abs/2305.16291

[21]

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. 2025. Agent Workflow Memory. In International Conference on Machine Learning. arXiv:2409.07429 [cs.AI] https://openreview.net/forum?id=NTAhi2JEEE

[22]

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. 2024. Large Language Models as Optimizers. In The Twelfth International Conference on Learning Representations. arXiv:2309.03409 [cs.LG] https://openreview.net/forum?id=Bb4VGOWELI

[23]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations. arXiv:2210.03629 [cs.CL] https://openreview.net/forum?id=WE_vluYUL-X

[24]

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. 2024. TextGrad: Automatic "Differentiation" via Text. arXiv:2406.07496 [cs.CL] https://arxiv.org/abs/2406.07496

[25]

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. 2025. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv:2510.04618 [cs.AI] https://arxiv.org/abs/2510.04618

[26]

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. 2024. ExpeL: LLM Agents Are Experiential Learners. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19632–19642. arXiv:2308.10144 [cs.AI] doi:10.1609/aaai.v38i17.29936

A Ethics Statement & Reproducibility

All authors adhere to the ACM Code of Ethic...
