FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
Pith reviewed 2026-05-20 18:43 UTC · model grok-4.3
The pith
A population of LLM agents can evolve effective natural-language memory from their own failures and broadcast the best versions to each other, lifting performance 1.7 to 7.7 times over zero-shot baselines with no model weight updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FORGE runs an inner Reflexion-style loop in which a reflection agent using the identical base model converts failed trajectories into one of three knowledge artifacts (Rules, Examples, or Mixed). An outer population loop then broadcasts the highest-performing artifact to every agent at the end of each stage and applies a graduation test to lock in converged instances. On the CybORG CAGE-2 stochastic POMDP against the B-line attacker, this protocol raises average evaluation return by 1.7-7.7 times over zero-shot and by 29-72 percent over isolated Reflexion across all twelve model-representation pairs while dropping major-failure rates to roughly one percent.
What carries the argument
The population broadcast step that copies the single best memory artifact from the current stage to every active agent, paired with a graduation criterion that removes converged agents from further evolution.
If this is right
- Population broadcast itself drives the measured gains; removing graduation changes little beyond compute cost.
- Example-based artifacts deliver the highest returns for three of the four tested models.
- Rule-based artifacts provide the best balance of performance and token cost, using about 40 percent fewer tokens.
- Models that start weaker show the largest relative improvement, indicating the method narrows rather than widens capability gaps.
- All reported gains remain confined to the CAGE-2 B-line setting.
Where Pith is reading between the lines
- The same broadcast-and-graduation structure could be tested on other partially observable long-horizon tasks such as robotic planning or multi-agent games.
- If the reflection step works across model families, the approach offers a route to agent improvement that scales with population size instead of model size.
- A natural next measurement would be whether graduated memory remains useful when the underlying attacker strategy changes mid-episode.
Load-bearing premise
A reflection agent using the same base model can reliably produce knowledge artifacts from failures that, once injected into prompts and broadcast, produce stable performance gains in the downstream population.
What would settle it
Apply the same broadcast protocol on a different stochastic task or with a fresh set of models and observe that average returns remain at or below the isolated Reflexion baseline.
Figures
read the original abstract
Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FORGE, a staged population-based protocol for evolving prompt-injected natural-language memory in hierarchical ReAct LLM agents without weight updates. A reflection agent (same underlying LLM) converts failed trajectories into Rules, Examples, or Mixed artifacts; an outer loop broadcasts the best-performing memory across the population and freezes converged instances via graduation. On CybORG CAGE-2 (stochastic network-defense POMDP, 30-step horizon, B-line attacker), FORGE yields 1.7-7.7× higher average returns than zero-shot and 29-72% gains over Reflexion across 12 model-representation conditions with four LLMs (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B), reducing major-failure rates to ~1%. Ablations identify population broadcast as the key driver while graduation mainly saves compute; Examples performs strongest for most models and Rules offers better cost-reliability.
Significance. If the central performance claims hold under more rigorous verification, the work would be significant for showing that LLM agents can achieve substantial self-improvement through natural-language memory evolution and population broadcast using only inference-time operations and no stronger teacher model. The directional finding that weaker baselines benefit disproportionately is a useful observation for closing capability gaps. The clean isolation of broadcast via the no-graduation ablation and the token-efficiency comparison between Rules and Examples are concrete contributions to agent memory design.
major comments (3)
- [Experiments] Experiments section: the manuscript reports large gains from the generated artifacts but provides no qualitative analysis, examples, or content statistics of the Rules, Examples, or Mixed outputs. Without evidence that these artifacts encode reusable strategies (as opposed to generic heuristics, repeated failure summaries, or simple context-length effects), the interpretation of 'self-evolving memory' remains under-supported even if aggregate returns improve.
- [Ablation studies] Ablation studies and results tables: while the no-graduation ablation isolates broadcast as the performance driver, the reported averages lack error bars, standard deviations, number of independent runs, or statistical significance tests. This weakens confidence in the 29-72% gains over Reflexion and the claim that broadcast consistently carries the improvements.
- [Evaluation] Evaluation setup: all quantitative evidence is confined to a single environment (CAGE-2 with B-line attacker at fixed 30-step horizon). The central claim that FORGE enables reliable self-evolution via population broadcast would be more robust if supported by at least one additional task or attacker to reduce the risk of environment-specific artifacts.
minor comments (2)
- [Method] Clarify the exact definition and computation of the graduation threshold and population size in the method section, as these are listed as free parameters yet their sensitivity is not fully explored.
- [Results] Ensure consistent reporting of token costs and latency across all 12 conditions; the ~40% token reduction for Rules is noted but not shown with per-model breakdowns.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions will be incorporated to strengthen the manuscript.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the manuscript reports large gains from the generated artifacts but provides no qualitative analysis, examples, or content statistics of the Rules, Examples, or Mixed outputs. Without evidence that these artifacts encode reusable strategies (as opposed to generic heuristics, repeated failure summaries, or simple context-length effects), the interpretation of 'self-evolving memory' remains under-supported even if aggregate returns improve.
Authors: We agree that the absence of qualitative examples and content analysis leaves the interpretation of self-evolving memory under-supported. In the revised manuscript we will add a dedicated subsection presenting representative Rules, Examples, and Mixed artifacts drawn from the CAGE-2 trajectories, together with manual categorization of their content (e.g., frequency of specific defensive heuristics versus generic statements) and basic statistics such as token length distributions. These additions will directly address the concern that gains may stem from context-length effects alone. revision: yes
-
Referee: [Ablation studies] Ablation studies and results tables: while the no-graduation ablation isolates broadcast as the performance driver, the reported averages lack error bars, standard deviations, number of independent runs, or statistical significance tests. This weakens confidence in the 29-72% gains over Reflexion and the claim that broadcast consistently carries the improvements.
Authors: We acknowledge that the current presentation of averages without variability measures or statistical tests reduces confidence in the reported improvements. We will re-execute the primary conditions and the no-graduation ablation across five independent random seeds, add error bars and standard deviations to all tables, and include paired statistical significance tests (e.g., Wilcoxon signed-rank) comparing FORGE against Reflexion. These changes will be reflected in both the main results and ablation sections. revision: yes
-
Referee: [Evaluation] Evaluation setup: all quantitative evidence is confined to a single environment (CAGE-2 with B-line attacker at fixed 30-step horizon). The central claim that FORGE enables reliable self-evolution via population broadcast would be more robust if supported by at least one additional task or attacker to reduce the risk of environment-specific artifacts.
Authors: We recognize that reliance on a single environment and attacker constitutes a genuine limitation for broad claims of reliable self-evolution. CAGE-2 with the B-line attacker was selected because it is a standard stochastic POMDP benchmark in the cybersecurity literature and exhibits the heavy-tailed negative returns that make memory evolution particularly relevant. In the revision we will expand the Limitations section to explicitly discuss the risk of environment-specific artifacts and will outline concrete directions for future evaluation on additional attackers and tasks. Adding new environments and re-running the full experimental suite, however, exceeds the scope of the current revision. revision: partial
Circularity Check
No significant circularity; empirical results on external benchmark
full rationale
The paper presents a staged population-based protocol evaluated directly on the external CybORG CAGE-2 stochastic POMDP benchmark against explicit zero-shot and Reflexion baselines. All reported gains (1.7-7.7× over zero-shot, 29-72% over Reflexion) and failure-rate reductions are measured average evaluation returns at 30-step horizon, with ablations (no-graduation) isolating broadcast effects via controlled runs rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No mathematical derivation chain exists that reduces claims to inputs by construction; the protocol is described procedurally and validated against independent benchmark outcomes.
Axiom & Free-Parameter Ledger
free parameters (2)
- population size and stage count
- graduation threshold
axioms (1)
- domain assumption A reflection agent using the identical LLM can produce reusable textual heuristics or demonstrations from failed trajectories that improve subsequent agent performance when prompt-injected.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
FORGE wraps a Reflexion-style inner loop... outer loop that propagates the best-performing instance's memory... graduation criterion
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
2022. TTCP CAGE Challenge 2. https://github.com/cage-challenge/cage- challenge-2
work page 2022
-
[2]
CardiffUni Team. 2022. CybORG CAGE-2 Winning Agent: PPO + Greedy Decoys. https://github.com/john-cardiff/-cyborg-cage-2
work page 2022
-
[3]
Castro, Roberto Campbell, Nancy Lau, Octavio Villalobos, Jiaqi Duan, and Alvaro A
Sebastián R. Castro, Roberto Campbell, Nancy Lau, Octavio Villalobos, Jiaqi Duan, and Alvaro A. Cardenas. 2025. Large Language Models are Autonomous Cyber Defenders. arXiv:2505.04843 [cs.CR] https://arxiv.org/abs/2505.04843
-
[4]
Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. 2024. PromptBreeder: Self-Referential Self- Improvement via Prompt Evolution. InThe Twelfth International Conference on Learning Representations. arXiv:2309.16797 [cs.CL] https://openreview.net/ forum?id=HKkiX32Zw1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, and Honglak Lee. 2024. AutoGuide: Automated Generation and Selection of Context-Aware Guidelines for Large Language Model Agents. InAdvances in Neural Information Processing Systems. arXiv:2403.08978 [cs.AI] https://openreview.net/forum?id=mRIQz8Zd6O
-
[6]
Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. 2024. Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers. InThe Twelfth International Conference on Learning Representations. arXiv:2309.08532 [cs.CL] https://openreview.net/forum?id=ZG3RaNIsO8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
Population Based Training of Neural Networks
Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Green Tim, Iain Dunning, Karen Simonyan, et al . 2017. Population Based Training of Neural Networks. arXiv:1711.09846 [cs.LG] https://arxiv.org/abs/1711.09846
work page internal anchor Pith review Pith/arXiv arXiv 2017
- [8]
-
[9]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. InAdvances in Neural Information Processi...
work page 2023
-
[10]
Bodhisattwa Prasad Majumder, Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, and Peter Clark. 2024. CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Gener- alization. InThe Twelfth International Conference on Learning Representations. arXiv:2310.10134 [cs.AI] https://openreview.net/forum?...
-
[11]
Hamoun Mohammadi, Jonathan J. Davis, and Mitchell Kiely. 2025. Leveraging Large Language Models for Autonomous Cyber Defense: Insights from CAGE-2 Simulations.IEEE Intelligent Systems40, 4 (2025), 29–36. doi:10.1109/MIS.2025. 3568209
-
[12]
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560 [cs.AI] https://arxiv.org/abs/2310.08560
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST). doi:10.1145/3586183.3606763
- [14]
-
[15]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforce- ment Learning. InAdvances in Neural Information Processing Systems. https: //openreview.net/forum?id=vAElhFcKW6
work page 2023
-
[16]
Richer, Junae Kim, and Damian Marriott
Maxwell Standen, Martin Lucas, David Bowman, Toby J. Richer, Junae Kim, and Damian Marriott. 2021. CybORG: A Gym for the Development of Autonomous Cyber Agents. arXiv:2108.09118 [cs.CR] https://arxiv.org/abs/2108.09118
- [17]
-
[18]
Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D. Nguyen. 2025. Multi-Agent Collaboration Mechanisms: A Survey of LLMs. arXiv:2501.06322 [cs.AI] https://arxiv.org/abs/2501.06322
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Xingchen Wan, Ruoxi Sun, Hootan Nakhost, and Sercan O. Arik. 2024. Teach Better or Show Smarter? On Instructions and Exemplars in Automatic Prompt Optimization. InAdvances in Neural Information Processing Systems. arXiv:2406.15708 [cs.CL] https://openreview.net/forum?id=IdtoJVWVnX
-
[20]
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291 [cs.AI] https://arxiv.org/ abs/2305.16291
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. 2025. Agent Workflow Memory. InInternational Conference on Machine Learning. arXiv:2409.07429 [cs.AI] https://openreview.net/forum?id=NTAhi2JEEE
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Large Language Models as Optimizers
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. 2024. Large Language Models as Optimizers. InThe Twelfth International Conference on Learning Representations. arXiv:2309.03409 [cs.LG] https://openreview.net/forum?id=Bb4VGOWELI
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InInternational Conference on Learning Representations. arXiv:2210.03629 [cs.CL] https://openreview.net/forum?id=WE_vluYUL-X
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
TextGrad: Automatic "Differentiation" via Text
Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. 2024. TextGrad: Automatic "Differentiation" via Text. arXiv:2406.07496 [cs.CL] https://arxiv.org/abs/2406.07496
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. 2025. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv:2510.04618 [cs.AI] https://arxiv.org/abs/2510.04618
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. 2024. ExpeL: LLM Agents Are Experiential Learners. InPro- ceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19632–19642. arXiv:2308.10144 [cs.AI] doi:10.1609/aaai.v38i17.29936 A Ethics Statement & Reproducibility All authors adhere to the ACM Code of Ethic...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.