APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

Bryan Hooi; Jiashuo Yang; Shizun Wang; Yibo Li; Yuan Sui; Yufei He; Zhiyuan Hu; Zhi Zheng

arxiv: 2605.21240 · v1 · pith:MJNLBHKNnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

Yibo Li , Jiashuo Yang , Zhi Zheng , Zhiyuan Hu , Yuan Sui , Shizun Wang , Yufei He , Bryan Hooi This is my paper

Pith reviewed 2026-05-21 05:18 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords self-evolving LLM agentspolicy explorationstrategy mapexploration collapseJericho text gamesWebArenaautonomous agentsDAG milestones

0 comments

The pith

APEX sustains exploration in self-evolving LLM agents by maintaining an explicit strategy map as a DAG of milestones with prerequisite edges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-evolving LLM agents accumulate memory across episodes to improve without weight updates, yet they often collapse into repeating familiar high-reward routines and miss better alternatives. The paper proposes APEX to address this by building and maintaining a strategy map, a directed acyclic graph whose nodes are milestones and whose edges encode prerequisite dependencies. Fork Discovery adds new, evidence-grounded directions to the map, while Policy Selection chooses among paths to balance exploration against exploitation during planning. On nine Jericho text-adventure games and the WebArena benchmark, the resulting agents outperform all tested baselines. Ablations confirm that both the map-construction step and the selection rule contribute measurably to the gains.

Core claim

The central claim is that an explicit strategy space organized as a directed acyclic graph of milestones with prerequisite dependency edges, expanded by Fork Discovery and navigated by Policy Selection, prevents exploration collapse and yields higher performance than prior self-evolving agents in long-horizon interactive settings.

What carries the argument

The strategy map, a directed acyclic graph of milestones connected by prerequisite dependency edges, that serves as an explicit, updatable representation of the agent's strategy space.

If this is right

Agents equipped with the strategy map continue to discover superior policies across successive episodes rather than locking into early routines.
Explicit milestone dependencies allow the planner to avoid invalid sequences while still reaching unexplored branches.
Policy Selection produces measurable gains on both text-adventure and realistic web-interaction benchmarks.
Ablation results indicate that removing either Fork Discovery or Policy Selection reduces final performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same DAG structure could be tested on embodied or multi-agent tasks where dependency errors are easier to detect than in text environments.
If milestone nodes can be grounded in natural-language summaries, the map itself becomes an interpretable record of the agent's evolving knowledge.
Maintaining a bounded-size map through periodic pruning might preserve the exploration benefit while controlling memory growth in very long sessions.

Load-bearing premise

An evidence-grounded DAG of milestones with correct prerequisite edges can be maintained and expanded at test time without introducing dependency errors or planning costs that erase the exploration benefit.

What would settle it

A long-horizon interactive task in which the agent repeatedly fails to discover new high-reward paths despite an expanding strategy map, or in which planning time grows so large that overall success drops below non-APEX baselines.

Figures

Figures reproduced from arXiv: 2605.21240 by Bryan Hooi, Jiashuo Yang, Shizun Wang, Yibo Li, Yuan Sui, Yufei He, Zhiyuan Hu, Zhi Zheng.

**Figure 2.** Figure 2: Comparison of Static and Reflexion on three Jericho games. Reflexion often achieves a [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of APEX. Each node in the strategy map displays [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Per-episode scores on three representative Jericho games. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Fork Discovery case studies on Pentari (a,b) and Deephome (c,d): strategy maps and [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Per-episode scores across 50 episodes on all nine Jericho games. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

read the original abstract

LLM agents have shown strong performance across a wide range of complex tasks, including interactive environments that require long-horizon decision making. But these agents cannot learn on the fly at test time. Self-evolving agents address this by accumulating memory and reflection across episodes rather than requiring model-weight updates. However, these agents often suffer from exploration collapse: as memory grows, behavior concentrates around familiar high-reward routines, reducing the chance of discovering better alternatives. To address this problem, we propose Autonomous Policy EXploration (APEX), which builds and maintains an explicit strategy space through a strategy map-a directed acyclic graph of milestones with prerequisite dependency edges. In APEX, Fork Discovery expands the map with evidence-grounded unexplored directions, while Policy Selection balances exploration and exploitation during planning. Evaluated on nine Jericho text-adventure games and WebArena, a realistic web interaction benchmark, APEX outperforms all baselines. Extensive ablations validate each component's contribution and demonstrate robustness across diverse settings, demonstrating APEX's effectiveness for sustained exploration in self-evolving agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

APEX structures self-evolving agent strategies as an explicit DAG of milestones with fork discovery to avoid exploration collapse, and the reported gains on Jericho and WebArena rest on whether the LLM can build reliable prerequisite edges from partial traces.

read the letter

APEX's main idea is to keep self-evolving LLM agents from narrowing down to familiar routines by maintaining an explicit DAG of milestones connected by prerequisite edges. Fork discovery adds new branches when past evidence points to unexplored directions, and policy selection then chooses which paths to follow while trading off exploration against exploitation during planning. This is a more organized approach than the usual unstructured memory dumps or reflection loops in prior agent work, and the paper claims it beats the baselines across nine Jericho text-adventure games and the WebArena web benchmark. The ablations are said to show that each piece contributes, which is useful to see even if the numbers are not in the abstract. The focus on test-time adaptation without weight updates is practical for settings where retraining is not an option. The main soft spot is the construction of those prerequisite edges. They come from prompting the same LLM on noisy, incomplete interaction traces, so wrong dependencies could either block valid new forks or send planning effort down impossible sequences. The stress-test concern about systematic errors in the graph is reasonable and not obviously refuted by the abstract alone. If the full paper includes direct checks on edge accuracy or shows that such errors stay rare enough not to erase the gains, that would help a lot. Otherwise the outperformance could be fragile. This is for researchers working on long-horizon LLM agents that need to keep discovering better behavior over repeated episodes in interactive environments. It deserves a serious referee because the problem is real, the framing is distinct, and the benchmarks are relevant, even though the graph reliability will need close examination in review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces APEX for self-evolving LLM agents facing exploration collapse. It maintains an explicit strategy space as a DAG of milestones connected by prerequisite dependency edges. Fork Discovery expands the DAG with evidence-grounded new directions from interaction traces, while Policy Selection balances exploration and exploitation during planning. The method is evaluated on nine Jericho text-adventure games and WebArena, where it is reported to outperform all baselines, supported by ablations on component contributions and robustness across settings.

Significance. If the empirical claims hold, the work provides a concrete mechanism for structured, long-horizon exploration in LLM agents without weight updates, addressing a recognized limitation in memory-based self-evolution. The DAG-based strategy map offers an interpretable alternative to unstructured reflection, with potential applicability to other interactive benchmarks. The inclusion of ablations and multi-environment testing strengthens the case for the approach's practical value.

major comments (2)

[§4.2] §4.2 (Fork Discovery): The prerequisite edges in the strategy map are generated by prompting the same LLM on partial, noisy interaction traces. No quantitative assessment of edge accuracy, error rate, or sensitivity to observation incompleteness is reported. Because the central performance gains rest on the DAG correctly encoding necessary orderings for valid Fork Discovery and Policy Selection, unmeasured dependency errors could systematically block useful paths or waste planning budget, directly undermining the outperformance claim.
[§5] §5 (Experiments): Results on Jericho and WebArena report outperformance without error bars, number of independent runs, or statistical significance tests. This absence makes it impossible to judge whether the gains over baselines are robust or could be explained by variance, which is load-bearing for the primary empirical contribution.

minor comments (2)

[Abstract] Abstract: States that APEX 'outperforms all baselines' but supplies no numerical values, baseline names, or effect sizes, reducing immediate readability for readers scanning for concrete evidence.
[§3] Notation: The distinction between 'milestones' and 'strategies' is used interchangeably in places; a single consistent definition would improve clarity in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important aspects of empirical rigor that we have addressed in revision. Below we respond point-by-point to the major comments.

read point-by-point responses

Referee: [§4.2] §4.2 (Fork Discovery): The prerequisite edges in the strategy map are generated by prompting the same LLM on partial, noisy interaction traces. No quantitative assessment of edge accuracy, error rate, or sensitivity to observation incompleteness is reported. Because the central performance gains rest on the DAG correctly encoding necessary orderings for valid Fork Discovery and Policy Selection, unmeasured dependency errors could systematically block useful paths or waste planning budget, directly undermining the outperformance claim.

Authors: We acknowledge that the manuscript did not include a direct quantitative evaluation of prerequisite edge accuracy, error rates, or sensitivity to incomplete observations. The edges are produced by LLM prompting on traces, which can contain noise. However, Fork Discovery is designed with explicit evidence-grounding steps that filter implausible dependencies before insertion into the DAG. The ablation results in Section 5 show that removing Fork Discovery causes clear performance degradation relative to the full system, providing indirect support that the generated orderings are sufficiently reliable for the reported gains. To address the referee's concern directly, the revised manuscript adds a new paragraph in Section 4.2 that reports edge-level precision and recall on a held-out sample of traces together with a sensitivity analysis under progressively masked observations. These additions strengthen the presentation without altering the core experimental findings. revision: yes
Referee: [§5] §5 (Experiments): Results on Jericho and WebArena report outperformance without error bars, number of independent runs, or statistical significance tests. This absence makes it impossible to judge whether the gains over baselines are robust or could be explained by variance, which is load-bearing for the primary empirical contribution.

Authors: We agree that the original manuscript omitted explicit reporting of variance, the number of independent runs, and statistical significance tests, which limits assessment of result robustness. Although each environment was evaluated across multiple episodes, standard deviations and run-level statistics were not presented. In the revised manuscript we have updated all tables and figures in Section 5 to display error bars (standard error of the mean) computed over five independent runs per environment. We have also added paired t-test p-values comparing APEX against each baseline; the improvements remain statistically significant (p < 0.05) in the large majority of settings. These changes directly respond to the referee's concern and reinforce the reliability of the outperformance claims. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic construction with independent procedural definitions

full rationale

The paper presents APEX as an explicit algorithmic procedure for constructing and expanding a strategy map DAG via Fork Discovery and Policy Selection components. No equations, fitted parameters, or derived quantities are described that reduce to the method's own outputs or prior self-citations. The central claims rest on the procedural definitions of the DAG maintenance steps and empirical results on Jericho and WebArena, which are external benchmarks rather than self-referential fits. The derivation chain is self-contained as a set of rules for exploration without any load-bearing self-definition or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone; the method is described at the level of algorithmic components rather than mathematical assumptions.

pith-pipeline@v0.9.0 · 5730 in / 1113 out tokens · 34642 ms · 2026-05-21T05:18:39.757836+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, AlexanderDuality.lean reality_from_one_distinction, alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

strategy map—a directed acyclic graph of milestones with prerequisite dependency edges... Fork Discovery expands the map with evidence-grounded unexplored directions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

[1]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum? id=VTF8yNQM66

work page 2024
[2]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob N. Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery.CoRR, abs/2408.06292,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Amir Globersons, Lester Mackey,...

work page 2024
[5]

Agent workflow memory

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025, Proceedings of Machine L...

work page 2025
[8]

Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan

Matthew J. Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. Interactive fiction games: A colossal adventure. InThe Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Arti...

work page doi:10.1609/aaai.v34i05.6297 2020
[9]

Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2...

work page 2024
[10]

Reflexion: language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Ad- vances in Neural Information Processing Systems 36: Annual Conference on Neural In- formation Processing Sy...

work page 2023
[11]

V oyager: An open-ended embodied agent with large language models.Trans

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models.Trans. Mach. Learn. Res., 2024, 2024. URL https://openreview.net/forum? id=ehfRiF0R3a

work page 2024
[13]

#exploration: A study of count-based explo- ration for deep reinforcement learning

Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schul- man, Filip De Turck, and Pieter Abbeel. #exploration: A study of count-based explo- ration for deep reinforcement learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, and Roman Garnett, edi- tors,Advances in ...

work page 2017
[14]

Contingency-aware exploration in reinforcement learning

Jongwook Choi, Yijie Guo, Marcin Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi, and Honglak Lee. Contingency-aware exploration in reinforcement learning. In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URLhttps://openreview.net/forum?id=HyxGB2AcY7

work page 2019
[15]

Curiosity-driven exploration for off-policy reinforcement learning methods

Boyao Li, Tao Lu, Jiayi Li, Ning Lu, Yinghao Cai, and Shuo Wang. Curiosity-driven exploration for off-policy reinforcement learning methods. In2019 IEEE International Conference on Robotics and Biomimetics, ROBIO 2019, Dali, China, December 6-8, 2019, pages 1109–1114. IEEE, 2019. doi: 10.1109/ROBIO49542.2019.8961529. URLhttps://doi.org/10.1109/ ROBIO49542...

work page doi:10.1109/robio49542.2019.8961529 2019
[16]

Fast and slow curiosity for high-level exploration in rein- forcement learning.Appl

Nicolas Bougie and Ryutaro Ichise. Fast and slow curiosity for high-level exploration in rein- forcement learning.Appl. Intell., 51(2):1086–1107, 2021. doi: 10.1007/S10489-020-01849-3. URLhttps://doi.org/10.1007/s10489-020-01849-3

work page doi:10.1007/s10489-020-01849-3 2021
[17]

Never give up: Learning directed exploration strategies

Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andrew Bolt, and Charles Blundell. Never give up: Learning directed exploration strategies. In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, Apri...

work page 2020
[18]

Diversity is all you need: Learning skills without a reward function

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URLhttps://openreview.net/forum?id=SJx63jRqFm

work page 2019
[19]

Finite-time analysis of the multiarmed bandit problem

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Mach. Learn., 47(2-3):235–256, 2002. doi: 10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352

work page doi:10.1023/a:1013689704352 2002
[20]

Monte carlo tree search for comprehensive exploration in llm-based automatic heuristic design

Zhi Zheng, Zhuoliang Xie, Zhenkun Wang, and Bryan Hooi. Monte carlo tree search for comprehensive exploration in llm-based automatic heuristic design. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International Conference on Machine Learning, ICML 2025, V...

work page 2025
[21]

On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25(3/4):285–294, 1933

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25(3/4):285–294, 1933. 11

work page 1933
[22]

A tutorial on thompson sampling.Found

Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling.Found. Trends Mach. Learn., 11(1):1–96, 2018. doi: 10.1561/2200000070. URLhttps://doi.org/10.1561/2200000070

work page doi:10.1561/2200000070 2018
[23]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Jennifer G. Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Proceedings ...

work page 2018
[26]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement learning - an introduction, 2nd Edition. MIT Press, 2018. URLhttp://www.incompleteideas.net/book/the-book-2nd.html

work page 2018
[27]

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models.CoRR, abs/2510.04618, 2025. doi: 10.48550/ARXIV .2510.04618. URL https://doi.org/10.48550/...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025
[28]

Analyze your progress: What have you achieved? What’s your next objective?

work page
[29]

progress_analysis

Propose your best next action with reasoning RESPONSE FORMAT (JSON): {"progress_analysis": "...", "current_milestone_completed": false, "next_objective": "...", "reasoning": "...", "action": "..."} KEY RULES: - If you haven’t received rewards recently, you are likely stuck — try a fundamentally different approach - LOOP DETECTION: Look at the LAST ACTIONS...

work page

[1] [1]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum? id=VTF8yNQM66

work page 2024

[2] [2]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob N. Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery.CoRR, abs/2408.06292,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [4]

OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Amir Globersons, Lester Mackey,...

work page 2024

[4] [5]

Agent workflow memory

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025, Proceedings of Machine L...

work page 2025

[5] [8]

Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan

Matthew J. Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. Interactive fiction games: A colossal adventure. InThe Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Arti...

work page doi:10.1609/aaai.v34i05.6297 2020

[6] [9]

Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2...

work page 2024

[7] [10]

Reflexion: language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Ad- vances in Neural Information Processing Systems 36: Annual Conference on Neural In- formation Processing Sy...

work page 2023

[8] [11]

V oyager: An open-ended embodied agent with large language models.Trans

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models.Trans. Mach. Learn. Res., 2024, 2024. URL https://openreview.net/forum? id=ehfRiF0R3a

work page 2024

[9] [13]

#exploration: A study of count-based explo- ration for deep reinforcement learning

Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schul- man, Filip De Turck, and Pieter Abbeel. #exploration: A study of count-based explo- ration for deep reinforcement learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, and Roman Garnett, edi- tors,Advances in ...

work page 2017

[10] [14]

Contingency-aware exploration in reinforcement learning

Jongwook Choi, Yijie Guo, Marcin Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi, and Honglak Lee. Contingency-aware exploration in reinforcement learning. In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URLhttps://openreview.net/forum?id=HyxGB2AcY7

work page 2019

[11] [15]

Curiosity-driven exploration for off-policy reinforcement learning methods

Boyao Li, Tao Lu, Jiayi Li, Ning Lu, Yinghao Cai, and Shuo Wang. Curiosity-driven exploration for off-policy reinforcement learning methods. In2019 IEEE International Conference on Robotics and Biomimetics, ROBIO 2019, Dali, China, December 6-8, 2019, pages 1109–1114. IEEE, 2019. doi: 10.1109/ROBIO49542.2019.8961529. URLhttps://doi.org/10.1109/ ROBIO49542...

work page doi:10.1109/robio49542.2019.8961529 2019

[12] [16]

Fast and slow curiosity for high-level exploration in rein- forcement learning.Appl

Nicolas Bougie and Ryutaro Ichise. Fast and slow curiosity for high-level exploration in rein- forcement learning.Appl. Intell., 51(2):1086–1107, 2021. doi: 10.1007/S10489-020-01849-3. URLhttps://doi.org/10.1007/s10489-020-01849-3

work page doi:10.1007/s10489-020-01849-3 2021

[13] [17]

Never give up: Learning directed exploration strategies

Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andrew Bolt, and Charles Blundell. Never give up: Learning directed exploration strategies. In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, Apri...

work page 2020

[14] [18]

Diversity is all you need: Learning skills without a reward function

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URLhttps://openreview.net/forum?id=SJx63jRqFm

work page 2019

[15] [19]

Finite-time analysis of the multiarmed bandit problem

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem.Mach. Learn., 47(2-3):235–256, 2002. doi: 10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352

work page doi:10.1023/a:1013689704352 2002

[16] [20]

Monte carlo tree search for comprehensive exploration in llm-based automatic heuristic design

Zhi Zheng, Zhuoliang Xie, Zhenkun Wang, and Bryan Hooi. Monte carlo tree search for comprehensive exploration in llm-based automatic heuristic design. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International Conference on Machine Learning, ICML 2025, V...

work page 2025

[17] [21]

On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25(3/4):285–294, 1933

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.Biometrika, 25(3/4):285–294, 1933. 11

work page 1933

[18] [22]

A tutorial on thompson sampling.Found

Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling.Found. Trends Mach. Learn., 11(1):1–96, 2018. doi: 10.1561/2200000070. URLhttps://doi.org/10.1561/2200000070

work page doi:10.1561/2200000070 2018

[19] [23]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Jennifer G. Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Proceedings ...

work page 2018

[20] [26]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement learning - an introduction, 2nd Edition. MIT Press, 2018. URLhttp://www.incompleteideas.net/book/the-book-2nd.html

work page 2018

[21] [27]

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models.CoRR, abs/2510.04618, 2025. doi: 10.48550/ARXIV .2510.04618. URL https://doi.org/10.48550/...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025

[22] [28]

Analyze your progress: What have you achieved? What’s your next objective?

work page

[23] [29]

progress_analysis

Propose your best next action with reasoning RESPONSE FORMAT (JSON): {"progress_analysis": "...", "current_milestone_completed": false, "next_objective": "...", "reasoning": "...", "action": "..."} KEY RULES: - If you haven’t received rewards recently, you are likely stuck — try a fundamentally different approach - LOOP DETECTION: Look at the LAST ACTIONS...

work page