pith. sign in

arxiv: 2606.10507 · v1 · pith:V45XJJ6Tnew · submitted 2026-06-09 · 💻 cs.AI

HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

Pith reviewed 2026-06-27 13:02 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agentshierarchical planninginformation foldinglong-horizon taskssubgoal decompositionprocess rewardsautonomous agentscontext management
0
0 comments X

The pith

HIPIF trains LLM agents to break long tasks into subgoals and fold away completed histories to preserve reasoning quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix how LLM agents lose track of overall goals as conversation histories grow during multi-turn tasks. It does this by training agents end-to-end to plan around explicit subgoals while folding finished subgoal records into compact summaries. Hierarchical reflection and subgoal-specific process rewards then stabilize planning and execution. Readers should care because current approaches either need expensive extra models or expert data, yet still suffer from context overload on realistic long-horizon problems.

Core claim

HIPIF trains the agent end-to-end to organize long-horizon execution around explicit subgoals while folding completed subgoal histories to reduce long-context interference. To stabilize subgoal-based planning and execution, it combines hierarchical reflection and subgoal-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task-specific expert trajectories.

What carries the argument

Information folding, which summarizes and removes completed subgoal histories while keeping only the information needed for ongoing global state tracking and future decisions.

If this is right

  • Agents maintain better global task state tracking across extended sequences without external supervision.
  • Subgoal generation and transitions become more stable through the combination of reflection and process rewards.
  • Performance gains appear on standard agent benchmarks without task-specific expert trajectories or auxiliary models.
  • End-to-end training directly optimizes the agent for reduced long-context interference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The folding approach could be tested on non-agent LLM tasks that involve long documents or multi-step reasoning chains.
  • If folding works reliably, it might allow smaller context windows to suffice for complex agent workloads.
  • Future work could measure whether the learned subgoal rewards transfer across different benchmark domains.

Load-bearing premise

Folding completed subgoal histories removes only non-essential information and does not discard details required for correct future reasoning or global state tracking.

What would settle it

If agents trained with HIPIF still show degraded performance on long-horizon benchmarks where early subgoal details are required for later correct decisions, compared to unfolder baselines, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.10507 by Changyuan Tian, Jingang Wang, Juncheng Diao, Peiguang Li, Qingbin Li, Rongxiang Weng, Xunliang Cai, Yongwei Zhou, Zhicong Lu.

Figure 1
Figure 1. Figure 1: Overview of the design of HIPIF. (a): End-to-End Training for Hierarchical Planning and [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Validation success-rate curves of 3B models on three benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Per-step token consumption. Token Efficiency [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Sensitivity analysis of the subgoal-oriented process rewards on AlfWorld. [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Prompt template for ALFWorld. Folded Memory Templates. Figures 8, 9, and 10 show folded subgoal-level memory examples for ALFWorld, VirtualHome, and ScienceWorld, respectively. The memory is organized to support both global task tracking and local subgoal execution. Specifically, we first present the current subgoal memory, which contains the active subgoal and its recent action-observation records. We the… view at source ↗
Figure 6
Figure 6. Figure 6: prompt template for ScienceWorld. More importantly, the curves show that the gains of HIPIF are not only reflected in the final validation accuracy, but also in the overall training trajectory. Compared with GRPO, subgoal-based training can be less stable at the early stage, indicating that explicit subgoals introduce additional optimization difficulties. The later improvement of HIPIF shows that hierarchi… view at source ↗
Figure 7
Figure 7. Figure 7: Prompt template for VirtualHome. H Limitations. HIPIF is primarily evaluated in simulated long-horizon interaction benchmarks with structured observations and action spaces. While these environments provide controlled and reproducible testbeds, extending the framework to more open-ended real-world settings may require additional perception and action-grounding components. Moreover, HIPIF uses structured ou… view at source ↗
Figure 8
Figure 8. Figure 8: Example of the folded subgoal-level memory used by HIPIF. [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example of the folded subgoal-level memory used by HIPIF on VirtualHome. [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Example of the folded subgoal-level memory used by HIPIF on ScienceWorld. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Validation success-rate curves of 7B models on ALFWorld, VirtualHome, and Science [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Validation success-rate curves comparing GiGPO, HIPIF, and HIPIF+GiGPO on the 3B [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
read the original abstract

While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi-turn long-horizon agentic tasks. Existing methods have made progress through fine-grained credit assignment to alleviate long-horizon sparse rewards and hierarchical reinforcement learning to decompose tasks and reduce long-term dependency. However, these methods still do not directly address long-context interference, in which continuously growing histories weaken the agent's ability to track the global task state and impair subsequent reasoning and decision-making. Inspired by the way humans handle complex tasks through subgoal decomposition and completed progress summarization, we propose Hierarchical Planning and Information Folding (HIPIF) for long-horizon LLM agent learning. HIPIF trains the agent end-to-end to organize long-horizon execution around explicit subgoals while folding completed subgoal histories to reduce long-context interference. Furthermore, to stabilize subgoal-based planning and execution, HIPIF combines hierarchical reflection and subgoal-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task-specific expert trajectories. Extensive experiments on three publicly available agentic benchmarks demonstrate the validity of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Hierarchical Planning and Information Folding (HIPIF) for long-horizon LLM agent learning. The method trains agents end-to-end to decompose tasks into explicit subgoals, fold completed subgoal histories to mitigate long-context interference, and stabilize planning via hierarchical reflection combined with subgoal-oriented process rewards. It claims this avoids reliance on auxiliary models or expert trajectories and demonstrates validity through experiments on three public agentic benchmarks.

Significance. If the folding mechanism can be shown to preserve global state details while reducing interference, the approach would offer a practical way to scale LLM agents to longer horizons without external supervision, building on hierarchical RL ideas but addressing context management directly. The end-to-end training and benchmark results, if reproducible with ablations, would strengthen claims about stable subgoal generation and execution.

major comments (3)
  1. [§3.2] §3.2 (Information Folding): The central claim that folding removes only non-essential content while preserving details needed for future reasoning and global state tracking lacks a concrete mechanism (e.g., selection criteria, learned compressor, or rule-based summary) or proof that irreversible compression does not discard facts from subgoal i that affect subgoal k >> i. This is load-bearing for the long-context interference solution.
  2. [§4] §4 (Experiments): No ablation isolating the folding component is reported; without it, performance gains on the three benchmarks cannot be attributed to folding versus hierarchical reflection or process rewards alone, weakening the claim that folding specifically reduces interference.
  3. [§3.3] §3.3 (Hierarchical Reflection and Process Rewards): The process reward formulation and how it guides subgoal transition/execution without task-specific tuning is underspecified; if the rewards are derived from the same LLM, the stability claim risks circularity in the absence of external verification.
minor comments (2)
  1. [Abstract] Abstract and §1: The phrase 'demonstrate the validity of our method' is vague; replace with specific metrics or improvements over baselines.
  2. Notation: Define all acronyms (e.g., HIPIF components) on first use and ensure consistent use of 'subgoal' vs. 'sub-goal' throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications where possible and committing to revisions that strengthen the manuscript without overstating our current results.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Information Folding): The central claim that folding removes only non-essential content while preserving details needed for future reasoning and global state tracking lacks a concrete mechanism (e.g., selection criteria, learned compressor, or rule-based summary) or proof that irreversible compression does not discard facts from subgoal i that affect subgoal k >> i. This is load-bearing for the long-context interference solution.

    Authors: We agree that §3.2 would benefit from greater specificity. The folding operation in HIPIF is implemented as a deterministic, rule-based procedure that retains only the subgoal outcome, key state deltas reported by the environment, and a one-sentence summary of execution status; all intermediate LLM-generated thoughts and action traces are discarded. We will expand the section with pseudocode and a worked example demonstrating that global state variables required for later subgoals are explicitly preserved by the rule set. A formal proof that no critical fact is ever lost is not feasible without task-specific assumptions, so we will instead emphasize the empirical safeguards (hierarchical reflection and environment-verified rewards) that mitigate any residual risk. revision: yes

  2. Referee: [§4] §4 (Experiments): No ablation isolating the folding component is reported; without it, performance gains on the three benchmarks cannot be attributed to folding versus hierarchical reflection or process rewards alone, weakening the claim that folding specifically reduces interference.

    Authors: The referee correctly identifies a missing control. We will add a new ablation in the revised §4 that compares the full HIPIF model against an otherwise identical variant that retains complete subgoal histories (no folding). Results will be reported on all three benchmarks together with statistical significance tests, allowing readers to isolate the contribution of the folding mechanism. revision: yes

  3. Referee: [§3.3] §3.3 (Hierarchical Reflection and Process Rewards): The process reward formulation and how it guides subgoal transition/execution without task-specific tuning is underspecified; if the rewards are derived from the same LLM, the stability claim risks circularity in the absence of external verification.

    Authors: We will clarify in the revision that process rewards are computed directly from environment feedback (binary success/failure signals on each declared subgoal plus a small shaping term based on progress metrics provided by the benchmark environments). The LLM is used only for reflection and next-subgoal proposal; it does not generate the reward values. This external grounding removes the circularity concern. The exact reward equation and the fact that no task-specific tuning is applied beyond the shared environment interface will be stated explicitly in §3.3. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces HIPIF as an end-to-end trained method using subgoal organization, information folding, hierarchical reflection, and process rewards, evaluated on three external public benchmarks without reliance on auxiliary models or expert trajectories. The provided abstract and description contain no equations, fitted parameters presented as predictions, self-citations that bear the central claim, or uniqueness theorems imported from prior author work. The derivation chain is self-contained against external benchmarks rather than reducing to internal redefinitions or self-referential fits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, mathematical axioms, or new invented entities are stated. The approach implicitly relies on standard assumptions of LLM fine-tuning and reinforcement learning from the broader literature.

pith-pipeline@v0.9.1-grok · 5758 in / 1125 out tokens · 24638 ms · 2026-06-27T13:02:55.603513+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 13 canonical work pages · 8 internal anchors

  1. [1]

    Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024

  2. [2]

    Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

  3. [3]

    Agent-flan: Designing data and methods of effective agent tuning for large language models

    Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 9354–9366, 2024

  4. [4]

    Group-in-group policy optimization for llm agent training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  5. [5]

    Segment policy optimization: Effective segment-level credit assignment in rl for large language models

    Yiran Guo, Lijie Xu, Jie Liu, Ye Dan, and Shuang Qiu. Segment policy optimization: Effective segment-level credit assignment in rl for large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  6. [6]

    Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model

    Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32779–32798, 2025

  7. [7]

    Divide and conquer: Grounding llms as efficient decision-making agents via offline hierarchical reinforcement learning

    Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, and Yu Cheng. Divide and conquer: Grounding llms as efficient decision-making agents via offline hierarchical reinforcement learning. InInternational Conference on Machine Learning, pages 24570–24590. PMLR, 2025

  8. [8]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  9. [9]

    Pre-trained language models for interactive decision-making

    Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive decision-making. InAdvances in Neural Information Processing Systems, 2022

  10. [10]

    Deepagent: A general reasoning agent with scalable toolsets

    Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, et al. Deepagent: A general reasoning agent with scalable toolsets. InProceedings of the ACM Web Conference 2026, pages 2219–2230, 2026

  11. [11]

    Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks.Advances in Neural Information Processing Systems, 36:23813–23825, 2023

    Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks.Advances in Neural Information Processing Systems, 36:23813–23825, 2023

  12. [12]

    Gradient coupling: The hidden barrier to generalization in agentic reinforcement learning

    Jingyu Liu, Xiaopeng Wu, Jingquan Peng, Kehan Chen, Chuan Yu, Lizhong Ding, and Yong Liu. Gradient coupling: The hidden barrier to generalization in agentic reinforcement learning. arXiv preprint arXiv:2509.23870, 2025

  13. [13]

    Piper: Benchmarking and prompting event reasoning boundary of llms via debiasing-distillation enhanced tuning

    Zhicong Lu, Changyuan Tian, PeiguangLi PeiguangLi, Li Jin, Sirui Wang, Wei Jia, Ying Shen, and Guangluan Xu. Piper: Benchmarking and prompting event reasoning boundary of llms via debiasing-distillation enhanced tuning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28591–28613, 2025

  14. [14]

    Hisr: Hindsight information modulated segmental process rewards for multi-turn agentic reinforcement learning.arXiv preprint arXiv:2603.18683, 2026

    Zhicong Lu, Zichuan Lin, Wei Jia, Changyuan Tian, Deheng Ye, Peiguang Li, Li Jin, Nayu Liu, Guangluan Xu, and Wei Feng. Hisr: Hindsight information modulated segmental process rewards for multi-turn agentic reinforcement learning.arXiv preprint arXiv:2603.18683, 2026. 10

  15. [15]

    When do transformers shine in rl? decoupling memory from credit assignment.Advances in Neural Information Processing Systems, 36:50429–50452, 2023

    Tianwei Ni, Michel Ma, Benjamin Eysenbach, and Pierre-Luc Bacon. When do transformers shine in rl? decoupling memory from credit assignment.Advances in Neural Information Processing Systems, 36:50429–50452, 2023

  16. [16]

    HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

    Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, and Mingyi Hong. Hiper: Hierarchical reinforcement learning with explicit credit assignment for large language model agents.arXiv preprint arXiv:2602.16165, 2026

  17. [17]

    Virtualhome: Simulating household activities via programs

    Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 8494–8502, 2018

  18. [18]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  19. [19]

    Vlm agents generate their own memories: Distilling experience into embodied programs of thought.Advances in Neural Information Processing Systems, 37:75942–75985, 2024

    Gabriel Sarch, Lawrence Jang, Michael J Tarr, William W Cohen, Kenneth Marino, and Katerina Fragkiadaki. Vlm agents generate their own memories: Distilling experience into embodied programs of thought.Advances in Neural Information Processing Systems, 37:75942–75985, 2024

  20. [20]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  21. [21]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  22. [22]

    Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  23. [23]

    Alfworld: Aligning text and embodied environments for interactive learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. InInternational Conference on Learning Representations

  24. [24]

    Alfred: A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mot- taghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020

  25. [25]

    arXiv preprint arXiv:2403.02502 , year=

    Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization for llm agents.arXiv preprint arXiv:2403.02502, 2024

  26. [26]

    Scaling long-horizon LLM agent via context-folding.arXiv preprint arXiv:2510.11967, 2025

    Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scaling long-horizon llm agent via context-folding.arXiv preprint arXiv:2510.11967, 2025

  27. [27]

    Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning.Artificial intelligence, 112(1-2): 181–211, 1999

    Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning.Artificial intelligence, 112(1-2): 181–211, 1999

  28. [28]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  29. [29]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  30. [30]

    V oyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research. 11

  31. [31]

    arXiv preprint arXiv:2505.20732 , year=

    Hanlin Wang, Chak Tou Leong, Jiashuo Wang, Jian Wang, and Wenjie Li. Spa-rl: Reinforcing llm agents via stepwise progress attribution.arXiv preprint arXiv:2505.20732, 2025

  32. [32]

    Steca: Step-level trajectory calibration for llm agent learning

    Hanlin Wang, Jian Wang, Chak Tou Leong, and Wenjie Li. Steca: Step-level trajectory calibration for llm agent learning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 11597–11614, 2025

  33. [33]

    Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022

    Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022

  34. [34]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.arXiv preprint arXiv:2504.20073, 2025

  35. [35]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  36. [36]

    A-mem: Agentic memory for LLM agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=FiM0M8gcct

  37. [37]

    Language agents with reinforcement learning for strategic play in the werewolf game

    Zelai Xu, Chao Yu, Fei Fang, Yu Wang, and Yi Wu. Language agents with reinforcement learning for strategic play in the werewolf game. InInternational Conference on Machine Learning, pages 55434–55464. PMLR, 2024

  38. [38]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024

  39. [39]

    Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

  40. [40]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

  41. [41]

    Agentfold: Long-horizon web agents with proactive context folding

    Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Jingren Zhou, Siheng Chen, and Yong Jiang. Agentfold: Long-horizon web agents with proactive context folding. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview....

  42. [42]

    A survey on the memory mechanism of large language model-based agents

    Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025

  43. [43]

    Epo: Hierarchical llm agents with environment preference optimization

    Qi Zhao, Haotian Fu, Chen Sun, and George Konidaris. Epo: Hierarchical llm agents with environment preference optimization. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6401–6415, 2024

  44. [44]

    Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

    Shuai Zhen, Yanhua Yu, Ruopei Guo, Nan Cheng, and Yang Deng. Hierarchical reinforcement learning with augmented step-level transitions for llm agents.arXiv preprint arXiv:2604.05808, 2026

  45. [45]

    Archer: Training language model agents via hierarchical multi-turn rl

    Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. Archer: Training language model agents via hierarchical multi-turn rl. InForty-first International Conference on Machine Learning,

  46. [46]

    Nothing happens

    Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents. InFirst Workshop on Multi-Turn Interactions in Large Language Models, . 12 A Datasets Details ALFWorld.ALFWorld [ 23] is an embodied text-based ...