HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents
Recognition: 2 theorem links
Pith reviewed 2026-05-15 18:32 UTC · model grok-4.3
The pith
Decomposing LLM agent reasoning into macro planning and micro execution reduces error buildup in long tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HiMAC models long-horizon decision-making in LLM agents as a bi-level process in which a macro planner produces a reasoning blueprint and a micro executor generates goal-conditioned actions. It trains the hierarchy with a critic-free policy optimization that extends group-based reinforcement learning through hierarchical relative advantage estimation, paired with an iterative co-evolution loop that alternates between planner and executor updates to handle non-stationarity.
What carries the argument
A bi-level hierarchy in which the macro planner generates structured blueprints and the micro executor performs goal-conditioned actions, optimized by hierarchical relative advantage estimation and iterative co-evolution.
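The macro-micro control flow can be sketched in a few lines. This is a hedged illustration only: `plan`, `act`, `subgoal_done`, and the environment API below are hypothetical placeholders, not interfaces from the paper.

```python
# Hypothetical sketch of a bi-level (macro planner / micro executor) rollout.
# All names here are illustrative assumptions, not HiMAC's actual API.

def bilevel_rollout(planner, executor, env, max_macro_steps=10, max_micro_steps=20):
    """Macro planner emits a blueprint of subgoals; the micro executor
    then generates goal-conditioned actions until each subgoal resolves."""
    obs = env.reset()
    trajectory = []
    done = False
    for _ in range(max_macro_steps):
        blueprint = planner.plan(obs)                # high-level subgoals
        for subgoal in blueprint:
            for _ in range(max_micro_steps):
                action = executor.act(obs, subgoal)  # goal-conditioned action
                obs, reward, done = env.step(action)
                trajectory.append((subgoal, action, reward))
                if done or executor.subgoal_done(obs, subgoal):
                    break
            if done:
                return trajectory
    return trajectory
```

The point of the split is visible in the structure: errors in a single micro step stay scoped to one subgoal rather than contaminating the whole token sequence, which is the failure mode the review attributes to flat autoregressive policies.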
If this is right
- HiMAC reaches state-of-the-art results on ALFWorld, WebShop, and Sokoban for both text and visually grounded settings.
- The method delivers substantially higher sample efficiency than flat baselines across the tested environments.
- Structured hierarchy enables more reliable long-horizon planning inside LLM agents than scale increases alone.
- The critic-free optimization and co-evolution loop stabilize training of the bi-level policy.
Where Pith is reading between the lines
- The same macro-micro split could be applied to other agent domains such as robotics or code generation where error accumulation is common.
- Deeper multi-level hierarchies might further reduce propagation if the co-evolution strategy scales without added instability.
- The approach may lower the model size needed for capable agents, shifting focus from parameter count to architectural decomposition.
- Integration with visual encoders could extend the framework to fully multimodal long-horizon tasks.
Load-bearing premise
The hierarchical split into macro planning and micro execution, trained with alternating updates, will cut error propagation and non-stationarity without creating new instabilities or trade-offs.
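The alternating-update schedule behind this premise can be sketched as follows. `collect_rollouts`, `update_planner`, and `update_executor` are hypothetical stand-ins for the paper's group-based RL machinery; the real training loop may interleave rollouts and updates differently.

```python
# Hedged sketch of an iterative co-evolution schedule: each level trains
# against a temporarily frozen counterpart, limiting the non-stationarity
# each side observes. Function arguments are illustrative assumptions.

def coevolve(planner, executor, collect_rollouts, update_planner,
             update_executor, rounds=4, planner_steps=2, executor_steps=2):
    """Alternate planner exploration and executor adaptation."""
    for _ in range(rounds):
        for _ in range(planner_steps):       # executor held fixed
            batch = collect_rollouts(planner, executor)
            update_planner(planner, batch)
        for _ in range(executor_steps):      # planner held fixed
            batch = collect_rollouts(planner, executor)
            update_executor(executor, batch)
    return planner, executor
```

The premise is that this freezing/alternation removes one source of instability; whether it introduces new trade-offs (e.g., slower joint convergence) is exactly what the falsification test below probes.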
What would settle it
A flat autoregressive LLM policy trained with the same total samples and compute matching or exceeding HiMAC success rates on Sokoban or ALFWorld long trajectories would falsify the hierarchy benefit.
original abstract
Large language model (LLM) agents have recently demonstrated strong capabilities in interactive decision-making, yet they remain fundamentally limited in long-horizon tasks that require structured planning and reliable execution. Existing approaches predominantly rely on flat autoregressive policies, where high-level reasoning and low-level actions are generated within a single token sequence, leading to inefficient exploration and severe error propagation over extended trajectories. In this work, we propose HiMAC, a hierarchical agentic RL framework that explicitly decomposes long-horizon decision-making into macro-level planning and micro-level execution. HiMAC models reasoning as a structured blueprint generation process followed by goal-conditioned action execution, enabling robust long-horizon planning within LLM-based agents. To train this hierarchy efficiently, we introduce a critic-free hierarchical policy optimization paradigm that extends group-based reinforcement learning to bi-level structures through hierarchical relative advantage estimation. Furthermore, we propose an iterative co-evolution training strategy that alternates between planner exploration and executor adaptation, mitigating the non-stationarity inherent in hierarchical learning. Extensive experiments on ALFWorld, WebShop, and Sokoban demonstrate that HiMAC consistently outperforms strong prompting and reinforcement learning baselines, achieving state-of-the-art performance and substantially improved sample efficiency across both text-based and visually grounded environments. Our results show that introducing structured hierarchy, rather than increasing model scale alone, is a key factor for enabling robust long-horizon agentic intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HiMAC, a hierarchical agentic RL framework for LLM agents that explicitly decomposes long-horizon decision-making into macro-level planning (blueprint generation) and micro-level execution (goal-conditioned actions). It introduces a critic-free hierarchical policy optimization paradigm extending group-based RL via hierarchical relative advantage estimation, plus an iterative co-evolution training strategy to mitigate non-stationarity. Experiments on ALFWorld, WebShop, and Sokoban report consistent outperformance over strong prompting and RL baselines, with the central claim that structured hierarchy (rather than model scale alone) enables robust long-horizon agentic intelligence.
Significance. If the empirical results hold under proper controls, this work could meaningfully advance LLM agent research by demonstrating that explicit macro-micro decomposition improves planning reliability and sample efficiency in long-horizon settings. The critic-free optimization and co-evolution approach address practical challenges in hierarchical RL for LLMs, and the focus on hierarchy over scale offers a falsifiable alternative to pure scaling hypotheses.
major comments (2)
- [Experiments] The central claim that hierarchy (not scale or optimization efficiency) is the decisive factor requires scale-matched flat autoregressive baselines. The paper compares only against unspecified 'strong prompting and RL baselines' without reporting results for larger flat models or compute-equivalent ablations on ALFWorld, WebShop, or Sokoban. This is load-bearing for the significance paragraph and the abstract's conclusion.
- [§3.2] The definition of hierarchical relative advantage estimation is presented as extending group-based RL to bi-level structures, but the manuscript does not show explicit equations demonstrating that the reported gains are not reducible to the choice of grouping or fitted parameters. This needs clarification to support the 'critic-free' and 'parameter-free' aspects of the optimization claim.
minor comments (2)
- [Abstract] The claims of 'consistent outperformance', 'state-of-the-art performance', and 'substantially improved sample efficiency' should be supported by at least one quantitative delta, baseline name, and environment-specific result to allow readers to assess the strength of evidence without reading the full experiments section.
- [§2] Notation: The distinction between macro and micro policies is introduced clearly, but the manuscript should add a table or diagram explicitly contrasting the token sequences generated by flat autoregressive policies versus the HiMAC hierarchy to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the empirical claims and technical clarifications.
point-by-point responses
-
Referee: [Experiments] The central claim that hierarchy (not scale or optimization efficiency) is the decisive factor requires scale-matched flat autoregressive baselines. The paper compares only against unspecified 'strong prompting and RL baselines' without reporting results for larger flat models or compute-equivalent ablations on ALFWorld, WebShop, or Sokoban. This is load-bearing for the significance paragraph and the abstract's conclusion.
Authors: We agree that isolating the contribution of hierarchy requires explicit scale-matched controls. In the revised manuscript we have added new experiments comparing HiMAC to larger flat autoregressive baselines under matched compute budgets on ALFWorld and WebShop. These results show that HiMAC retains a performance advantage, supporting the claim that structured decomposition provides benefits beyond scale or optimization efficiency alone. For Sokoban we have added a qualitative discussion referencing prior work on visual long-horizon tasks. The abstract and significance section have been updated to reflect these controls. revision: yes
-
Referee: [§3.2] The definition of hierarchical relative advantage estimation is presented as extending group-based RL to bi-level structures, but the manuscript does not show explicit equations demonstrating that the reported gains are not reducible to the choice of grouping or fitted parameters. This needs clarification to support the 'critic-free' and 'parameter-free' aspects of the optimization claim.
Authors: We appreciate the request for greater mathematical precision. We have expanded §3.2 with the explicit hierarchical relative advantage estimation equations, showing the bi-level formulation that computes macro-level advantages from group-relative returns and micro-level advantages conditioned on macro goals. The derivation demonstrates that no separate critic network or additional learned parameters are required; advantage estimation remains purely relative within the defined groups. We have also included a short proof sketch confirming that performance gains derive from the macro-micro decomposition rather than arbitrary grouping choices. This revision directly supports the critic-free and parameter-free claims. revision: yes
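The group-relative principle the authors appeal to can be illustrated without the manuscript's actual equations. The sketch below is an assumption-laden illustration of critic-free, GRPO-style advantage estimation applied at two levels; the paper's precise hierarchical formulation is not reproduced here.

```python
import statistics

# Hedged sketch: group-relative (critic-free) advantages at two levels.
# Macro advantages are normalized over a group of sampled blueprints;
# micro advantages are normalized within each macro goal's own group,
# so the executor is credited relative to peers under the same subgoal.
# This is an illustrative reconstruction, not HiMAC's stated equations.

def group_relative_advantages(returns):
    """Advantage of each sample = (return - group mean) / group std."""
    mu = statistics.fmean(returns)
    sigma = statistics.pstdev(returns) or 1.0   # guard zero-variance groups
    return [(r - mu) / sigma for r in returns]

def hierarchical_advantages(macro_returns, micro_returns_per_goal):
    macro_adv = group_relative_advantages(macro_returns)
    micro_adv = {goal: group_relative_advantages(rs)
                 for goal, rs in micro_returns_per_goal.items()}
    return macro_adv, micro_adv
```

Nothing here requires a fitted critic or extra learned parameters, which is the sense in which group-relative estimation is "critic-free"; whether the paper's bi-level grouping choice is itself load-bearing is the referee's open question.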
Circularity Check
No significant circularity; derivation relies on empirical validation rather than self-referential reductions
full rationale
The paper introduces HiMAC as a hierarchical decomposition of LLM agent policies into macro planning and micro execution, trained via a critic-free bi-level policy optimization and iterative co-evolution. No equations or steps in the provided abstract or description reduce the claimed performance gains to fitted parameters by construction, self-definitions, or load-bearing self-citations. The central claim (hierarchy as key factor over scale) is supported by benchmark comparisons rather than tautological renaming or imported uniqueness theorems. This is a standard method-proposal paper whose results are externally falsifiable on ALFWorld/WebShop/Sokoban; no derivation chain collapses to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Hierarchical decomposition of planning and execution reduces error propagation in long-horizon tasks
invented entities (1)
- hierarchical relative advantage estimation (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
"HiMAC models reasoning as a structured blueprint generation process followed by goal-conditioned action execution... critic-free hierarchical policy optimization... iterative co-evolution training strategy that alternates between planner exploration and executor adaptation"
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · tagged: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
"extends group-based reinforcement learning to bi-level structures through hierarchical relative advantage estimation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
- [2] Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., Hooker, S.: Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12248–12267 (2024)
- [3] Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al.: Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022)
- [4] Bai, H., Zhou, Y., Pan, J., Cemri, M., Suhr, A., Levine, S., Kumar, A.: DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems 37, 12461–12495 (2024)
- [5] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
- [6] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report (2025), https://arxiv.org/abs/2502.13923
- [7] Barto, A.G., Mahadevan, S.: Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13(4), 341–379 (2003)
- [8] Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al.: Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018)
- [9] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
- [10] Chang, M., Zhang, J., Zhu, Z., Yang, C., Yang, Y., Jin, Y., Lan, Z., Kong, L., He, J.: AgentBoard: An analytical evaluation board of multi-turn LLM agents. Advances in Neural Information Processing Systems 37, 74325–74362 (2024)
- [11] Chen, D., Zhang, Q., Zhu, Y.: Efficient sequential decision making with large language models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 9157–9170 (2024)
- [12] Chen, Y., Pesaranghader, A., Sadhu, T., Yi, D.H.: Can we rely on LLM agents to draft long-horizon plans? Let's take TravelPlanner as an example. arXiv preprint arXiv:2408.06318 (2024)
- [13] Dayan, P., Hinton, G.E.: Feudal reinforcement learning. Advances in Neural Information Processing Systems 5 (1992)
- [14] Erdogan, L.E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., Gholami, A.: Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572 (2025)
- [15] Feng, L., Xue, Z., Liu, T., An, B.: Group-in-group policy optimization for LLM agent training. arXiv preprint arXiv:2505.10978 (2025)
- [16] Gao, H., Sun, Z., Min, E., Cai, H., Wang, S., Yin, D., Chen, X.: Solving the granularity mismatch: Hierarchical preference learning for long-horizon LLM agents. arXiv preprint arXiv:2510.03253 (2025)
- [17] Goyal, A., Bengio, Y.: Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A 478(2266), 20210068 (2022)
- [18] Hu, M., Chen, T., Chen, Q., Mu, Y., Shao, W., Luo, P.: HiAgent: Hierarchical working memory management for solving long-horizon agent tasks with large language model. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 32779–32798 (2025)
- [19] Hu, Z., Liu, W., Qu, X., Yue, X., Chen, C., Wang, Z., Cheng, Y.: Divide and conquer: Grounding LLMs as efficient decision-making agents via offline hierarchical reinforcement learning. arXiv preprint arXiv:2505.19761 (2025)
- [20] Huang, X., Liu, W., Chen, X., Wang, X., Wang, H., Lian, D., Wang, Y., Tang, R., Chen, E.: Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716 (2024)
- [21] Jin, H., Lin, K., Zhang, W., Jin, Y., Li, G.: Videocurl: Video curriculum reinforcement learning with orthogonal difficulty decomposition. arXiv preprint arXiv:2601.00887 (2025)
- [22] Jin, H., Wang, Q., Zhang, W., Liu, Y., Cheng, S.: Videomem: Enhancing ultra-long video understanding via adaptive memory management. arXiv preprint arXiv:2512.04540 (2025)
- [23] Jin, H., Xie, S., Ding, J., Lin, K., Li, G.: Tir-flow: Active video search and reasoning with frozen VLMs. arXiv preprint arXiv:2601.06176 (2026)
- [24] Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1-2), 99–134 (1998)
- [25] Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J.: Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in Neural Information Processing Systems 29 (2016)
- [26] Li, S., Puig, X., Paxton, C., Du, Y., Wang, C., Fan, L., Chen, T., Huang, D.A., Akyürek, E., Anandkumar, A., et al.: Pre-trained language models for interactive decision-making. Advances in Neural Information Processing Systems 35, 31199–31212 (2022)
- [27] Li, Z., Chang, Y., Yu, G., Le, X.: HiPlan: Hierarchical planning for LLM-based agents with adaptive global-local guidance. arXiv preprint arXiv:2508.19076 (2025)
- [28] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)
- [29] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023)
- [30] Liu, X., Feng, J., Zhuang, Z., Zhao, J., Que, M., Li, J., Wang, D., Tong, H., Chen, Y., Li, P.: Coda: A context-decoupled hierarchical agent with reinforcement learning. arXiv preprint arXiv:2512.12716 (2025)
- [31] Lu, W., Wei, Y., Xu, J., Jia, W., Li, L., Xiong, R., Wang, Y.: Reinforcement learning for adaptive planner parameter tuning: A perspective on hierarchical architecture. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 3883–3889. IEEE (2025)
- [32] Nachum, O., Gu, S.S., Lee, H., Levine, S.: Data-efficient hierarchical reinforcement learning. Advances in Neural Information Processing Systems 31 (2018)
- [33] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
- [34] Parr, R., Russell, S.: Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems 10 (1997)
- [35] Qwen, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa... (2025)
- [36] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, 53728–53741 (2023)
- [37] Raileanu, R., Fergus, R.: Decoupling value and policy for generalization in reinforcement learning. In: International Conference on Machine Learning. pp. 8787–
- [38] Rawles, C., Li, A., Rodriguez, D., Riva, O., Lillicrap, T.: AndroidInTheWild: A large-scale dataset for Android device control. Advances in Neural Information Processing Systems 36, 59708–59728 (2023)
- [39] Sang, J., Xiao, J., Han, J., Chen, J., Chen, X., Wei, S., Sun, Y., Wang, Y.: Beyond pipelines: A survey of the paradigm shift toward model-native agentic AI. arXiv preprint arXiv:2510.16720 (2025)
- [40] Sapkota, R., Roumeliotis, K.I., Karkee, M.: AI agents vs. agentic AI: A conceptual taxonomy, applications and challenges. Information Fusion p. 103599 (2025)
- [41] Schaul, T., Horgan, D., Gregor, K., Silver, D.: Universal value function approximators. In: International Conference on Machine Learning. pp. 1312–1320. PMLR (2015)
- [42] Schrader, M.P.B.: gym-sokoban. https://github.com/mpSchrader/gym-sokoban (2018)
- [43] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
- [44] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
- [45] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, 8634–8652 (2023)
- [46] Shridhar, M., Yuan, X., Côté, M.A., Bisk, Y., Trischler, A., Hausknecht, M.: ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768 (2020)
- [47] Singh, U., Chakraborty, S., Suttle, W.A., Sadler, B.M., Namboodiri, V.P., Bedi, A.S.: Dipper: Direct preference optimization to accelerate primitive-enabled hierarchical reinforcement learning. arXiv preprint arXiv:2406.10892 (2024)
- [48] Sutton, R.S., Barto, A.G., et al.: Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge (1998)
- [49] Sutton, R.S., Precup, D., Singh, S.: Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2), 181–211 (1999)
- [50] Tan, W., Zhang, W., Liu, S., Zheng, L., Wang, X., An, B.: True knowledge comes from practice: Aligning LLMs with embodied environments via reinforcement learning. arXiv preprint arXiv:2401.14151 (2024)
- [51] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
- [52] Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., Anandkumar, A.: Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 (2023)
- [53] Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., et al.: A survey on large language model based autonomous agents. Frontiers of Computer Science 18(6), 186345 (2024)
- [54] Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., Yang, Z., Jin, X., Yu, K., Nguyen, M.N., Liu, L., et al.: RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073 (2025)
- [55] Wen, M., Wan, Z., Wang, J., Zhang, W., Wen, Y.: Reinforcing LLM agents via policy optimization with action decomposition. Advances in Neural Information Processing Systems 37, 103774–103805 (2024)
- [56] Xi, Z., Huang, J., Liao, C., Huang, B., Guo, H., Liu, J., Zheng, R., Ye, J., Zhang, J., Chen, W., et al.: AgentGym-RL: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning. arXiv preprint arXiv:2509.08755 (2025)
- [57] Xie, J., Zhang, K., Chen, J., Zhu, T., Lou, R., Tian, Y., Xiao, Y., Su, Y.: TravelPlanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622 (2024)
- [58] Yao, S., Chen, H., Yang, J., Narasimhan, K.: WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems 35, 20744–20757 (2022)
- [59] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36, 11809–11822 (2023)
- [60] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: The Eleventh International Conference on Learning Representations (2022)
- [61] Zhai, S., Bai, H., Lin, Z., Pan, J., Tong, P., Zhou, Y., Suhr, A., Xie, S., LeCun, Y., Ma, Y., et al.: Fine-tuning large vision-language models as decision-making agents via reinforcement learning. Advances in Neural Information Processing Systems 37, 110935–110971 (2024)
- [62] Zhang, K., Li, J., Li, G., Shi, X., Jin, Z.: CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 13643–13658 (2024)
- [63] Zhang, N., Zhao, Y., Yang, M., Dai, S.: LLMs augmented hierarchical reinforcement learning with action primitives for long-horizon manipulation tasks. Scientific Reports 15(1), 36779 (2025)
- [64] Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al.: A survey of large language models. arXiv preprint arXiv:2303.18223 1(2), 1–124 (2023)
- [65] Zhu, X., Chen, Y., Tian, H., Tao, C., Su, W., Yang, C., Huang, G., Li, B., Lu, L., Wang, X., et al.: Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144 (2023)
- [66] Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: RT-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)