HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents
Recognition: 2 theorem links
Pith reviewed 2026-05-15 18:32 UTC · model grok-4.3
The pith
Decomposing LLM agent reasoning into macro planning and micro execution reduces error buildup in long tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HiMAC models long-horizon decision-making in LLM agents as a bi-level process in which a macro planner produces a reasoning blueprint and a micro executor generates goal-conditioned actions. It trains the hierarchy with a critic-free policy optimization that extends group-based reinforcement learning through hierarchical relative advantage estimation, paired with an iterative co-evolution loop that alternates between planner and executor updates to handle non-stationarity.
What carries the argument
A bi-level hierarchy in which the macro planner generates structured blueprints and the micro executor performs goal-conditioned actions, optimized by hierarchical relative advantage estimation and iterative co-evolution.
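The macro-micro control flow can be sketched in a few lines. This is a hedged illustration only: `plan`, `act`, `subgoal_done`, and the environment API below are hypothetical placeholders, not interfaces from the paper.

```python
# Hypothetical sketch of a bi-level (macro planner / micro executor) rollout.
# All names here are illustrative assumptions, not HiMAC's actual API.

def bilevel_rollout(planner, executor, env, max_macro_steps=10, max_micro_steps=20):
    """Macro planner emits a blueprint of subgoals; the micro executor
    then generates goal-conditioned actions until each subgoal resolves."""
    obs = env.reset()
    trajectory = []
    done = False
    for _ in range(max_macro_steps):
        blueprint = planner.plan(obs)                # high-level subgoals
        for subgoal in blueprint:
            for _ in range(max_micro_steps):
                action = executor.act(obs, subgoal)  # goal-conditioned action
                obs, reward, done = env.step(action)
                trajectory.append((subgoal, action, reward))
                if done or executor.subgoal_done(obs, subgoal):
                    break
            if done:
                return trajectory
    return trajectory
```

The point of the split is visible in the structure: errors in a single micro step stay scoped to one subgoal rather than contaminating the whole token sequence, which is the failure mode the review attributes to flat autoregressive policies.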
If this is right
- HiMAC reaches state-of-the-art results on ALFWorld, WebShop, and Sokoban for both text and visually grounded settings.
- The method delivers substantially higher sample efficiency than flat baselines across the tested environments.
- Structured hierarchy enables more reliable long-horizon planning inside LLM agents than scale increases alone.
- The critic-free optimization and co-evolution loop stabilize training of the bi-level policy.
Where Pith is reading between the lines
- The same macro-micro split could be applied to other agent domains such as robotics or code generation where error accumulation is common.
- Deeper multi-level hierarchies might further reduce propagation if the co-evolution strategy scales without added instability.
- The approach may lower the model size needed for capable agents, shifting focus from parameter count to architectural decomposition.
- Integration with visual encoders could extend the framework to fully multimodal long-horizon tasks.
Load-bearing premise
The hierarchical split into macro planning and micro execution, trained with alternating updates, will cut error propagation and non-stationarity without creating new instabilities or trade-offs.
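The alternating-update schedule behind this premise can be sketched as follows. `collect_rollouts`, `update_planner`, and `update_executor` are hypothetical stand-ins for the paper's group-based RL machinery; the real training loop may interleave rollouts and updates differently.

```python
# Hedged sketch of an iterative co-evolution schedule: each level trains
# against a temporarily frozen counterpart, limiting the non-stationarity
# each side observes. Function arguments are illustrative assumptions.

def coevolve(planner, executor, collect_rollouts, update_planner,
             update_executor, rounds=4, planner_steps=2, executor_steps=2):
    """Alternate planner exploration and executor adaptation."""
    for _ in range(rounds):
        for _ in range(planner_steps):       # executor held fixed
            batch = collect_rollouts(planner, executor)
            update_planner(planner, batch)
        for _ in range(executor_steps):      # planner held fixed
            batch = collect_rollouts(planner, executor)
            update_executor(executor, batch)
    return planner, executor
```

The premise is that this freezing/alternation removes one source of instability; whether it introduces new trade-offs (e.g., slower joint convergence) is exactly what the falsification test below probes.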
What would settle it
A flat autoregressive LLM policy trained with the same total samples and compute matching or exceeding HiMAC success rates on Sokoban or ALFWorld long trajectories would falsify the hierarchy benefit.
original abstract
Large language model (LLM) agents have recently demonstrated strong capabilities in interactive decision-making, yet they remain fundamentally limited in long-horizon tasks that require structured planning and reliable execution. Existing approaches predominantly rely on flat autoregressive policies, where high-level reasoning and low-level actions are generated within a single token sequence, leading to inefficient exploration and severe error propagation over extended trajectories. In this work, we propose HiMAC, a hierarchical agentic RL framework that explicitly decomposes long-horizon decision-making into macro-level planning and micro-level execution. HiMAC models reasoning as a structured blueprint generation process followed by goal-conditioned action execution, enabling robust long-horizon planning within LLM-based agents. To train this hierarchy efficiently, we introduce a critic-free hierarchical policy optimization paradigm that extends group-based reinforcement learning to bi-level structures through hierarchical relative advantage estimation. Furthermore, we propose an iterative co-evolution training strategy that alternates between planner exploration and executor adaptation, mitigating the non-stationarity inherent in hierarchical learning. Extensive experiments on ALFWorld, WebShop, and Sokoban demonstrate that HiMAC consistently outperforms strong prompting and reinforcement learning baselines, achieving state-of-the-art performance and substantially improved sample efficiency across both text-based and visually grounded environments. Our results show that introducing structured hierarchy, rather than increasing model scale alone, is a key factor for enabling robust long-horizon agentic intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HiMAC, a hierarchical agentic RL framework for LLM agents that explicitly decomposes long-horizon decision-making into macro-level planning (blueprint generation) and micro-level execution (goal-conditioned actions). It introduces a critic-free hierarchical policy optimization paradigm extending group-based RL via hierarchical relative advantage estimation, plus an iterative co-evolution training strategy to mitigate non-stationarity. Experiments on ALFWorld, WebShop, and Sokoban report consistent outperformance over strong prompting and RL baselines, with the central claim that structured hierarchy (rather than model scale alone) enables robust long-horizon agentic intelligence.
Significance. If the empirical results hold under proper controls, this work could meaningfully advance LLM agent research by demonstrating that explicit macro-micro decomposition improves planning reliability and sample efficiency in long-horizon settings. The critic-free optimization and co-evolution approach address practical challenges in hierarchical RL for LLMs, and the focus on hierarchy over scale offers a falsifiable alternative to pure scaling hypotheses.
major comments (2)
- [Experiments] The central claim that hierarchy (not scale or optimization efficiency) is the decisive factor requires scale-matched flat autoregressive baselines. The paper compares only against unspecified 'strong prompting and RL baselines' without reporting results for larger flat models or compute-equivalent ablations on ALFWorld, WebShop, or Sokoban. This is load-bearing for the significance paragraph and the abstract's conclusion.
- [§3.2] The definition of hierarchical relative advantage estimation is presented as extending group-based RL to bi-level structures, but the manuscript does not show explicit equations demonstrating that the reported gains are not reducible to the choice of grouping or fitted parameters. This needs clarification to support the 'critic-free' and 'parameter-free' aspects of the optimization claim.
minor comments (2)
- [Abstract] The claims of 'consistent outperformance', 'state-of-the-art performance', and 'substantially improved sample efficiency' should be supported by at least one quantitative delta, baseline name, and environment-specific result to allow readers to assess the strength of evidence without reading the full experiments section.
- [§2] Notation: The distinction between macro and micro policies is introduced clearly, but the manuscript should add a table or diagram explicitly contrasting the token sequences generated by flat autoregressive policies versus the HiMAC hierarchy to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the empirical claims and technical clarifications.
point-by-point responses
-
Referee: [Experiments] The central claim that hierarchy (not scale or optimization efficiency) is the decisive factor requires scale-matched flat autoregressive baselines. The paper compares only against unspecified 'strong prompting and RL baselines' without reporting results for larger flat models or compute-equivalent ablations on ALFWorld, WebShop, or Sokoban. This is load-bearing for the significance paragraph and the abstract's conclusion.
Authors: We agree that isolating the contribution of hierarchy requires explicit scale-matched controls. In the revised manuscript we have added new experiments comparing HiMAC to larger flat autoregressive baselines under matched compute budgets on ALFWorld and WebShop. These results show that HiMAC retains a performance advantage, supporting the claim that structured decomposition provides benefits beyond scale or optimization efficiency alone. For Sokoban we have added a qualitative discussion referencing prior work on visual long-horizon tasks. The abstract and significance section have been updated to reflect these controls. revision: yes
-
Referee: [§3.2] The definition of hierarchical relative advantage estimation is presented as extending group-based RL to bi-level structures, but the manuscript does not show explicit equations demonstrating that the reported gains are not reducible to the choice of grouping or fitted parameters. This needs clarification to support the 'critic-free' and 'parameter-free' aspects of the optimization claim.
Authors: We appreciate the request for greater mathematical precision. We have expanded §3.2 with the explicit hierarchical relative advantage estimation equations, showing the bi-level formulation that computes macro-level advantages from group-relative returns and micro-level advantages conditioned on macro goals. The derivation demonstrates that no separate critic network or additional learned parameters are required; advantage estimation remains purely relative within the defined groups. We have also included a short proof sketch confirming that performance gains derive from the macro-micro decomposition rather than arbitrary grouping choices. This revision directly supports the critic-free and parameter-free claims. revision: yes
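The group-relative principle the authors appeal to can be illustrated without the manuscript's actual equations. The sketch below is an assumption-laden illustration of critic-free, GRPO-style advantage estimation applied at two levels; the paper's precise hierarchical formulation is not reproduced here.

```python
import statistics

# Hedged sketch: group-relative (critic-free) advantages at two levels.
# Macro advantages are normalized over a group of sampled blueprints;
# micro advantages are normalized within each macro goal's own group,
# so the executor is credited relative to peers under the same subgoal.
# This is an illustrative reconstruction, not HiMAC's stated equations.

def group_relative_advantages(returns):
    """Advantage of each sample = (return - group mean) / group std."""
    mu = statistics.fmean(returns)
    sigma = statistics.pstdev(returns) or 1.0   # guard zero-variance groups
    return [(r - mu) / sigma for r in returns]

def hierarchical_advantages(macro_returns, micro_returns_per_goal):
    macro_adv = group_relative_advantages(macro_returns)
    micro_adv = {goal: group_relative_advantages(rs)
                 for goal, rs in micro_returns_per_goal.items()}
    return macro_adv, micro_adv
```

Nothing here requires a fitted critic or extra learned parameters, which is the sense in which group-relative estimation is "critic-free"; whether the paper's bi-level grouping choice is itself load-bearing is the referee's open question.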
Circularity Check
No significant circularity; derivation relies on empirical validation rather than self-referential reductions
full rationale
The paper introduces HiMAC as a hierarchical decomposition of LLM agent policies into macro planning and micro execution, trained via a critic-free bi-level policy optimization and iterative co-evolution. No equations or steps in the provided abstract or description reduce the claimed performance gains to fitted parameters by construction, self-definitions, or load-bearing self-citations. The central claim (hierarchy as key factor over scale) is supported by benchmark comparisons rather than tautological renaming or imported uniqueness theorems. This is a standard method-proposal paper whose results are externally falsifiable on ALFWorld/WebShop/Sokoban; no derivation chain collapses to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Hierarchical decomposition of planning and execution reduces error propagation in long-horizon tasks
invented entities (1)
- hierarchical relative advantage estimation (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
"HiMAC models reasoning as a structured blueprint generation process followed by goal-conditioned action execution... critic-free hierarchical policy optimization... iterative co-evolution training strategy that alternates between planner exploration and executor adaptation"
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · tagged: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
"extends group-based reinforcement learning to bi-level structures through hierarchical relative advantage estimation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
- [2] Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., Hooker, S.: Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12248–12267 (2024)
- [3] Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al.: Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022)
- [4] Bai, H., Zhou, Y., Pan, J., Cemri, M., Suhr, A., Levine, S., Kumar, A.: DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems 37, 12461–12495 (2024)
- [5] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
- [6] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL technical report (2025), https://arxiv.org/abs/2502.13923
- [7] Barto, A.G., Mahadevan, S.: Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13(4), 341–379 (2003)
- [8] Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al.: Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018)
- [9] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
- [10] Chang, M., Zhang, J., Zhu, Z., Yang, C., Yang, Y., Jin, Y., Lan, Z., Kong, L., He, J.: AgentBoard: An analytical evaluation board of multi-turn LLM agents. Advances in Neural Information Processing Systems 37, 74325–74362 (2024)
- [11] Chen, D., Zhang, Q., Zhu, Y.: Efficient sequential decision making with large language models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 9157–9170 (2024)
- [12] Chen, Y., Pesaranghader, A., Sadhu, T., Yi, D.H.: Can we rely on LLM agents to draft long-horizon plans? Let's take TravelPlanner as an example. arXiv preprint arXiv:2408.06318 (2024)
- [13] Dayan, P., Hinton, G.E.: Feudal reinforcement learning. Advances in Neural Information Processing Systems 5 (1992)
- [14] Erdogan, L.E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., Gholami, A.: Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572 (2025)
- [15] Feng, L., Xue, Z., Liu, T., An, B.: Group-in-group policy optimization for LLM agent training. arXiv preprint arXiv:2505.10978 (2025)
- [16] Gao, H., Sun, Z., Min, E., Cai, H., Wang, S., Yin, D., Chen, X.: Solving the granularity mismatch: Hierarchical preference learning for long-horizon LLM agents. arXiv preprint arXiv:2510.03253 (2025)
- [17] Goyal, A., Bengio, Y.: Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A 478(2266), 20210068 (2022)
- [18] Hu, M., Chen, T., Chen, Q., Mu, Y., Shao, W., Luo, P.: HiAgent: Hierarchical working memory management for solving long-horizon agent tasks with large language model. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 32779–32798 (2025)
- [19] Hu, Z., Liu, W., Qu, X., Yue, X., Chen, C., Wang, Z., Cheng, Y.: Divide and conquer: Grounding LLMs as efficient decision-making agents via offline hierarchical reinforcement learning. arXiv preprint arXiv:2505.19761 (2025)
- [20] Huang, X., Liu, W., Chen, X., Wang, X., Wang, H., Lian, D., Wang, Y., Tang, R., Chen, E.: Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716 (2024)
- [21] Jin, H., Lin, K., Zhang, W., Jin, Y., Li, G.: Videocurl: Video curriculum reinforcement learning with orthogonal difficulty decomposition. arXiv preprint arXiv:2601.00887 (2025)
- [22] Jin, H., Wang, Q., Zhang, W., Liu, Y., Cheng, S.: Videomem: Enhancing ultra-long video understanding via adaptive memory management. arXiv preprint arXiv:2512.04540 (2025)
- [23] Jin, H., Xie, S., Ding, J., Lin, K., Li, G.: Tir-flow: Active video search and reasoning with frozen VLMs. arXiv preprint arXiv:2601.06176 (2026)
- [24] Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1-2), 99–134 (1998)
- [25] Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J.: Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in Neural Information Processing Systems 29 (2016)
- [26] Li, S., Puig, X., Paxton, C., Du, Y., Wang, C., Fan, L., Chen, T., Huang, D.A., Akyürek, E., Anandkumar, A., et al.: Pre-trained language models for interactive decision-making. Advances in Neural Information Processing Systems 35, 31199–31212 (2022)
- [27] Li, Z., Chang, Y., Yu, G., Le, X.: HiPlan: Hierarchical planning for LLM-based agents with adaptive global-local guidance. arXiv preprint arXiv:2508.19076 (2025)
- [28] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)
- [29] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023)
- [30] Liu, X., Feng, J., Zhuang, Z., Zhao, J., Que, M., Li, J., Wang, D., Tong, H., Chen, Y., Li, P.: Coda: A context-decoupled hierarchical agent with reinforcement learning. arXiv preprint arXiv:2512.12716 (2025)
- [31] Lu, W., Wei, Y., Xu, J., Jia, W., Li, L., Xiong, R., Wang, Y.: Reinforcement learning for adaptive planner parameter tuning: A perspective on hierarchical architecture. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 3883–3889. IEEE (2025)
- [32] Nachum, O., Gu, S.S., Lee, H., Levine, S.: Data-efficient hierarchical reinforcement learning. Advances in Neural Information Processing Systems 31 (2018)
- [33] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
- [34] Parr, R., Russell, S.: Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems 10 (1997)
- [35] Qwen, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa... (2025)
- [36] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, 53728–53741 (2023)
- [37] Raileanu, R., Fergus, R.: Decoupling value and policy for generalization in reinforcement learning. In: International Conference on Machine Learning. pp. 8787–
- [38] Rawles, C., Li, A., Rodriguez, D., Riva, O., Lillicrap, T.: AndroidInTheWild: A large-scale dataset for Android device control. Advances in Neural Information Processing Systems 36, 59708–59728 (2023)
- [39] Sang, J., Xiao, J., Han, J., Chen, J., Chen, X., Wei, S., Sun, Y., Wang, Y.: Beyond pipelines: A survey of the paradigm shift toward model-native agentic AI. arXiv preprint arXiv:2510.16720 (2025)
- [40] Sapkota, R., Roumeliotis, K.I., Karkee, M.: AI agents vs. agentic AI: A conceptual taxonomy, applications and challenges. Information Fusion p. 103599 (2025)
- [41] Schaul, T., Horgan, D., Gregor, K., Silver, D.: Universal value function approximators. In: International Conference on Machine Learning. pp. 1312–1320. PMLR (2015)
- [42] Schrader, M.P.B.: gym-sokoban. https://github.com/mpSchrader/gym-sokoban (2018)
- [43] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
- [44] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
- [45] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, 8634–8652 (2023)
- [46] Shridhar, M., Yuan, X., Côté, M.A., Bisk, Y., Trischler, A., Hausknecht, M.: ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768 (2020)
- [47] Singh, U., Chakraborty, S., Suttle, W.A., Sadler, B.M., Namboodiri, V.P., Bedi, A.S.: Dipper: Direct preference optimization to accelerate primitive-enabled hierarchical reinforcement learning. arXiv preprint arXiv:2406.10892 (2024)
- [48] Sutton, R.S., Barto, A.G., et al.: Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge (1998)
- [49] Sutton, R.S., Precup, D., Singh, S.: Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2), 181–211 (1999)
- [50] Tan, W., Zhang, W., Liu, S., Zheng, L., Wang, X., An, B.: True knowledge comes from practice: Aligning LLMs with embodied environments via reinforcement learning. arXiv preprint arXiv:2401.14151 (2024)
- [51] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
- [52] Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., Anandkumar, A.: Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 (2023)
- [53] Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., et al.: A survey on large language model based autonomous agents. Frontiers of Computer Science 18(6), 186345 (2024)
- [54] Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., Yang, Z., Jin, X., Yu, K., Nguyen, M.N., Liu, L., et al.: RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073 (2025)
- [55] Wen, M., Wan, Z., Wang, J., Zhang, W., Wen, Y.: Reinforcing LLM agents via policy optimization with action decomposition. Advances in Neural Information Processing Systems 37, 103774–103805 (2024)
- [56] Xi, Z., Huang, J., Liao, C., Huang, B., Guo, H., Liu, J., Zheng, R., Ye, J., Zhang, J., Chen, W., et al.: AgentGym-RL: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning. arXiv preprint arXiv:2509.08755 (2025)
- [57] Xie, J., Zhang, K., Chen, J., Zhu, T., Lou, R., Tian, Y., Xiao, Y., Su, Y.: TravelPlanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622 (2024)
- [58] Yao, S., Chen, H., Yang, J., Narasimhan, K.: WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems 35, 20744–20757 (2022)
- [59] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36, 11809–11822 (2023)
- [60] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: The Eleventh International Conference on Learning Representations (2022)
- [61] Zhai, S., Bai, H., Lin, Z., Pan, J., Tong, P., Zhou, Y., Suhr, A., Xie, S., LeCun, Y., Ma, Y., et al.: Fine-tuning large vision-language models as decision-making agents via reinforcement learning. Advances in Neural Information Processing Systems 37, 110935–110971 (2024)
- [62] Zhang, K., Li, J., Li, G., Shi, X., Jin, Z.: CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 13643–13658 (2024)
- [63] Zhang, N., Zhao, Y., Yang, M., Dai, S.: LLMs augmented hierarchical reinforcement learning with action primitives for long-horizon manipulation tasks. Scientific Reports 15(1), 36779 (2025)
- [64] Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al.: A survey of large language models. arXiv preprint arXiv:2303.18223 1(2), 1–124 (2023)
- [65] Zhu, X., Chen, Y., Tian, H., Tao, C., Su, W., Yang, C., Huang, G., Li, B., Lu, L., Wang, X., et al.: Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144 (2023)
- [66] Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: RT-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)