Latent Action Reparameterization for Efficient Agent Inference

Wenhao Huang, Qingwen Zeng, Qiyue Chen, Zijie Guo, Yu Sun, Cheng Yang, Siru Ouyang, Jiri Gesi, Fang Wu, Jiayi Zhang, Huaming Chen, Bang Liu, Xiangru Tang, Chenglin Wu

arxiv: 2605.18597 · v2 · pith:G67634FRnew · submitted 2026-05-18 · 💻 cs.AI

Pith reviewed 2026-05-20 10:41 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents · latent action reparameterization · action space learning · decision horizon reduction · inference efficiency · trajectory learning · multi-step behaviors

The pith

Reparameterizing LLM agent actions into learned latent units shortens the effective decision horizon while preserving expressiveness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that long sequences of low-level textual actions create a key bottleneck for LLM agents by inflating the number of steps in planning and execution. It proposes Latent Action Reparameterization to learn a compact set of latent actions from observed trajectories, where each latent action stands for a multi-step semantic behavior. This reparameterization lets the agent decide and act at a higher level of abstraction, directly integrated into the model without separate hierarchical controllers. If the approach holds, it reduces the tokens processed and wall-clock time for inference under fixed budgets while keeping or raising task success rates. The work frames action representation learning as an underexplored lever for scaling efficient agent systems alongside architecture and hardware advances.

Core claim

We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi-step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations.

What carries the argument

Latent Action Reparameterization (LAR), which learns compact latent actions from trajectories to encode multi-step semantic behaviors and reparameterizes the original action space for higher-level decision making.
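To make the idea of reparameterization concrete, here is a minimal sketch. It is not the authors' method: the abstract does not specify the latent objective, so the sketch swaps in a greedy byte-pair-style merge over action sequences as a stand-in. The function names (`learn_latent_actions`, `merge_pair`, `decode`) and the toy action strings are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Unit = Tuple[str, ...]  # one latent unit = a tuple of low-level actions


def merge_pair(seq: List[Unit], left: Unit, right: Unit) -> List[Unit]:
    """Replace every adjacent (left, right) occurrence with one fused unit."""
    out: List[Unit] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_latent_actions(
    trajectories: List[List[str]], num_merges: int
) -> Tuple[List[Unit], List[List[Unit]]]:
    """Greedy BPE-style stand-in: repeatedly fuse the most frequent adjacent pair."""
    seqs: List[List[Unit]] = [[(a,) for a in traj] for traj in trajectories]
    codebook: List[Unit] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so no further abstraction is useful
            break
        codebook.append(left + right)
        seqs = [merge_pair(seq, left, right) for seq in seqs]
    return codebook, seqs


def decode(units: List[Unit]) -> List[str]:
    """Expand latent units back into the original low-level action sequence."""
    return [action for unit in units for action in unit]


# Hypothetical toy trajectories, for illustration only.
trajectories = [
    ["search", "open", "read", "answer"],
    ["search", "open", "read", "search", "open", "read", "answer"],
]
codebook, encoded = learn_latent_actions(trajectories, num_merges=2)
for original, units in zip(trajectories, encoded):
    assert decode(units) == original  # lossless round trip
    print(len(original), "->", len(units), "decisions")
```

On these toy trajectories each search/open/read run collapses into one unit, so the decision counts drop from 4 and 7 to 2 and 3, and `decode` recovers the original sequences exactly. A learned latent space need not be lossless in this way; that is the open question the review raises below.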

If this is right

The effective action horizon shortens because each latent action covers multiple original steps.
Inference runs with fewer action tokens and lower wall-clock time under fixed compute budgets.
Task success rates remain stable or rise across LLM-based agent benchmarks.
Planning and execution both shift to operate over the abstract latent representations.
Action representation learning emerges as a distinct factor that complements model and hardware improvements for scaling agent inference.
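
The first consequence above is simple arithmetic. The notation here ($T$, $M$, $k_i$) is an assumption for illustration and does not come from the paper: a trajectory of $T$ low-level actions is segmented into $M$ latent actions, where latent action $i$ covers $k_i$ original steps.

```latex
\[
  \sum_{i=1}^{M} k_i = T,
  \qquad
  \bar{k} = \frac{T}{M},
  \qquad
  M = \frac{T}{\bar{k}} .
\]
% Hypothetical example: T = 12 steps split into spans (4, 3, 1, 4).
\[
  M = 4, \qquad \bar{k} = \frac{12}{4} = 3 .
\]
```

In this made-up example the agent makes 4 decisions instead of 12. A span with $k_i = 1$ is just a single original action, which is one way the original action space could stay reachable.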

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-derived latent actions might transfer to related but unseen tasks, reducing the need for fresh learning on each new problem.
Combining the reparameterization with existing prompt or system optimizations could produce additive reductions in inference cost.
If latent units prove stable, future agents could discover and refine them continuously from ongoing interactions rather than from fixed training trajectories.

Load-bearing premise

Latent actions learned from observed trajectories will stay sufficiently expressive and generalize across new tasks without hand-crafted macros or external hierarchical controllers.

What would settle it

Measure whether task success rates fall significantly on held-out benchmarks when agents operate exclusively over the learned latent actions rather than the original low-level action sequences.
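
One way to make "fall significantly" operational is a two-proportion z-test on held-out success counts. This is a sketch under stated assumptions: the function name `two_proportion_z_test` is hypothetical, the two runs are treated as independent samples, and the counts in the usage lines are invented placeholders, not results from the paper.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    successes_a: int, n_a: int, successes_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, one-sided p-value) for the alternative rate_a > rate_b.

    Here a = agent over original low-level actions, b = agent over latent actions.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # all successes or all failures in both arms
        return 0.0, 0.5
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under the null
    return z, p_value


# Invented placeholder counts, for illustration only.
z, p = two_proportion_z_test(successes_a=80, n_a=100, successes_b=60, n_b=100)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

If both agents are run on the same held-out tasks, a paired test such as McNemar's would fit the data better than this unpaired sketch.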

Figures

Figures reproduced from arXiv: 2605.18597 by Bang Liu, Chenglin Wu, Cheng Yang, Fang Wu, Huaming Chen, Jiayi Zhang, Jiri Gesi, Qingwen Zeng, Qiyue Chen, Siru Ouyang, Wenhao Huang, Xiangru Tang, Yu Sun, Zijie Guo.

**Figure 2.** These two curves indicate two different character…

**Figure 3.** Progressive abstraction ablation results. Task per…

**Figure 4.** Case analysis of LAR on a TriviaQA example. LAR abstracts low-entropy structural…

Original abstract

Large language model (LLM) agents often rely on long sequences of low-level textual actions, resulting in large effective decision horizons and high inference cost. While prior work has focused on improving inference efficiency through system-level optimizations or prompt engineering, we argue that a key bottleneck lies in the representation of the action space itself. We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi-step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations. Across a range of LLM-based agent benchmarks, LAR significantly reduces the effective action horizon and improves inference efficiency under fixed compute budgets. As a consequence, our approach achieves substantial reductions in action tokens and corresponding wall-clock inference time, while maintaining or improving task success rates. These results suggest that action representation learning is a critical and underexplored factor in scaling efficient LLM agent inference, complementary to advances in model architecture and hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LAR learns latent actions from trajectories to shorten LLM agent horizons and cut inference costs, but the generalization story rests on thin evidence.

The main point is that this work learns a compact latent action space directly from agent trajectories and folds it into the LLM so the model plans and acts over fewer steps. That produces measurable drops in token count and wall-clock time on the benchmarks they test, while success rates stay flat or improve a bit. The approach avoids hand-crafted macros and instead treats the latents as learned multi-step behaviors that get used at both planning and execution time. That framing is a reasonable extension of older hierarchical ideas into the LLM-agent setting, and the efficiency numbers are the part that could matter for people running these systems at scale under fixed budgets. Credit to them for showing the idea can be integrated without an external controller.

The soft spot is exactly the one the stress-test flags. The abstract gives no equations for the latent objective, no description of how trajectories were collected or diversified, and no ablations on out-of-distribution tasks. If the latents mostly capture correlations from the training rollouts, the shorter horizon might simply trade away fine-grained control on new problems rather than preserve expressiveness. Without those checks visible, it is hard to know whether the reported gains are robust or just fitting the evaluation distribution.

This is the kind of paper that would interest people already working on practical LLM-agent deployment and token budgets. A reader who wants concrete levers for inference cost would get something usable from the results section, even if the method section needs more detail. It is coherent enough on its own terms to deserve a serious referee rather than a desk reject; the efficiency angle is worth checking with proper ablations and OOD tests.

Referee Report

2 major / 2 minor

Summary. The paper introduces Latent Action Reparameterization (LAR) for LLM-based agents. It learns a compact latent action space from observed trajectories such that each latent unit encodes a multi-step semantic behavior. Reparameterizing the original action space into these latents allows planning and execution over a shorter effective horizon while aiming to retain the expressiveness of the full action space. The method is integrated directly into the model without hand-crafted macros or external hierarchies, yielding fewer action tokens, lower wall-clock inference time, and maintained or improved success rates on agent benchmarks.

Significance. If the central claims hold, the work is significant for identifying action representation learning as a complementary lever to system-level optimizations and prompt engineering in scaling LLM agent inference. Learning latents directly from trajectories avoids reliance on external controllers and demonstrates concrete efficiency gains under fixed compute. This framing of the action space as a learnable bottleneck is a useful perspective for the field.

major comments (2)

[§3.2] Latent learning objective: the claim that latent actions preserve full expressiveness of the original space is load-bearing for the shorter-horizon benefit, yet the objective appears to optimize only reconstruction on training trajectories; without an explicit coverage or diversity term, it is possible that fine-grained distinctions required for novel tasks are collapsed.
[Experiments] Benchmark results: success rates are reported as maintained or improved, but no ablation isolates whether gains arise from true semantic abstraction versus trajectory-specific correlations; if the latter, the method would not deliver the claimed generalization across new tasks without retraining.

minor comments (2)

[Abstract] The phrase 'substantial reductions' would be clearer if accompanied by concrete factors or percentages for token count and wall-clock time.
[§3.1] Notation: the mapping from latent to low-level actions during execution should be stated explicitly to clarify any added decoding overhead.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where the feedback identifies areas for improvement.

Referee: [§3.2] §3.2 (latent learning objective): the claim that latent actions preserve full expressiveness of the original space is load-bearing for the shorter-horizon benefit, yet the objective appears to optimize only reconstruction on training trajectories; without an explicit coverage or diversity term, it is possible that fine-grained distinctions required for novel tasks are collapsed.

Authors: We agree that the reconstruction objective in §3.2 does not include an explicit diversity or coverage regularizer, which leaves open the theoretical possibility of collapsed distinctions on out-of-distribution tasks. Our defense rests on the fact that the training trajectories are drawn from a broad distribution of successful agent executions across multiple environments and task types, which in practice encourages the latent space to encode semantically distinct behaviors rather than collapsing them. The decoder is trained to reconstruct the original action sequences, preserving the ability to express any observed behavior. That said, we acknowledge this is an assumption rather than a formally proven guarantee. In the revised manuscript we will expand §3.2 with a dedicated paragraph discussing this limitation, add a quantitative analysis of latent-space coverage (e.g., entropy of the latent distribution and nearest-neighbor distances between distinct original actions), and clarify that novel tasks are handled by composing multiple latent actions. We will also note the absence of an explicit diversity term as a direction for future work.

revision: partial

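
The coverage analysis mentioned in this response could start from something as simple as the usage entropy of the latent codes. The following is a minimal sketch, assuming the latent actions are discrete and identified by integer IDs, which the page does not confirm. The function names `latent_usage_entropy` and `normalized_entropy` are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def latent_usage_entropy(latent_ids: Iterable[int]) -> float:
    """Shannon entropy (in bits) of how often each latent action is used."""
    counts = Counter(latent_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(latent_ids: Iterable[int], codebook_size: int) -> float:
    """Entropy divided by its maximum, log2(codebook_size); 1.0 means uniform use."""
    if codebook_size <= 1:
        return 0.0
    return latent_usage_entropy(latent_ids) / math.log2(codebook_size)


# Hypothetical latent IDs emitted over some evaluation rollouts.
usage = [0, 0, 0, 1, 2, 0, 0, 3]
print(f"{normalized_entropy(usage, codebook_size=4):.3f}")
```

A value near 0 would suggest that a few latents dominate, which is the collapse the referee worries about; a value near 1 shows even usage but does not by itself show that the latents are semantically distinct.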
Referee: Experiments (benchmark results): success rates are reported as maintained or improved, but no ablation isolates whether gains arise from true semantic abstraction versus trajectory-specific correlations; if the latter, the method would not deliver the claimed generalization across new tasks without retraining.

Authors: This is a fair critique of our experimental design. While the reported benchmarks already include tasks held out from the latent-learning phase, we did not include an explicit ablation that contrasts semantic abstraction against mere trajectory-specific correlations. To address the concern directly, the revised version will add a new ablation subsection. We will (1) retrain LAR on a deliberately restricted trajectory set that lacks diversity and evaluate generalization on disjoint task families, and (2) compare against a non-semantic baseline that performs simple action clustering without the learned latent objective. The results of these controls will be reported transparently; if they support the semantic-abstraction interpretation we will highlight them, and if they reveal limitations we will discuss them as such.

revision: yes

Circularity Check

0 steps flagged

No circularity: framework proposal with empirical claims, no derivation chain or self-referential reductions present

The provided abstract and description outline a proposed framework (LAR) that learns compact latent actions from trajectories to enable shorter-horizon planning while preserving expressiveness. No equations, derivations, or mathematical steps are shown. Claims rest on empirical outcomes (reduced tokens, maintained success rates across benchmarks) rather than any closed-form prediction or first-principles result that reduces to its inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are referenced in the text. The learning-from-trajectories step is presented as an input to the method, not as a tautological output. This is a standard non-circular empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim implicitly rests on the unstated assumption that a learned latent space can be both compact and lossless.

pith-pipeline@v0.9.0 · 5787 in / 1117 out tokens · 32565 ms · 2026-05-20T10:41:30.661028+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

latent actions learned from agent trajectories... approximated through the next-token entropy of a candidate segment H(s)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

reparameterizing agent actions into latent units... shorter effective horizon while preserving expressiveness

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 14 internal anchors

[1]

Hierarchical reinforcement learning: a survey.International journal of computing and digital systems, 4(02), 2015

Mostafa Al-Emran. Hierarchical reinforcement learning: a survey.International journal of computing and digital systems, 4(02), 2015

work page 2015
[2]

Modeling and planning with macro-actions in decentralized pomdps.Journal of Artificial Intelligence Research, 64:817–859, 2019

Christopher Amato, George Konidaris, Leslie P Kaelbling, and Jonathan P How. Modeling and planning with macro-actions in decentralized pomdps.Journal of Artificial Intelligence Research, 64:817–859, 2019

work page 2019
[3]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

The option-critic architecture

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. InProceedings of the AAAI conference on artificial intelligence, volume 31, 2017

work page 2017
[5]

Fastmtp: Accelerating llm inference with enhanced multi-token prediction.arXiv preprint arXiv:2509.18362, 2025

Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, and Xi Chen. Fastmtp: Accelerating llm inference with enhanced multi-token prediction.arXiv preprint arXiv:2509.18362, 2025

work page arXiv 2025
[6]

Efficient sequential decision making with large language models.arXiv preprint arXiv:2406.12125, 2024

Dingyang Chen, Qi Zhang, and Yinglun Zhu. Efficient sequential decision making with large language models.arXiv preprint arXiv:2406.12125, 2024

work page arXiv 2024
[7]

Hardware-aware parallel prompt decoding for memory-efficient acceleration of llm inference.arXiv preprint arXiv:2405.18628, 2024

Hao Mark Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I Venieris, and Hongxiang Fan. Hardware-aware parallel prompt decoding for memory-efficient acceleration of llm inference.arXiv preprint arXiv:2405.18628, 2024

work page arXiv 2024
[8]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.arXiv preprint arXiv:2503.09567, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

A comprehensive survey of prompt engineering techniques in large language models.TechRxiv, 2025

Tonmoy Debnath, Md Nurul Absar Siddiky, Muhammad Enayetur Rahman, Prosenjit Das, Antu Kumar Guha, Muhammad Rezaur Rahman, and HM Kabir. A comprehensive survey of prompt engineering techniques in large language models.TechRxiv, 2025

work page 2025
[11]

Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

work page 2023
[12]

The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

work page 2024
[13]

Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems, 6:325–338, 2024

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems, 6:325–338, 2024. 10

work page 2024
[14]

Agentquest: A modular benchmark framework to measure progress and improve llm agents.arXiv preprint arXiv:2404.06411, 2024

Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, and Carolin Lawrence. Agentquest: A modular benchmark framework to measure progress and improve llm agents.arXiv preprint arXiv:2404.06411, 2024

work page arXiv 2024
[15]

Robotouille: An asynchronous planning benchmark for llm agents.arXiv preprint arXiv:2502.05227, 2025

Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, and Sanjiban Choudhury. Robotouille: An asynchronous planning benchmark for llm agents.arXiv preprint arXiv:2502.05227, 2025

work page arXiv 2025
[16]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension.arXiv preprint arXiv:1705.03551, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[18]

A., Wutschitz, L., Chen, Y ., Sim, R., and Rajmohan, S

Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. Acon: Optimizing context compression for long-horizon llm agents.arXiv preprint arXiv:2510.00615, 2025

work page arXiv 2025
[19]

Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, and Weinan Zhang

Jeonghye Kim, Sojeong Rhee, Minbeom Kim, Dohyung Kim, Sangmook Lee, Youngchul Sung, and Kyomin Jung. Reflact: World-grounded decision making in llm agents via goal-state reflection.arXiv preprint arXiv:2505.15182, 2025

work page arXiv 2025
[20]

Critic-guided decoding for controlled text generation

Minbeom Kim, Hwanhee Lee, Kang Min Yoo, Joonsuk Park, Hwaran Lee, and Kyomin Jung. Critic-guided decoding for controlled text generation. InFindings of the Association for Computational Linguistics: ACL 2023, pages 4598–4612, 2023

work page 2023
[21]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

work page 2023
[22]

Guiding large language models via directional stimulus prompting.Advances in Neural Information Processing Systems, 36:62630–62656, 2023

Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, and Xifeng Yan. Guiding large language models via directional stimulus prompting.Advances in Neural Information Processing Systems, 36:62630–62656, 2023

work page 2023
[23]

CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

Jiayu Liu, Cheng Qian, Zhaochen Su, Qing Zong, Shijue Huang, Bingxiang He, and Yi R Fung. Costbench: Evaluating multi-turn cost-optimal planning and adaptation in dynamic environments for llm tool-use agents.arXiv preprint arXiv:2511.02734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Memgpt: Towards llms as operating systems

Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonza- lez. Memgpt: Towards llms as operating systems. 2023

work page 2023
[26]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

work page 2023
[27]

Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36: 68539–68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36: 68539–68551, 2023

work page 2023
[28]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023. 11

work page 2023
[30]

Distilling reasoning capabilities into smaller language models.Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073, 2023

Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models.Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073, 2023

work page 2023
[31]

Concisehint: Boosting efficient reasoning via continuous concise hints during generation.arXiv preprint arXiv:2506.18810, 2025

Siao Tang, Xinyin Ma, Gongfan Fang, and Xinchao Wang. Concisehint: Boosting efficient reasoning via continuous concise hints during generation.arXiv preprint arXiv:2506.18810, 2025

work page arXiv 2025
[32]

Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

work page 2022
[33]

Efficient large language models: A survey.arXiv preprint arXiv:2312.03863, 1, 2023

Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, et al. Efficient large language models: A survey.arXiv preprint arXiv:2312.03863, 2023

work page arXiv 2023
[34]

A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

work page 2024
[35]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[37]

Tokenskip: Controllable chain-of-thought compression in llms.arXiv preprint arXiv:2502.12067,

Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms.arXiv preprint arXiv:2502.12067, 2025

work page arXiv 2025
[38]

KodCode: A di- verse, challenging, and verifiable synthetic dataset for coding.arXiv preprint, arXiv:2503.02951, 2025

Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A di- verse, challenging, and verifiable synthetic dataset for coding.arXiv preprint arXiv:2503.02951, 2025

work page arXiv 2025
[39]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

From what to why: A multi-agent system for evidence-based chemical reaction condition reasoning.arXiv preprint arXiv:2509.23768, 2025

Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, and Feiwei Qin. From what to why: A multi-agent system for evidence-based chemical reaction condition reasoning.arXiv preprint arXiv:2509.23768, 2025

work page arXiv 2025
[41]

Aria: Training language agents with intention-driven reward aggregation

Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Siyu Yuan, Jiangjie Chen, Deqing Yang, and Yanghua Xiao. Aria: Training language agents with intention-driven reward aggregation. arXiv preprint arXiv:2506.00539, 2025

work page arXiv 2025
[42]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

work page 2022
[43]

Dual latent memory for visual multi-agent system.arXiv preprint arXiv:2602.00471, 2026

Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, et al. Dual latent memory for visual multi-agent system.arXiv preprint arXiv:2602.00471, 2026

work page arXiv 2026
[44]

Enhancing decision-making for llm agents via step-level q-value models

Yuanzhao Zhai, Tingkai Yang, Kele Xu, Dawei Feng, Cheng Yang, Bo Ding, and Huaimin Wang. Enhancing decision-making for llm agents via step-level q-value models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27161–27169, 2025

work page 2025
[45]

AFlow: Automating Agentic Workflow Generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation.arXiv preprint arXiv:2410.10762, 2024. 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

Autoenv: Automated environments for measuring cross-environment agent learning

Jiayi Zhang, Yiran Peng, Fanqi Kong, Yang Cheng, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, et al. Autoenv: Automated environments for measuring cross-environment agent learning. arXiv preprint arXiv:2511.19304, 2025.
[47]

Harnessing Agentic Evolution

Jiayi Zhang, Yongfeng Gu, Jianhao Ruan, Maojia Song, Yiran Peng, Zhiguang Han, Jinyu Xiang, Zhitao Wang, Caiyin Yang, Yixi Ouyang, Bang Liu, Chenglin Wu, and Yuyu Luo. Harnessing agentic evolution. arXiv preprint arXiv:2605.13821, 2026.
[48]

A survey on the memory mechanism of large language model-based agents

Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025.
[49]

Prise: Llm-style sequence compression for learning temporal action abstractions in control

Ruijie Zheng, Ching-An Cheng, Hal Daumé III, Furong Huang, and Andrey Kolobov. Prise: Llm-style sequence compression for learning temporal action abstractions in control. arXiv preprint arXiv:2402.10450, 2024.
[50]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
[51]

A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models, 2024. URL https://arxiv.org/abs/2404.14294, 2024.

A Detailed Experimental Setup and Design Rationale

A.1 Agent Models and Training Protocol

We conduct experiments using Met...
