pith. sign in

arxiv: 2605.18597 · v2 · pith:G67634FRnew · submitted 2026-05-18 · 💻 cs.AI

Latent Action Reparameterization for Efficient Agent Inference

Pith reviewed 2026-05-20 10:41 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agentslatent action reparameterizationaction space learningdecision horizon reductioninference efficiencytrajectory learningmulti-step behaviors
0
0 comments X

The pith

Reparameterizing LLM agent actions into learned latent units shortens the effective decision horizon while preserving expressiveness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that long sequences of low-level textual actions create a key bottleneck for LLM agents by inflating the number of steps in planning and execution. It proposes Latent Action Reparameterization to learn a compact set of latent actions from observed trajectories, where each latent action stands for a multi-step semantic behavior. This reparameterization lets the agent decide and act at a higher level of abstraction, directly integrated into the model without separate hierarchical controllers. If the approach holds, it reduces the tokens processed and wall-clock time for inference under fixed budgets while keeping or raising task success rates. The work frames action representation learning as an underexplored lever for scaling efficient agent systems alongside architecture and hardware advances.

Core claim

We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi-step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations.

What carries the argument

Latent Action Reparameterization (LAR), which learns compact latent actions from trajectories to encode multi-step semantic behaviors and reparameterizes the original action space for higher-level decision making.

If this is right

  • The effective action horizon shortens because each latent action covers multiple original steps.
  • Inference runs with fewer action tokens and lower wall-clock time under fixed compute budgets.
  • Task success rates remain stable or rise across LLM-based agent benchmarks.
  • Planning and execution both shift to operate over the abstract latent representations.
  • Action representation learning emerges as a distinct factor that complements model and hardware improvements for scaling agent inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trajectory-derived latent actions might transfer to related but unseen tasks, reducing the need for fresh learning on each new problem.
  • Combining the reparameterization with existing prompt or system optimizations could produce additive reductions in inference cost.
  • If latent units prove stable, future agents could discover and refine them continuously from ongoing interactions rather than from fixed training trajectories.

Load-bearing premise

Latent actions learned from observed trajectories will stay sufficiently expressive and generalize across new tasks without hand-crafted macros or external hierarchical controllers.

What would settle it

Measure whether task success rates fall significantly on held-out benchmarks when agents operate exclusively over the learned latent actions rather than the original low-level action sequences.

Figures

Figures reproduced from arXiv: 2605.18597 by Bang Liu, Chenglin Wu, Cheng Yang, Fang Wu, Huaming Chen, Jiayi Zhang, Jiri Gesi, Qingwen Zeng, Qiyue Chen, Siru Ouyang, Wenhao Huang, Xiangru Tang, Yu Sun, Zijie Guo.

Figure 1
Figure 1. Figure 1: Overview of Latent Action Reparameterization (LAR). LAR reformulates agent decision [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: These two curves indicate two different character [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Progressive abstraction ablation results. Task per [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Case analysis of LAR on a TriviaQA example. LAR abstracts low-entropy structural [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗
read the original abstract

Large language model (LLM) agents often rely on long sequences of low-level textual actions, resulting in large effective decision horizons and high inference cost. While prior work has focused on improving inference efficiency through system-level optimizations or prompt engineering, we argue that a key bottleneck lies in the representation of the action space itself. We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi-step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations. Across a range of LLM-based agent benchmarks, LAR significantly reduces the effective action horizon and improves inference efficiency under fixed compute budgets. As a consequence, our approach achieves substantial reductions in action tokens and corresponding wall-clock inference time, while maintaining or improving task success rates. These results suggest that action representation learning is a critical and underexplored factor in scaling efficient LLM agent inference, complementary to advances in model architecture and hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Latent Action Reparameterization (LAR) for LLM-based agents. It learns a compact latent action space from observed trajectories such that each latent unit encodes a multi-step semantic behavior. Reparameterizing the original action space into these latents allows planning and execution over a shorter effective horizon while aiming to retain the expressiveness of the full action space. The method is integrated directly into the model without hand-crafted macros or external hierarchies, yielding fewer action tokens, lower wall-clock inference time, and maintained or improved success rates on agent benchmarks.

Significance. If the central claims hold, the work is significant for identifying action representation learning as a complementary lever to system-level optimizations and prompt engineering in scaling LLM agent inference. Learning latents directly from trajectories avoids reliance on external controllers and demonstrates concrete efficiency gains under fixed compute. This framing of the action space as a learnable bottleneck is a useful perspective for the field.

major comments (2)
  1. [§3.2] §3.2 (latent learning objective): the claim that latent actions preserve full expressiveness of the original space is load-bearing for the shorter-horizon benefit, yet the objective appears to optimize only reconstruction on training trajectories; without an explicit coverage or diversity term, it is possible that fine-grained distinctions required for novel tasks are collapsed.
  2. [Experiments] Experiments (benchmark results): success rates are reported as maintained or improved, but no ablation isolates whether gains arise from true semantic abstraction versus trajectory-specific correlations; if the latter, the method would not deliver the claimed generalization across new tasks without retraining.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'substantial reductions' would be clearer if accompanied by concrete factors or percentages for token count and wall-clock time.
  2. [§3.1] Notation: the mapping from latent to low-level actions during execution should be stated explicitly to clarify any added decoding overhead.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where the feedback identifies areas for improvement.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (latent learning objective): the claim that latent actions preserve full expressiveness of the original space is load-bearing for the shorter-horizon benefit, yet the objective appears to optimize only reconstruction on training trajectories; without an explicit coverage or diversity term, it is possible that fine-grained distinctions required for novel tasks are collapsed.

    Authors: We agree that the reconstruction objective in §3.2 does not include an explicit diversity or coverage regularizer, which leaves open the theoretical possibility of collapsed distinctions on out-of-distribution tasks. Our defense rests on the fact that the training trajectories are drawn from a broad distribution of successful agent executions across multiple environments and task types, which in practice encourages the latent space to encode semantically distinct behaviors rather than collapsing them. The decoder is trained to reconstruct the original action sequences, preserving the ability to express any observed behavior. That said, we acknowledge this is an assumption rather than a formally proven guarantee. In the revised manuscript we will expand §3.2 with a dedicated paragraph discussing this limitation, add a quantitative analysis of latent-space coverage (e.g., entropy of the latent distribution and nearest-neighbor distances between distinct original actions), and clarify that novel tasks are handled by composing multiple latent actions. We will also note the absence of an explicit diversity term as a direction for future work. revision: partial

  2. Referee: Experiments (benchmark results): success rates are reported as maintained or improved, but no ablation isolates whether gains arise from true semantic abstraction versus trajectory-specific correlations; if the latter, the method would not deliver the claimed generalization across new tasks without retraining.

    Authors: This is a fair critique of our experimental design. While the reported benchmarks already include tasks held out from the latent-learning phase, we did not include an explicit ablation that contrasts semantic abstraction against mere trajectory-specific correlations. To address the concern directly, the revised version will add a new ablation subsection. We will (1) retrain LAR on a deliberately restricted trajectory set that lacks diversity and evaluate generalization on disjoint task families, and (2) compare against a non-semantic baseline that performs simple action clustering without the learned latent objective. The results of these controls will be reported transparently; if they support the semantic-abstraction interpretation we will highlight them, and if they reveal limitations we will discuss them as such. revision: yes

Circularity Check

0 steps flagged

No circularity: framework proposal with empirical claims, no derivation chain or self-referential reductions present

full rationale

The provided abstract and description outline a proposed framework (LAR) that learns compact latent actions from trajectories to enable shorter-horizon planning while preserving expressiveness. No equations, derivations, or mathematical steps are shown. Claims rest on empirical outcomes (reduced tokens, maintained success rates across benchmarks) rather than any closed-form prediction or first-principles result that reduces to its inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are referenced in the text. The learning-from-trajectories step is presented as an input to the method, not as a tautological output. This is a standard non-circular empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim implicitly rests on the unstated assumption that a learned latent space can be both compact and lossless.

pith-pipeline@v0.9.0 · 5787 in / 1117 out tokens · 32565 ms · 2026-05-20T10:41:30.661028+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 14 internal anchors

  1. [1]

    Hierarchical reinforcement learning: a survey.International journal of computing and digital systems, 4(02), 2015

    Mostafa Al-Emran. Hierarchical reinforcement learning: a survey.International journal of computing and digital systems, 4(02), 2015

  2. [2]

    Modeling and planning with macro-actions in decentralized pomdps.Journal of Artificial Intelligence Research, 64:817–859, 2019

    Christopher Amato, George Konidaris, Leslie P Kaelbling, and Jonathan P How. Modeling and planning with macro-actions in decentralized pomdps.Journal of Artificial Intelligence Research, 64:817–859, 2019

  3. [3]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  4. [4]

    The option-critic architecture

    Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. InProceedings of the AAAI conference on artificial intelligence, volume 31, 2017

  5. [5]

    Fastmtp: Accelerating llm inference with enhanced multi-token prediction.arXiv preprint arXiv:2509.18362, 2025

    Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, and Xi Chen. Fastmtp: Accelerating llm inference with enhanced multi-token prediction.arXiv preprint arXiv:2509.18362, 2025

  6. [6]

    Efficient sequential decision making with large language models.arXiv preprint arXiv:2406.12125, 2024

    Dingyang Chen, Qi Zhang, and Yinglun Zhu. Efficient sequential decision making with large language models.arXiv preprint arXiv:2406.12125, 2024

  7. [7]

    Hardware-aware parallel prompt decoding for memory-efficient acceleration of llm inference.arXiv preprint arXiv:2405.18628, 2024

    Hao Mark Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I Venieris, and Hongxiang Fan. Hardware-aware parallel prompt decoding for memory-efficient acceleration of llm inference.arXiv preprint arXiv:2405.18628, 2024

  8. [8]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  9. [9]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.arXiv preprint arXiv:2503.09567, 2025

  10. [10]

    A comprehensive survey of prompt engineering techniques in large language models.TechRxiv, 2025

    Tonmoy Debnath, Md Nurul Absar Siddiky, Muhammad Enayetur Rahman, Prosenjit Das, Antu Kumar Guha, Muhammad Rezaur Rahman, and HM Kabir. A comprehensive survey of prompt engineering techniques in large language models.TechRxiv, 2025

  11. [11]

    Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  12. [12]

    The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

  13. [13]

    Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems, 6:325–338, 2024

    In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems, 6:325–338, 2024. 10

  14. [14]

    Agentquest: A modular benchmark framework to measure progress and improve llm agents.arXiv preprint arXiv:2404.06411, 2024

    Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, and Carolin Lawrence. Agentquest: A modular benchmark framework to measure progress and improve llm agents.arXiv preprint arXiv:2404.06411, 2024

  15. [15]

    Robotouille: An asynchronous planning benchmark for llm agents.arXiv preprint arXiv:2502.05227, 2025

    Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, and Sanjiban Choudhury. Robotouille: An asynchronous planning benchmark for llm agents.arXiv preprint arXiv:2502.05227, 2025

  16. [16]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

  17. [17]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension.arXiv preprint arXiv:1705.03551, 2017

  18. [18]

    A., Wutschitz, L., Chen, Y ., Sim, R., and Rajmohan, S

    Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. Acon: Optimizing context compression for long-horizon llm agents.arXiv preprint arXiv:2510.00615, 2025

  19. [19]

    Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, and Weinan Zhang

    Jeonghye Kim, Sojeong Rhee, Minbeom Kim, Dohyung Kim, Sangmook Lee, Youngchul Sung, and Kyomin Jung. Reflact: World-grounded decision making in llm agents via goal-state reflection.arXiv preprint arXiv:2505.15182, 2025

  20. [20]

    Critic-guided decoding for controlled text generation

    Minbeom Kim, Hwanhee Lee, Kang Min Yoo, Joonsuk Park, Hwaran Lee, and Kyomin Jung. Critic-guided decoding for controlled text generation. InFindings of the Association for Computational Linguistics: ACL 2023, pages 4598–4612, 2023

  21. [21]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning, pages 19274–19286. PMLR, 2023

  22. [22]

    Guiding large language models via directional stimulus prompting.Advances in Neural Information Processing Systems, 36:62630–62656, 2023

    Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, and Xifeng Yan. Guiding large language models via directional stimulus prompting.Advances in Neural Information Processing Systems, 36:62630–62656, 2023

  23. [23]

    CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

    Jiayu Liu, Cheng Qian, Zhaochen Su, Qing Zong, Shijue Huang, Bingxiang He, and Yi R Fung. Costbench: Evaluating multi-turn cost-optimal planning and adaptation in dynamic environments for llm tool-use agents.arXiv preprint arXiv:2511.02734, 2025

  24. [24]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2308.03688, 2023

  25. [25]

    Memgpt: Towards llms as operating systems

    Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonza- lez. Memgpt: Towards llms as operating systems. 2023

  26. [26]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

  27. [27]

    Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36: 68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36: 68539–68551, 2023

  28. [28]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  29. [29]

    Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023. 11

  30. [30]

    Distilling reasoning capabilities into smaller language models.Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073, 2023

    Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. Distilling reasoning capabilities into smaller language models.Findings of the Association for Computational Linguistics: ACL 2023, pages 7059–7073, 2023

  31. [31]

    Concisehint: Boosting efficient reasoning via continuous concise hints during generation.arXiv preprint arXiv:2506.18810, 2025

    Siao Tang, Xinyin Ma, Gongfan Fang, and Xinchao Wang. Concisehint: Boosting efficient reasoning via continuous concise hints during generation.arXiv preprint arXiv:2506.18810, 2025

  32. [32]

    Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  33. [33]

    Efficient large language models: A survey.arXiv preprint arXiv:2312.03863, 1, 2023

    Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, et al. Efficient large language models: A survey.arXiv preprint arXiv:2312.03863, 2023

  34. [34]

    A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

  35. [35]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

  36. [36]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  37. [37]

    Tokenskip: Controllable chain-of-thought compression in llms.arXiv preprint arXiv:2502.12067,

    Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms.arXiv preprint arXiv:2502.12067, 2025

  38. [38]

    KodCode: A di- verse, challenging, and verifiable synthetic dataset for coding.arXiv preprint, arXiv:2503.02951, 2025

    Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. Kodcode: A di- verse, challenging, and verifiable synthetic dataset for coding.arXiv preprint arXiv:2503.02951, 2025

  39. [39]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  40. [40]

    From what to why: A multi-agent system for evidence-based chemical reaction condition reasoning.arXiv preprint arXiv:2509.23768, 2025

    Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, and Feiwei Qin. From what to why: A multi-agent system for evidence-based chemical reaction condition reasoning.arXiv preprint arXiv:2509.23768, 2025

  41. [41]

    Aria: Training language agents with intention-driven reward aggregation

    Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Siyu Yuan, Jiangjie Chen, Deqing Yang, and Yanghua Xiao. Aria: Training language agents with intention-driven reward aggregation. arXiv preprint arXiv:2506.00539, 2025

  42. [42]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  43. [43]

    Dual latent memory for visual multi-agent system.arXiv preprint arXiv:2602.00471, 2026

    Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, et al. Dual latent memory for visual multi-agent system.arXiv preprint arXiv:2602.00471, 2026

  44. [44]

    Enhancing decision-making for llm agents via step-level q-value models

    Yuanzhao Zhai, Tingkai Yang, Kele Xu, Dawei Feng, Cheng Yang, Bo Ding, and Huaimin Wang. Enhancing decision-making for llm agents via step-level q-value models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27161–27169, 2025

  45. [45]

    AFlow: Automating Agentic Workflow Generation

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation.arXiv preprint arXiv:2410.10762, 2024. 12

  46. [46]

    Autoenv: Automated environments for measuring cross-environment agent learning.arXiv preprint arXiv:2511.19304, 2025

    Jiayi Zhang, Yiran Peng, Fanqi Kong, Yang Cheng, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jian- hao Ruan, Jinlin Wang, Maojia Song, et al. Autoenv: Automated environments for measuring cross-environment agent learning.arXiv preprint arXiv:2511.19304, 2025

  47. [47]

    Harnessing Agentic Evolution

    Jiayi Zhang, Yongfeng Gu, Jianhao Ruan, Maojia Song, Yiran Peng, Zhiguang Han, Jinyu Xiang, Zhitao Wang, Caiyin Yang, Yixi Ouyang, Bang Liu, Chenglin Wu, and Yuyu Luo. Harnessing agentic evolution.arXiv preprint arXiv:2605.13821, 2026

  48. [48]

    A survey on the memory mechanism of large language model-based agents

    Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025

  49. [49]

    Prise: Llm-style sequence compression for learning temporal action abstractions in control.arXiv preprint arXiv:2402.10450, 2024

    Ruijie Zheng, Ching-An Cheng, Hal Daumé III, Furong Huang, and Andrey Kolobov. Prise: Llm-style sequence compression for learning temporal action abstractions in control.arXiv preprint arXiv:2402.10450, 2024

  50. [50]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models.arXiv preprint arXiv:2205.10625, 2022

  51. [51]

    A Survey on Efficient Inference for Large Language Models

    Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models, 2024.URL https://arxiv. org/abs/2404.14294, 2024. A Detailed Experimental Setup and Design Rationale A.1 Agent Models and Training Protocol We conduct experiments usingMet...