pith. machine review for the scientific record.

arxiv: 2605.07339 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: 2 Lean theorem links

Tools as Continuous Flow for Evolving Agentic Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:27 UTC · model grok-4.3

classification 💻 cs.AI
keywords agentic reasoning · tool chaining · continuous flow matching · long-horizon planning · semantic trajectories · error attenuation · generalization bounds

The pith

Treating tool sequences as continuous trajectories in semantic space reduces error buildup and improves generalization in long-horizon agentic reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current step-by-step methods for orchestrating tools with language models accumulate errors over long sequences and struggle to adapt to unseen tools because they lack a global view. The paper instead models tool chaining as the generation of continuous latent trajectories inside a semantic space, using conditional flow matching to produce coherent plans from start to finish. This continuous formulation is claimed to deliver formal bounds on utility convergence along with built-in guarantees of robust generalization and error attenuation. A new plan-level closed-loop benchmark is introduced to measure performance in dynamic real-world environments. If the approach holds, agents could maintain reliable performance across extended tasks that currently break discrete planners.

Core claim

FlowAgent reconceptualizes tool chaining as continuous trajectory generation within a semantic space. Conditional flow matching produces these trajectories to supply a global planning perspective that ensures coherent and robust tool execution. The authors establish formal bounds on utility convergence and prove that the continuous formulation fundamentally guarantees robust generalization and error attenuation relative to step-wise discrete methods. Empirical results show superior robustness and adaptability on long-horizon reasoning tasks.

What carries the argument

Conditional flow matching that generates continuous latent trajectories for tool chaining, supplying global planning instead of local step-wise decisions.
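The flow-matching machinery named here can be sketched in a few lines. The toy example below (illustrative dimensions and a linear velocity model, not the paper's architecture) shows the core conditional flow matching regression: interpolate between noise and a condition-dependent target latent point, then fit a velocity field to the straight-line displacement.

```python
import numpy as np

# Toy sketch of conditional flow matching (CFM) for latent trajectory
# generation. Dimensions and the linear velocity model are illustrative
# assumptions, not the paper's architecture.

rng = np.random.default_rng(0)
d, c_dim, n = 4, 2, 256            # latent dim, condition dim, batch size

# Synthetic targets: the endpoint x1 depends linearly on the condition c.
A_true = rng.normal(size=(d, c_dim))
cond = rng.normal(size=(n, c_dim))
x1 = cond @ A_true.T               # target latent points ("plan endpoints")
x0 = rng.normal(size=(n, d))       # samples from the noise source
t = rng.uniform(size=(n, 1))       # random interpolation times in [0, 1]

xt = (1 - t) * x0 + t * x1         # points on straight-line probability paths
u = x1 - x0                        # target velocity along those paths

# Linear velocity model v(x, t, c) = W [x; t; c], trained by plain gradient
# descent to regress onto u -- the core CFM objective.
feat = np.concatenate([xt, t, cond], axis=1)
W = np.zeros((d, feat.shape[1]))

def cfm_loss(W):
    return np.mean((feat @ W.T - u) ** 2)

losses = [cfm_loss(W)]
for _ in range(200):
    grad = 2 * (feat @ W.T - u).T @ feat / n   # gradient of the squared error
    W -= 0.05 * grad
    losses.append(cfm_loss(W))

print(losses[0] > losses[-1])      # regression onto the velocity field converges
```

At sampling time, the learned velocity field would be integrated from noise to produce a full trajectory in one global pass, which is what distinguishes this from step-wise tool selection.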

If this is right

  • Error accumulation remains bounded as reasoning horizons lengthen.
  • Performance on unseen tools improves because the model learns trajectory patterns rather than fixed step templates.
  • Global coherence in planning yields higher overall task success rates in dynamic settings.
  • Utility convergence bounds provide a theoretical basis for reliable scaling to longer sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The continuous representation may allow smoother integration with gradient-based optimization loops that current discrete planners cannot use.
  • Projecting generated trajectories back to executable tool calls at runtime could preserve the claimed benefits while satisfying hard API constraints.
  • Similar trajectory modeling might transfer to other sequential domains such as multi-step code generation or embodied planning where continuity reduces compounding mistakes.
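The projection idea in the second bullet might look like the following sketch. Everything here (tool names, embeddings, the availability rule) is invented for illustration: each latent point is snapped to the nearest tool embedding whose preconditions the execution history satisfies.

```python
import numpy as np

# Hypothetical decoding step: snap each point of a continuous latent
# trajectory to the nearest tool embedding that the execution history
# makes valid. Tool names, embeddings, and the availability rule are
# invented for illustration.

rng = np.random.default_rng(1)
tools = ["search", "fetch", "parse", "summarize"]
tool_emb = rng.normal(size=(len(tools), 3))    # one embedding per tool

def decode(trajectory, is_available):
    """Project latent points onto the nearest currently-valid tool."""
    calls = []
    for point in trajectory:
        mask = np.array([is_available(name, calls) for name in tools])
        dists = np.linalg.norm(tool_emb - point, axis=1)
        dists[~mask] = np.inf                  # enforce hard API constraints
        calls.append(tools[int(np.argmin(dists))])
    return calls

# Example state-dependent constraint: "parse" is only legal after "fetch".
def is_available(name, history):
    return name != "parse" or "fetch" in history

plan = decode(rng.normal(size=(5, 3)), is_available)
print(plan)
```

A masked nearest-neighbor decode like this would keep the continuous model's global plan while guaranteeing that every emitted call is executable, which is exactly the discrete-constraint worry raised below.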

Load-bearing premise

Tool chaining and decision sequences in agentic reasoning can be represented as continuous trajectories in semantic space without losing essential discrete branching or state-dependent constraints.

What would settle it

A controlled test in which a discrete branching choice required by the environment causes the continuous model to produce invalid or lower-utility tool sequences while a standard step-wise planner succeeds on the same tasks.

Figures

Figures reproduced from arXiv: 2605.07339 by Qiang Chen, Siyu Shang, Tairan Huang, Xiu Su, Yi Chen.

Figure 1. (a-c) Conceptual illustrations of the fundamental limitations in current discrete tool … (caption truncated at source)
Figure 2. The overall framework of FlowAgent, which reconceptualizes discrete tool chaining as continuous trajectory generation to shift from myopic step-wise selection to plan-level reasoning.
Figure 3. Generalization performance on unseen tools.
Figure 4. Convergence speed of FlowAgent versus the ToolRL baseline on the Retail Success metric.
read the original abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in orchestrating tools for reasoning tasks. However, existing methods rely on a step-wise paradigm that lacks a global perspective, which causes error accumulation over long horizons and restricts generalization to unseen tools. To overcome these limitations, we propose Tools as Continuous Flow for Evolving Agentic Reasoning (FlowAgent), which reconceptualizes tool chaining as continuous trajectory generation within a semantic space. To systematically evaluate this paradigm, we introduce the first plan-level closed-loop benchmark dedicated to plan-level agentic reasoning in dynamic real-world environments. Specifically, the proposed FlowAgent leverages conditional flow matching to generate continuous latent trajectories, providing a global planning perspective to ensure coherent and robust tool execution. Theoretically, we establish formal bounds on utility convergence and prove that our continuous formulation fundamentally guarantees robust generalization and error attenuation. Empirical evaluations show that FlowAgent achieves superior robustness and adaptability in long-horizon reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes FlowAgent, which reconceptualizes LLM tool chaining as continuous latent trajectory generation in semantic space via conditional flow matching. This is claimed to provide a global planning perspective that mitigates error accumulation, with formal bounds on utility convergence, proofs of robust generalization and error attenuation, a new plan-level closed-loop benchmark for dynamic environments, and empirical superiority in long-horizon reasoning tasks.

Significance. If the formal bounds and proofs are rigorously derived and the continuous trajectories faithfully preserve discrete tool constraints, the work could establish a new paradigm for agentic reasoning that improves robustness and generalization beyond step-wise methods. The introduction of a dedicated plan-level benchmark is a clear positive contribution to evaluation in this area.

major comments (3)
  1. [Theoretical Analysis] The abstract and theoretical claims assert formal bounds on utility convergence and a proof that the continuous formulation guarantees robust generalization and error attenuation, yet the manuscript supplies none of the derivations, assumptions, or equations defining these bounds. This is load-bearing for the central theoretical contribution, as it leaves open whether the guarantees are independent or rest on modeling choices that may encode the desired properties (per the reader's circularity concern).
  2. [Method] The core modeling assumption—that tool chaining and decision sequences can be faithfully represented as continuous trajectories in semantic space without losing essential discrete branching or state-dependent constraints (e.g., a tool becoming available only after a specific prior output)—is not verified. No analysis or examples are provided showing how the flow model retains these conditional structures, so the transfer of theoretical bounds to real execution remains unestablished (directly engaging the skeptic's concern).
  3. [Experiments] Empirical evaluations claim superior robustness and adaptability, but the manuscript provides no details on the new benchmark's task definitions, baseline implementations, experimental controls, or statistical tests. Without these, the superiority claims cannot be assessed and do not support the generalization assertions.
minor comments (1)
  1. [Abstract] The abstract is overly dense; separating the theoretical claims, benchmark introduction, and empirical results into distinct sentences would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where the manuscript requires expansion to fully substantiate the claims. We will incorporate the requested details and clarifications in the revised version.

read point-by-point responses
  1. Referee: [Theoretical Analysis] The abstract and theoretical claims assert formal bounds on utility convergence and a proof that the continuous formulation guarantees robust generalization and error attenuation, yet the manuscript supplies none of the derivations, assumptions, or equations defining these bounds. This is load-bearing for the central theoretical contribution, as it leaves open whether the guarantees are independent or rest on modeling choices that may encode the desired properties (per the reader's circularity concern).

    Authors: We agree that the full derivations, assumptions, and equations for the utility convergence bounds and generalization proofs are not present in the submitted manuscript. The claims in the abstract and introduction are based on the conditional flow matching objective, but the detailed proofs were omitted for brevity. In the revision, we will add a new subsection (or expanded appendix) containing the complete derivations, including all assumptions on the latent space and flow dynamics, to demonstrate that the bounds follow directly from the continuous formulation rather than being circular. revision: yes

  2. Referee: [Method] The core modeling assumption—that tool chaining and decision sequences can be faithfully represented as continuous trajectories in semantic space without losing essential discrete branching or state-dependent constraints (e.g., a tool becoming available only after a specific prior output)—is not verified. No analysis or examples are provided showing how the flow model retains these conditional structures, so the transfer of theoretical bounds to real execution remains unestablished (directly engaging the skeptic's concern).

    Authors: The manuscript relies on the conditioning mechanism of conditional flow matching to encode state-dependent constraints, but we acknowledge that no explicit verification, analysis, or illustrative examples of discrete structure preservation are provided. In the revision, we will add a dedicated analysis subsection with concrete examples (e.g., trajectories for conditional tool availability) and a discussion of how the learned vector field respects branching points and state transitions, thereby bridging the continuous model to discrete execution. revision: yes

  3. Referee: [Experiments] Empirical evaluations claim superior robustness and adaptability, but the manuscript provides no details on the new benchmark's task definitions, baseline implementations, experimental controls, or statistical tests. Without these, the superiority claims cannot be assessed and do not support the generalization assertions.

    Authors: We recognize that the current manuscript lacks sufficient detail on the plan-level closed-loop benchmark, including task definitions, baseline re-implementations, controls for environment dynamics, and statistical testing procedures. In the revision, we will expand the experimental section with these specifics, add tables summarizing controls and significance tests (e.g., paired t-tests or bootstrap confidence intervals), and include additional ablation results to strengthen support for the robustness and generalization claims. revision: yes
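The statistical testing this response promises could be as simple as a paired bootstrap over tasks. The sketch below uses synthetic success indicators as stand-ins for real benchmark outcomes to show a 95% confidence interval on the success-rate gap between two agents.

```python
import numpy as np

# Sketch of a paired bootstrap confidence interval for the success-rate
# gap between two agents evaluated on the same tasks. The success
# indicators here are synthetic stand-ins for real benchmark outcomes.

rng = np.random.default_rng(2)
n_tasks = 200
agent_a = rng.random(n_tasks) < 0.70     # per-task success for agent A
agent_b = rng.random(n_tasks) < 0.55     # per-task success for agent B

diffs = []
for _ in range(5000):
    idx = rng.integers(0, n_tasks, n_tasks)    # resample tasks with replacement
    diffs.append(agent_a[idx].mean() - agent_b[idx].mean())

lo, hi = np.percentile(diffs, [2.5, 97.5])     # 95% CI on the gap
print(f"95% CI for the success-rate gap: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the superiority claim has at least a basic statistical footing; a paired design is important because both agents face the same task instances.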

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper claims to establish formal bounds on utility convergence and prove guarantees from the continuous formulation via conditional flow matching, alongside a new benchmark and empirical results. No specific equations, self-citations, or derivation steps are available in the provided text that reduce the claimed predictions or bounds to inputs by construction, self-definition, or fitted renaming. The central theoretical statements are presented as independent derivations rather than tautological restatements, and the work introduces external evaluation elements that stand apart from the modeling choices.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that discrete tool-use sequences can be faithfully captured by continuous trajectories in semantic space; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Tool chaining can be represented as continuous trajectory generation within a semantic space
    This modeling choice underpins the entire FlowAgent proposal and the claimed guarantees of error attenuation and generalization.

pith-pipeline@v0.9.0 · 5458 in / 1452 out tokens · 54646 ms · 2026-05-11T01:27:04.244577+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 5 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  2. [2]

    Llm agents making agent tools

    Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelovic, and Jakob Nikolas Kather. Llm agents making agent tools. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26092–26130, 2025

  3. [3]

    A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning

    Xinzhe Li. A review of prominent paradigms for llm-based agents: Tool use (including rag), planning, and feedback learning.arXiv preprint arXiv:2406.05804, 2024

  4. [4]

LLM with Tools: A Survey

    Zhuocheng Shen. Llm with tools: A survey.arXiv preprint arXiv:2409.18807, 2024

  5. [5]

LLM-Based Agents for Tool Learning: A Survey

    Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. Llm-based agents for tool learning: A survey: W. xu et al.Data Science and Engineering, pages 1–31, 2025

  6. [6]

    Small llms are weak tool learners: A multi-llm agent

    Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. Small llms are weak tool learners: A multi-llm agent. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 16658– 16680, 2024

  7. [7]

    Flowbench: Revisiting and benchmarking workflow-guided planning for llm-based agents

    Ruixuan Xiao, Wentao Ma, Ke Wang, Yuchuan Wu, Junbo Zhao, Haobo Wang, Fei Huang, and Yongbin Li. Flowbench: Revisiting and benchmarking workflow-guided planning for llm-based agents. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 10883–10900, 2024

  8. [8]

Self-Organizing Agent Network for LLM-Based Workflow Automation

    Yiming Xiong, Jian Wang, Bing Li, Yuhan Zhu, and Yuqi Zhao. Self-organizing agent network for llm-based workflow automation.arXiv preprint arXiv:2508.13732, 2025

  9. [9]

    Agent-s: Llm agentic workflow to automate standard operating procedures

    Mandar Kulkarni. Agent-s: Llm agentic workflow to automate standard operating procedures. arXiv preprint arXiv:2503.15520, 2025

  10. [10]

    Odysseybench: Evaluating llm agents on long-horizon complex office application workflows

    Weixuan Wang, Dongge Han, Daniel Madrigal Diaz, Jin Xu, Victor Rühle, and Saravan Rajmohan. Odysseybench: Evaluating llm agents on long-horizon complex office application workflows.arXiv preprint arXiv:2508.09124, 2025

  11. [11]

    ToolRL: Reward is All Tool Learning Needs

    Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs.arXiv preprint arXiv:2504.13958, 2025

  12. [12]

    Enhancing llm-based agents via global planning and hierarchical execution,

    Junjie Chen, Haitao Li, Jingli Yang, Yiqun Liu, and Qingyao Ai. Enhancing llm-based agents via global planning and hierarchical execution.arXiv preprint arXiv:2504.16563, 2025

  13. [13]

Agents of Change: Self-Evolving LLM Agents for Strategic Planning

    Nikolas Belle, Dakota Barnes, Alfonso Amayuelas, Ivan Bercovich, Xin Eric Wang, and William Wang. Agents of change: Self-evolving llm agents for strategic planning.arXiv preprint arXiv:2506.04651, 2025

  14. [14]

    Routine: A structural planning framework for llm agent system in enterprise,

    Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, et al. Routine: A structural planning framework for llm agent system in enterprise.arXiv preprint arXiv:2507.14447, 2025

  15. [15]

Agile: A Novel Reinforcement Learning Framework of LLM Agents

    Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, and Hang Li. Agile: A novel reinforcement learning framework of llm agents.Advances in Neural Information Processing Systems, 37:5244–5284, 2024

  16. [16]

Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning

    Mingyue Cheng, Jie Ouyang, Shuo Yu, Ruiran Yan, Yucong Luo, Zirui Liu, Daoyu Wang, Qi Liu, and Enhong Chen. Agent-r1: Training powerful llm agents with end-to-end reinforcement learning.arXiv preprint arXiv:2511.14460, 2025

  17. [17]

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: a dynamic environment to evaluate prompt injection attacks and defenses for llm agents. Advances in Neural Information Processing Systems, 37:82895–82920, 2024

  18. [18]

    Easytool: Enhancing llm-based agents with concise tool instruction

    Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, and Deqing Yang. Easytool: Enhancing llm-based agents with concise tool instruction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages ...

  19. [19]

SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?

    Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, et al. Skillcraft: Can llm agents learn to use tools skillfully?arXiv preprint arXiv:2603.00718, 2026

  20. [20]

    Evaluation and benchmarking of llm agents: A survey

    Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. Evaluation and benchmarking of llm agents: A survey. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6129–6139, 2025

  21. [21]

    Secure prompt engineering patterns for cloud llm agents

    Shalini Sudarsan, Advait Patel, Charit Upadhyay, Vaishnavi Gudur, and Prashanthi Matam. Secure prompt engineering patterns for cloud llm agents. In2026 IEEE 5th International Conference on AI in Cybersecurity (ICAIC), pages 1–6. IEEE, 2026

  22. [22]

Fine-Tuning and Prompt Engineering of LLMs for the Creation of Multi-Agent AI for Addressing Sustainable Protein Production Challenges

    Alexander D Kalian, Jaewook Lee, Stefan P Johannesson, Lennart Otte, Christer Hogstrand, and Miao Guo. Fine-tuning and prompt engineering of llms, for the creation of multi-agent ai for addressing sustainable protein production challenges.arXiv preprint arXiv:2506.20598, 2025

  23. [23]

Chain-of-Thought Obfuscation Learned from Output Supervision Can Generalise to Unseen Tasks

    Nathaniel Mitrani Hadida, Sassan Bhanji, Cameron Tice, and Puria Radmard. Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks.arXiv preprint arXiv:2601.23086, 2026

  24. [24]

Catch: A Novel Data Synthesis Framework for High Therapy Fidelity and Memory-Driven Planning Chain of Thought in AI Counseling

    Mingyu Chen, Jingkai Lin, Zhaojie Chu, Xiaofen Xing, Yirong Chen, and Xiangmin Xu. Catch: A novel data synthesis framework for high therapy fidelity and memory-driven planning chain of thought in ai counseling.arXiv preprint arXiv:2509.25733, 2025

  25. [25]

    From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678, 2025

  26. [26]

AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning

Yifan Wei, Xiaoyan Yu, Yixuan Weng, Tengfei Pan, Angsheng Li, and Li Du. AutoTIR: autonomous tools integrated reasoning via reinforcement learning. arXiv preprint arXiv:2507.21836, 2025

  27. [27]

    Unitoolbench: A benchmark for tool-augmented llms in cross-domain, universal task automation

    Xiaojie Guo, Yang Zhang, Bing Zhang, Ryo Kawahara, Mikio Takeuchi, and Yada Zhu. Unitoolbench: A benchmark for tool-augmented llms in cross-domain, universal task automation. InFindings of the Association for Computational Linguistics: EACL 2026, pages 4726–4736, 2026

  28. [28]

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

    Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, and Pan Lu. In-the-flow agentic system optimization for effective planning and tool use.arXiv preprint arXiv:2510.05592, 2025

  29. [29]

    Flowagent: Achieving compliance and flexibility for workflow agents

    Yuchen Shi, Siqi Cai, Zihan Xu, Yuei Qin, Gang Li, Hang Shao, Jiawei Chen, Deqing Yang, Ke Li, and Xing Sun. Flowagent: Achieving compliance and flexibility for workflow agents. arXiv preprint arXiv:2502.14345, 2025

  30. [30]

Coarse-to-Fine Grounded Memory for LLM Agent Planning

Wei Yang, Jinwei Xiao, Hongming Zhang, Qingyang Zhang, Yanna Wang, and Bo Xu. Coarse-to-fine grounded memory for llm agent planning. arXiv preprint arXiv:2508.15305, 2025

  31. [31]

    Deepagent: A general reasoning agent with scalable toolsets

    Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, et al. Deepagent: A general reasoning agent with scalable toolsets. InProceedings of the ACM Web Conference 2026, pages 2219–2230, 2026

  32. [32]

Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving

Chengying Huan, Ziheng Meng, Yongchao Liu, Zhengyi Yang, Yun Zhu, Yue Yun, Shipeng Li, Rong Gu, Xiabao Wu, Haitao Zhang, et al. Scaling graph chain-of-thought reasoning: a multi-agent framework with efficient llm serving. arXiv preprint arXiv:2511.01633, 2025

  33. [33]

    Task planning and decision-making methods for intelligent agents based on large language models

    Xiang Li. Task planning and decision-making methods for intelligent agents based on large language models. InProceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing, pages 817–822, 2025

  34. [34]

    Self-corrective task planning by inverse prompting with large language models

    Jiho Lee, Hayun Lee, Jonghyeon Kim, Kyungjae Lee, and Eunwoo Kim. Self-corrective task planning by inverse prompting with large language models. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 11017–11023. IEEE, 2025

  35. [35]

CorrectionPlanner: Self-Correction Planner with Reinforcement Learning in Autonomous Driving

    Yihong Guo, Dongqiangzi Ye, Sijia Chen, Anqi Liu, and Xianming Liu. Correctionplanner: Self-correction planner with reinforcement learning in autonomous driving.arXiv preprint arXiv:2603.15771, 2026

  36. [36]

AgentSquare: Automatic LLM Agent Search in Modular Design Space

    Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. Agentsquare: Automatic llm agent search in modular design space.arXiv preprint arXiv:2410.06153, 2024

  37. [37]

From Fragmentation to Systematic Design: Architecting LLM-Based Multi-Agent Systems

    Caishen Zhou, Yihong Tang, Kehai Chen, Xuefeng Bai, Shuhan Qi, Li Shen, and Min Zhang. From fragmentation to systematic design: Architecting llm-based multi-agent systems.Authorea Preprints, 2026

  38. [38]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  39. [39]

    Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation

    Mengkang Hu, Pu Zhao, Can Xu, Qingfeng Sun, Jian-Guang Lou, Qingwei Lin, Ping Luo, and Saravan Rajmohan. Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, pages 496–507, 2025

  40. [40]

    $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025

  41. [41]

ML-Tool-Bench: Tool-Augmented Planning for ML Tasks

Yaswanth Chittepu, Raghavendra Addanki, Tung Mai, Anup Rao, and Branislav Kveton. ML-Tool-Bench: tool-augmented planning for ML tasks. arXiv preprint arXiv:2512.00672, 2025

  42. [42]

    Api-bank: A comprehensive benchmark for tool-augmented llms

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 3102–3116, 2023

  43. [43]

GTA: A Benchmark for General Tool Agents

Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, and Xinyi Le. GTA: a benchmark for general tool agents. Advances in Neural Information Processing Systems, 37:75749–75790, 2024