arxiv: 2506.19500 · v2 · pith:A3QYWFLZ · submitted 2025-06-24 · 💻 cs.AI · cs.CL · cs.LG

NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration

Yan Jiang, Hao Zhou, LiZhong GU, Ai Han, Tianlong Li

Pith reviewed 2026-05-19 08:08 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.LG

keywords LLM agents, tool orchestration, bilevel planning, tool navigation graph, function calling, scalability, agent architecture, tool ecosystem


The pith

NaviAgent uses bilevel planning on a tool navigation graph to orchestrate thousands of interdependent tools without error buildup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models that call tools one step at a time run into accumulating errors once tools start depending on one another and the total count reaches thousands. The paper introduces NaviAgent, which splits the work into two levels: an upper level where the model chooses broad actions such as invoking a whole toolchain or asking for clarification, and a lower level that uses a graph of tool relations to pick the actual sequence. A Tool World Navigation Model updates itself from real execution feedback to keep those relations accurate. If the separation works, agents could manage far larger tool collections while staying reliable, turning ad-hoc function calling into systematic navigation of complex ecosystems.

Core claim

The paper claims that modeling the tool set as a navigation graph and maintaining a continuously evolving Tool World Navigation Model that encodes structural and behavioral relations among tools allows the agent to generate scalable invocation sequences. At the planning level the model decides among direct answers, clarification, toolchain use, or output execution; at the execution level the navigation model guides concrete calls. Experiments show this architecture attains the highest success rates across models and tasks, with the navigation model adding gains of up to 17 points on complex tasks.
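The split described above can be pictured with a small dispatch sketch. It is only an illustration under stated assumptions: the names `Action`, `Decision`, `handle`, and the `navigator` callable are hypothetical and do not come from the paper, and the planning decision itself (which an LLM would make) is passed in ready-made.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Action(Enum):
    """The four planning-level choices named in the summary (labels are hypothetical)."""
    DIRECT_ANSWER = "direct_answer"
    CLARIFY = "clarify"
    INVOKE_TOOLCHAIN = "invoke_toolchain"
    EXECUTE = "execute"


@dataclass
class Decision:
    action: Action
    payload: str  # answer text, clarifying question, intent, or call to execute


def handle(decision: Decision, navigator: Callable[[str], List[str]]) -> str:
    """Upper level: route on the chosen action. Lower level: ask the navigator for a chain."""
    if decision.action is Action.DIRECT_ANSWER:
        return f"answer: {decision.payload}"
    if decision.action is Action.CLARIFY:
        return f"clarify: {decision.payload}"
    if decision.action is Action.INVOKE_TOOLCHAIN:
        # Only this branch touches the tool graph; the planner never walks it itself.
        chain = navigator(decision.payload)
        return "plan: " + " -> ".join(chain)
    if decision.action is Action.EXECUTE:
        return f"execute: {decision.payload}"
    raise ValueError(f"unknown action: {decision.action}")


def toy_navigator(intent: str) -> List[str]:
    """Stand-in for the execution level; returns a fixed, made-up chain."""
    return ["geocode", "forecast"]
```

Calling `handle(Decision(Action.INVOKE_TOOLCHAIN, "weather at a coordinate"), toy_navigator)` returns a plan string built from the navigator's chain, while the other three actions never consult the graph. That isolation is the point of the design: planning-level coverage does not depend on how tangled the tool dependencies are.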

What carries the argument

The Tool World Navigation Model (TWNM), a continuously updated graph encoding that captures how tools relate to one another structurally and behaviorally so the agent can plan sequences without stepping through calls one at a time.
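To make "planning a sequence on a relation graph" concrete, here is a generic sketch that is not the paper's method: the paper uses a learned navigation model with heuristic search, whereas this stand-in runs Dijkstra's algorithm over a hand-written graph. The edge values are assumed to be success probabilities, and the tool names and numbers are invented for illustration.

```python
import heapq
import math
from typing import Dict, List, Optional, Tuple

# tool -> {next tool: assumed probability that the hand-off succeeds}
Graph = Dict[str, Dict[str, float]]


def most_reliable_chain(graph: Graph, start: str, goal: str) -> Optional[Tuple[List[str], float]]:
    """Return the chain from start to goal with the highest product of edge probabilities.

    Dijkstra runs on costs -log(p), so minimising the summed cost
    maximises the product of probabilities along the path.
    """
    best = {start: 0.0}
    heap: List[Tuple[float, str, List[str]]] = [(0.0, start, [start])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return path, math.exp(-cost)
        if cost > best.get(node, math.inf):
            continue  # stale heap entry, a cheaper route was already found
        for nxt, p in graph.get(node, {}).items():
            if p <= 0.0:
                continue  # an edge that never succeeds is not usable
            new_cost = cost - math.log(p)
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return None


# Hypothetical three-tool graph: the direct edge is less reliable than the two-step route.
toy_graph: Graph = {
    "search_city": {"get_weather": 0.5, "get_coords": 0.9},
    "get_coords": {"get_weather": 0.9},
    "get_weather": {},
}
```

On `toy_graph`, the two-step route through `get_coords` wins because 0.9 × 0.9 exceeds the direct edge's 0.5. The whole chain is chosen in one pass over the graph instead of one call at a time.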

If this is right

Task success rates are the highest across the language models and task difficulties tested.
Complex multi-tool workflows show measurable gains once the navigation model is active.
Closed-loop updates from real executions improve both planning and execution over time.
Agent behavior shifts from isolated tool calls to adaptive navigation of an entire tool ecosystem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bilevel split could be applied to other settings where many components must be composed, such as library selection in code generation or service chaining in cloud workflows.
Maintaining an explicit relation graph may reduce the cognitive load placed on the language model itself during long-horizon planning.
If the navigation model can be kept accurate at scale, the approach suggests a route toward agents that treat tool use as graph search rather than sequential guessing.

Load-bearing premise

Feedback from actual tool runs can keep updating the navigation model so that it continues to represent relations among thousands of tools accurately and without adding new sources of error or hitting scaling limits.

What would settle it

A controlled test that increases the tool count from hundreds to several thousand while tracking whether success rates stay above step-by-step baselines or begin to fall once the navigation model has received the same volume of real feedback.

Figures

Figures reproduced from arXiv: 2506.19500 by Ai Han, Hao Zhou, LiZhong GU, Tianlong Li, Yan Jiang.

**Figure 2.** Case Study of the Collaboration Workflow. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]

**Figure 3.** Evaluation of Frameworks on ToolBench Across Task Complexity. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

**Figure 4.** Effect of SFT on TSR. Adaptability through Fine-tuning. Notably, with supervised fine-tuning, the smaller Qwen2.5-14B model achieves performance comparable to the larger 32B model (TCR 81.8% vs 83.7%, TSR 49.5% vs 44.9%, see [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

**Figure 5.** Comparison of TSR Distribution Between Multi-Path Decider and Baselines. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png]

**Figure 6.** Alpha-Beta Backward Pruning Heuristic Graph Search with Dynamic Pruning. The algorithm is parameterized by a sextuple (T_0, η, P, d_max, M_θ, F_ω) (see Algorithm 2), where T_0 = 300 is the initial temperature that determines the probability of accepting suboptimal solutions and balances exploration and exploitation, η = 0.7 is the cooling rate that controls the annealing schedule T_{k+1} = η^{1+k/5} T_k, P = 40 is the …

**Figure 7.** Hybrid Heuristic Pruning Algorithm. C Cases: The following three cases exemplify the Decider-Navigator collaborative mechanism through four core actions executed by the Decider: 1) Direct Response: resolves user queries using pre-trained knowledge. 2) Intent Clarification: initiates interactive dialogue to disambiguate vague requests. 3) Tool Retrieval: collaborates with the Navigator module to generate a pr…

**Figure 8.** Pruned Tool Dependency Subgraph of Case1 [PITH_FULL_IMAGE:figures/full_fig_p017_8.png]

**Figure 9.** Pruned Tool Dependency Subgraph of Case2 [PITH_FULL_IMAGE:figures/full_fig_p020_9.png]

**Figure 10.** Pruned Tool Dependency Subgraph of Case3 [PITH_FULL_IMAGE:figures/full_fig_p021_10.png]
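The Figure 6 caption describes an initial temperature and a cooling rate that trade exploration against exploitation. As a rough illustration of that idea, the sketch below assumes the standard Metropolis acceptance rule and a plain geometric schedule in which each temperature is η times the previous one. Both are simplifications chosen here, not the paper's algorithm, whose schedule also depends on the step index. Only the values 300 and 0.7 come from the caption; the function names are hypothetical.

```python
import math
import random
from typing import List

T0 = 300.0  # initial temperature, value quoted in the Figure 6 caption
ETA = 0.7   # cooling rate, value quoted in the Figure 6 caption


def accept_probability(delta_cost: float, temperature: float) -> float:
    """Metropolis rule: improvements are always accepted, worse moves only sometimes."""
    if delta_cost <= 0:
        return 1.0
    return math.exp(-delta_cost / temperature)


def geometric_temperatures(t0: float, eta: float, steps: int) -> List[float]:
    """Plain geometric cooling, an assumption made for this sketch."""
    temps = []
    t = t0
    for _ in range(steps):
        temps.append(t)
        t *= eta
    return temps


def accept(delta_cost: float, temperature: float, rng: random.Random) -> bool:
    """Draw once from rng and accept the move with the Metropolis probability."""
    return rng.random() < accept_probability(delta_cost, temperature)
```

With a high temperature, a move that worsens the cost is still accepted fairly often, which keeps the search exploring. As the temperature falls, the same move is accepted less and less, so the search settles on the paths already found to be good.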

Original abstract

Large language models (LLMs) have recently demonstrated the ability to act as function call agents by invoking external tools, enabling them to solve tasks beyond their static knowledge. However, existing agents typically call tools step by step at a time without a global view of task structure. As tools depend on each other, this leads to error accumulation and limited scalability, particularly when scaling to thousands of tools. To address these limitations, we propose NaviAgent, a novel bilevel architecture that decouples task planning from tool execution through graph-based modeling of the tool ecosystem. At the task-planning level, the LLM-based agent decides whether to respond directly, clarify user intent, invoke a toolchain, or execute tool outputs, ensuring broad coverage of interaction scenarios independent of inter-tool complexity. At the execution level, a continuously evolving Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, guiding the agent to generate scalable and robust invocation sequences. By incorporating feedback from real tool interactions, NaviAgent supports closed-loop optimization of planning and execution, moving beyond tool calling toward adaptive navigation of large-scale tool ecosystems. Experiments show that NaviAgent achieves the best task success rates across models and tasks, and integrating TWMN further boosts performance by up to 17 points on complex tasks, underscoring its key role in toolchain orchestration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

NaviAgent's bilevel graph planning with an evolving TWNM targets real scalability issues in tool-using agents, but the performance claims need full experimental details to assess.

Hi, the core takeaway is that this paper puts forward a bilevel architecture for LLM agents that separates high-level task decisions from low-level tool sequencing on a navigation graph, with a feedback-updated Tool World Navigation Model meant to handle thousands of interdependent tools without the usual error buildup. That framing directly tackles a practical bottleneck in current step-by-step agents.

What is new here is the explicit split between planning (deciding to respond, clarify, or call a toolchain) and execution guided by the TWNM, which encodes structural and behavioral tool relations and refines itself from real interactions. The abstract positions this as moving beyond isolated tool calls toward adaptive navigation of large ecosystems.

It does a solid job naming the problems of error accumulation and limited scalability when tools depend on each other, and the bilevel design tries to keep coverage broad regardless of inter-tool complexity. The reported gains, including up to 17-point lifts on complex tasks, suggest the approach can improve success rates across models.

The soft spots sit mostly in the evidence. The abstract states best-in-class results and the TWNM boost but gives no baselines, task definitions, model sizes, or ablation breakdowns, so it is hard to judge whether the numbers reflect the architecture or something else. Maintaining an accurate TWNM at scale could also introduce its own maintenance overhead or fresh error sources, and the paper would need to show how that is managed.

This work is aimed at researchers and engineers building agent systems that integrate many external tools in production settings. A reader focused on practical orchestration or planning layers could pick up useful architectural ideas even before the numbers are fully verified. I would send it to peer review so the experiments and any code can be checked properly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes NaviAgent, a bilevel architecture for LLM tool agents that decouples high-level task planning (deciding to respond, clarify, invoke a toolchain, or execute outputs) from low-level execution. A continuously evolving Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools via real-interaction feedback to enable scalable, robust invocation sequences on large tool graphs. Experiments are claimed to show that NaviAgent achieves the best task success rates across models and tasks, with TWNM integration yielding up to 17-point gains on complex tasks.

Significance. If the reported performance improvements hold under rigorous evaluation, the bilevel graph-navigation approach could meaningfully advance scalable tool orchestration for LLM agents by mitigating error accumulation and providing closed-loop adaptation, addressing a recognized bottleneck when tool counts reach thousands.

major comments (2)

[Experiments] Experiments section: the central claim of 'best task success rates' and 'up to 17-point boosts' is load-bearing yet unsupported by any reported baselines, metrics, task definitions, number of runs, or statistical tests in the provided text, preventing verification that the data actually supports superiority over prior agents.
[§3] §3 (TWNM description): the claim that feedback from real tool interactions allows the model to 'accurately encode structural and behavioral relations among thousands of interdependent tools without introducing new error sources' lacks a concrete update rule, graph construction algorithm, or scalability analysis, making the weakest assumption untestable from the manuscript.

minor comments (2)

[Abstract] Abstract, final sentence: 'TWMN' appears to be a typo for 'TWNM'.
[§2] Notation: the distinction between 'toolchain' and 'tool invocation sequence' is used without a formal definition or diagram, which could be clarified for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of the bilevel graph-navigation approach to address scalability in large tool ecosystems. We address each major comment below and will perform a major revision to strengthen the manuscript.


Referee: [Experiments] Experiments section: the central claim of 'best task success rates' and 'up to 17-point boosts' is load-bearing yet unsupported by any reported baselines, metrics, task definitions, number of runs, or statistical tests in the provided text, preventing verification that the data actually supports superiority over prior agents.

Authors: We agree that the Experiments section in the current manuscript lacks sufficient detail to allow full verification of the claims. In the revised version, we will expand this section to include: explicit descriptions of all baselines (ReAct, Plan-and-Execute, Toolformer, and other relevant agents), precise definitions of metrics (task success rate as primary, with secondary metrics such as average tool calls and error rate), task definitions and datasets (ToolBench, API-Bank, and our custom large-scale tool graph with 1000+ tools), number of runs (5 independent runs with different random seeds), and statistical analysis (paired t-tests with p-values and confidence intervals). We will also include tables reporting raw success rates with standard deviations to substantiate the up to 17-point gains from TWNM integration. revision: yes
Referee: [§3] §3 (TWNM description): the claim that feedback from real tool interactions allows the model to 'accurately encode structural and behavioral relations among thousands of interdependent tools without introducing new error sources' lacks a concrete update rule, graph construction algorithm, or scalability analysis, making the weakest assumption untestable from the manuscript.

Authors: We acknowledge that Section 3 would benefit from greater concreteness. In the revision, we will add: (1) the precise update rule for real-interaction feedback (an incremental edge-weight update formula based on observed success/failure and co-invocation frequency), (2) the graph construction algorithm (nodes as tools with feature vectors, directed edges initialized from API documentation and refined via execution traces using a thresholded dependency score), and (3) a scalability analysis (O(n log n) update complexity per interaction batch with empirical curves for tool counts from 100 to 5000, plus memory footprint measurements). These additions will make the claim that the closed-loop mechanism avoids new error sources directly testable. revision: yes
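The simulated rebuttal promises an incremental edge-weight update and a thresholded dependency score but does not state either formula. Purely to show what such a rule could look like, the sketch below keeps per-edge counts and uses a Laplace-smoothed success rate as the weight. The class `EdgeStats`, the smoothing, and the threshold of 0.5 are all assumptions made for illustration, not the authors' rule.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (upstream tool, downstream tool)


class EdgeStats:
    """Running success statistics for tool-to-tool hand-offs (hypothetical design)."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.trials: Dict[Edge, int] = defaultdict(int)
        self.successes: Dict[Edge, int] = defaultdict(int)
        self.threshold = threshold

    def observe(self, src: str, dst: str, ok: bool) -> None:
        """Fold one execution outcome into the counts for edge src -> dst."""
        edge = (src, dst)
        self.trials[edge] += 1
        if ok:
            self.successes[edge] += 1

    def weight(self, src: str, dst: str) -> float:
        """Laplace-smoothed success rate; an unseen edge scores a neutral 0.5."""
        edge = (src, dst)
        return (self.successes.get(edge, 0) + 1) / (self.trials.get(edge, 0) + 2)

    def kept_edges(self) -> List[Edge]:
        """Observed edges whose weight clears the threshold, in sorted order."""
        return sorted(e for e in self.trials if self.weight(*e) >= self.threshold)
```

Each `observe` call costs constant time, so the update itself is cheap. Whether such counts stay accurate across thousands of tools with sparse feedback is the open question the referee raises.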

Circularity Check

0 steps flagged

No significant circularity; architecture and experiments are self-contained


The paper proposes NaviAgent as a bilevel architecture decoupling task planning from tool execution via a graph-based Tool World Navigation Model (TWNM) that evolves from real-interaction feedback. Central claims rest on this design choice and reported experimental success rates (including up to 17-point gains on complex tasks), without any equations, fitted parameters renamed as predictions, self-citations invoked for uniqueness theorems, or ansatzes smuggled in. No derivation step reduces by construction to its own inputs; the work is an architectural proposal validated externally by experiments rather than a closed mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on standard assumptions that LLMs can perform high-level planning and that tool dependencies can be modeled as a navigable graph updated via interaction feedback; no free parameters or invented entities beyond the TWNM are explicitly quantified in the abstract.

axioms (2)

domain assumption: LLMs can reliably decide among broad interaction scenarios (respond, clarify, invoke toolchain, execute outputs) independent of inter-tool complexity.
Invoked in the task-planning level description.

domain assumption: Structural and behavioral relations among tools can be encoded in a continuously evolving graph model that guides scalable invocation sequences.
Central to the execution-level TWNM.

invented entities (1)

Tool World Navigation Model (TWNM) · no independent evidence
purpose: Encodes structural and behavioral relations among tools to guide invocation sequences and support closed-loop optimization.
Newly introduced component that decouples execution from planning.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

NaviAgent features a bilevel planning architecture that integrates a Multi-Path Decider and a Graph-Encoded Navigator... constructs and navigates a Tool Dependency Heterogeneous Graph (TDHG)

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

The Graph-Encoded Navigator... hybrid loss... heuristic search strategy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · 7 internal anchors

[1]

Talm: Tool augmented language models

Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022

work page arXiv 2022
[2]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

work page 2023
[3]

Towards tool use alignment of large language models

Zhi-Yuan Chen, Shiqi Shen, Guangyao Shen, Gong Zhi, Xu Chen, and Yankai Lin. Towards tool use alignment of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1382–1400, 2024

work page 2024
[4]

Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023

work page 2023
[5]

Gpt4tools: Teaching large language model to use tools via self-instruction

Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. Gpt4tools: Teaching large language model to use tools via self-instruction. Advances in Neural Information Processing Systems, 36:71995–72007, 2023

work page 2023
[6]

Tool learning with large language models: A survey

Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. Tool learning with large language models: A survey. Frontiers of Computer Science, 19(8):198343, 2025

work page 2025
[7]

Chameleon: Plug-and-play compositional reasoning with large language models

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36:43447–43478, 2023

work page 2023
[8]

Toolverifier: Generalization to new tools via self-verification

Dheeraj Mekala, Jason Weston, Jack Lanchantin, Roberta Raileanu, Maria Lomeli, Jingbo Shang, and Jane Dwivedi-Yu. Toolverifier: Generalization to new tools via self-verification. arXiv preprint arXiv:2402.14158, 2024

work page arXiv 2024
[9]

Confucius: Iterative tool learning from introspection feedback by easy-to- difficult curriculum

Shen Gao, Zhengliang Shi, Minghang Zhu, Bowen Fang, Xin Xin, Pengjie Ren, Zhumin Chen, Jun Ma, and Zhaochun Ren. Confucius: Iterative tool learning from introspection feedback by easy-to- difficult curriculum. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18030–18038, 2024

work page 2024
[10]

Multi-agent systems: A survey about its components, framework and workflow

Diego Maldonado, Edison Cruz, Jackeline Abad Torres, Patricio J Cruz, and Silvana Gamboa. Multi-agent systems: A survey about its components, framework and workflow. IEEE Access, 2024

work page 2024
[11]

Ai agents: Evolution, architecture, and real-world applications

Naveen Krishnan. Ai agents: Evolution, architecture, and real-world applications. arXiv preprint arXiv:2503.12687, 2025

work page arXiv 2025
[12]

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of llms. arXiv preprint arXiv:2501.06322, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Appagentx: Evolving gui agents as proficient smartphone users

Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. Appagentx: Evolving gui agents as proficient smartphone users. arXiv preprint arXiv:2503.02268, 2025. 10

work page arXiv 2025
[14]

Exploring autonomous agents through the lens of large language models: A review

Saikat Barua. Exploring autonomous agents through the lens of large language models: A review. arXiv preprint arXiv:2404.04442, 2024

work page arXiv 2024
[15]

Chain of tools: Large language model is an automatic multi-tool learner

Zhengliang Shi, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen, Suzan Verberne, and Zhaochun Ren. Chain of tools: Large language model is an automatic multi-tool learner. arXiv preprint arXiv:2405.16533, 2024

work page arXiv 2024
[16]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Toolchain*: Efficient action space navigation in large language models with a* search

Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor Bursztyn, Ryan A Rossi, Somdeb Sarkhel, and Chao Zhang. Toolchain*: Efficient action space navigation in large language models with a* search. arXiv preprint arXiv:2310.13227, 2023

work page arXiv 2023
[18]

Toolnet: Connecting large language models with massive tools via tool graph

Xukun Liu, Zhiyuan Peng, Xiaoyuan Yi, Xing Xie, Lirong Xiang, Yuchen Liu, and Dongkuan Xu. Toolnet: Connecting large language models with massive tools via tool graph. arXiv preprint arXiv:2403.00839, 2024

work page arXiv 2024
[19]

Cold-start recommendation towards the era of large language models (llms): A comprehensive survey and roadmap

Weizhi Zhang, Yuanchen Bei, Liangwei Yang, Henry Peng Zou, Peilin Zhou, Aiwei Liu, Yinghui Li, Hao Chen, Jianling Wang, Yu Wang, et al. Cold-start recommendation towards the era of large language models (llms): A comprehensive survey and roadmap. arXiv preprint arXiv:2501.01945, 2025

work page arXiv 2025
[20]

Llmtreerec: Unleashing the power of large language models for cold-start recommendations

Wenlin Zhang, Chuhan Wu, Xiangyang Li, Yuhao Wang, Kuicai Dong, Yichao Wang, Xinyi Dai, Xiangyu Zhao, Huifeng Guo, and Ruiming Tang. Llmtreerec: Unleashing the power of large language models for cold-start recommendations. arXiv preprint arXiv:2404.00702, 2024

work page arXiv 2024
[21]

Can graph learning improve planning in llm-based agents? InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

Xixi Wu, Yifei Shen, Caihua Shan, Kaitao Song, Siwei Wang, Bohang Zhang, Jiarui Feng, Hong Cheng, Wei Chen, Yun Xiong, et al. Can graph learning improve planning in llm-based agents? InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[22]

arXiv preprint arXiv:2401.06201

Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Ren Kan, Dongsheng Li, and De- qing Yang. Easytool: Enhancing llm-based agents with concise tool instruction. arXiv preprint arXiv:2401.06201, 2024

work page arXiv 2024
[23]

Concise and precise context compression for tool-using language models

Yang Xu, Yunlong Feng, Honglin Mu, Yutai Hou, Yitong Li, Xinghao Wang, Wanjun Zhong, Zhongyang Li, Dandan Tu, Qingfu Zhu, et al. Concise and precise context compression for tool-using language models. arXiv preprint arXiv:2407.02043, 2024

work page arXiv 2024
[24]

Small llms are weak tool learners: A multi-llm agent, 2024

Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. Small llms are weak tool learners: A multi-llm agent. arXiv preprint arXiv:2401.07324, 2024

work page arXiv 2024
[25]

Making language models better tool learners with execution feedback

Shuofei Qiao, Honghao Gui, Chengfei Lv, Qianghuai Jia, Huajun Chen, and Ningyu Zhang. Making language models better tool learners with execution feedback. arXiv preprint arXiv:2305.13068, 2023

work page arXiv 2023
[26]

Toolfactory: Automating tool generation by leveraging llm to understand rest api documentations

Xinyi Ni, Qiuyang Wang, Yukun Zhang, and Pengyu Hong. Toolfactory: Automating tool generation by leveraging llm to understand rest api documentations. arXiv preprint arXiv:2501.16945, 2025

work page arXiv 2025
[27]

Re- act: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. Re- act: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

work page 2023
[28]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

work page 2023
[29]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023

work page 2023
[30]

Controlllm: Augment language models with tools by searching on graphs

Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, et al. Controlllm: Augment language models with tools by searching on graphs. In European Conference on Computer Vision, pages 89–105. Springer, 2024

work page 2024
[31]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024. 11

work page 2024
[32]

Inductive representation learning on large graphs

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017

work page 2017
[33]

Link prediction based on graph neural networks

Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. Advances in neural information processing systems, 31, 2018

work page 2018
[34]

Graph neural networks: A review of methods and applications

Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI open, 1:57–81, 2020

work page 2020
[35]

An analysis of alpha-beta pruning

Donald E Knuth and Ronald W Moore. An analysis of alpha-beta pruning. Artificial intelligence, 6(4):293–326, 1975

work page 1975
[36]

Optimization by simulated annealing

Scott Kirkpatrick, C Daniel Gelatt Jr, and Mario P Vecchi. Optimization by simulated annealing. science, 220(4598):671–680, 1983

work page 1983
[37]

Genetic algorithms in machine learning

Jonathan Shapiro. Genetic algorithms in machine learning. In Advanced course on artificial intelligence, pages 146–168. Springer, 1999

work page 1999
[38]

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Saad Tahmid and Sourav Sarker. Qwen2. 5-32b: Leveraging self-consistent tool-integrated reasoning for bengali mathematical olympiad problem solving. arXiv preprint arXiv:2411.05934, 2024

work page arXiv 2024
[41]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE transactions on medical imaging, 35(5):1299–1312, 2016

Nima Tajbakhsh, Jae Y Shin, Suryakanth R Gurudu, R Todd Hurst, Christopher B Kendall, Michael B Gotway, and Jianming Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE transactions on medical imaging, 35(5):1299–1312, 2016

[45]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020.

A Graph Construction

[Figure labels: Original API, Original Parameter, ...]

Gastroenteritis Inquiry ACTION: Direct Response CONTEXT: { "analysis": "1. Common sense judgment: Gastroenteritis is a medical condition within general knowledge. 2. No API required for definition explanation. 3. Parameter verification not applicable", "action": "direct_answer", "intent": "1. Provide medical definition of gastroenteritis 2. User’s explici...

Health Data Logging ACTION: Tool Retrieval CONTEXT: { "analysis": "1. Health data recording requires specialized API 2. No matching API in candidate list 3. Critical parameters provided: user_id=12345, blood_pressure=130/80, heart_rate=90, timestamp=2025-03-05", "action": "retrieve_api", "intent": "1. Store user health metrics 2. Provided parameters: bloo...

Successful data storage confirmation", "action": "direct_answer", "answer": "Your health data has been successfully recorded..." }

Last Round: Response Synthesis
ACTION: Direct Response
CONTEXT: 1. Inflammation of gastrointestinal tract. Common causes: Viral/bacterial infections. Key symptoms: Diarrhea, vomiting, abdominal cramps. Management: Hydration and rest.

Health Data Record Confirmation: All parameters validated successfully through api-RecordHealthData. No data integrity issues detected.

C.2 Case 2

Query
• Is Beijing in China? What is the current weather forecast for the location at latitude 40.7128 and longitude -74.0060 in metric units?

First Round: Intent Analysis

Common Sense ACTION: Direct Response CONTEXT: {’dependency_rank’: 1, ’intent’: ’Is Beijing in China?’, ’analysis’: ’1. The specific problem addressed by this action is to determine if Beijing is located within China. 2. The user input directly asks about the geographical location of Beijing. 3. This is a common sense question and does not require an API c...

Get Weather ACTION: Tool Retrieval CONTEXT: {’dependency_rank’: 2, ’intent’: ’What is the current weather forecast for the location at latitude 40.7128 and longitude -74.0060 in metric units?’, ’analysis’: ’1. The specific problem addressed by this action is to retrieve the current weather forecast for a specific location using latitude and longitude coor...

The target API ’api-current_weather_data_of_a_location_for_weather’ has already been successfully called and its result is available. 4. The required parameters (lat and lon) are complete and the API call has been confirmed successful.", ’action’: ’direct_answer’, ’answer’: ’The current weather in New York is clear sky with a temperature of 11.0°C, feels ...

Intent Clarification ACTION: Intent Clarification CONTEXT: { "analysis": "1. Weather data needs to be queried in real time -> not common sense\n2. Required parameter (location) is missing", "action": "clarify_intent", "recall_description": "", "answer": "Which city do you want to query tomorrow’s weather?" }

User Answer: I’m in BeiJing.

Second Round: Intent Analysis

Get Weather ACTION: Tool Retrieval CONTEXT: {’dependency_rank’: 1, ’intent’: ’1. Query the weather forecast for tomorrow in Beijing 2. Extract location: Beijing and time: tomorrow from user input’, ’analysis’: ’1. The specific problem addressed by this action is to retrieve the weather forecast for tomorrow in Beijing. 2. The user input directly provides ...

Core Requirements:
- Generate a natural-language question where:
  • Must explicitly contain initial parameters for leaf-node APIs
  • Implicitly requires chained API calls from leaf to root node
  • Root node API’s output directly resolves the user’s problem

Dependency Chain Rules:
- Build parameter-passing paths where:
  • Parent API outputs must exactly match child API inputs (same parameter names & data types)
  • Root node API must be called last in the chain
  • The output of every leaf-node API must be utilized in downstream APIs or final results
  • All input values must originate from either:
    - Explicitly stated in the question context
    - Generated by previous API outputs (no synthetic values)

Parameter Constraints:
- Enforce strict value inheritance:
  • Path/query parameters must use verbatim values from:
    - User’s question text
    - Preceding API response.data fields
  • Prohibit value transformation/format conversion
- Root API output must contain realistic values matching its schema

Validation Requirements:
- Reject generation if:
  • Missing parameter dependency between APIs
  • Input sources can’t be traced to question/prior responses
  • Output fields don’t fulfill next API’s input requirements
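The traceability requirement above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it assumes a generated sample uses the `query` and `call_chains` fields from the response structure, and it treats an input value as traceable if it appears verbatim in the query text or equals a field of an earlier call's `output.data`. The function name `untraceable_inputs` and the example APIs `geocode` and `weather` are hypothetical.

```python
from typing import Any


def untraceable_inputs(query: str, call_chains: list[dict[str, Any]]) -> list[tuple[str, str]]:
    """Return (api_name, param) pairs whose values cannot be traced.

    A value counts as traceable if it appears verbatim in the query text
    or equals a field in the output data of an earlier call in the chain.
    """
    seen_outputs: list[Any] = []
    problems: list[tuple[str, str]] = []
    for call in call_chains:
        for param, value in call.get("input", {}).items():
            in_query = str(value) in query
            in_prior = value in seen_outputs
            if not (in_query or in_prior):
                problems.append((call["api_name"], param))
        # Outputs become available only to calls later in the chain.
        seen_outputs.extend(call.get("output", {}).get("data", {}).values())
    return problems


# Hypothetical two-step chain: the "units" value is neither in the query
# nor produced by the earlier call, so it is reported.
example_query = "What is the weather in Beijing?"
example_chain = [
    {"api_name": "geocode", "input": {"city": "Beijing"},
     "output": {"status": "success", "data": {"lat": 39.9, "lon": 116.4}}},
    {"api_name": "weather", "input": {"lat": 39.9, "lon": 116.4, "units": "metric"},
     "output": {"status": "success", "data": {"summary": "clear sky"}}},
]
print(untraceable_inputs(example_query, example_chain))
```

A sample would be rejected whenever the returned list is non-empty. Matching on exact values mirrors the prohibition on value transformation in the parameter constraints.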

Response Structure: { "query": "<Real-world scenario requiring sequential API calls>", "answer": "<Solution derived from root API output>", "call_chains": [ { "api_name": "<Leaf-node API>", "input": { "<param>": "<value explicitly stated in user query or previous API output>" }, "output": { "status": "success", "data": {"<field>": "<output used by next...

**Intent Analysis**
- Decompose compound requests into independent ordered sub-intents
  • Sequential dependencies first, must execute in declared order
  • Parallelizable sub-intents last
  • Dependency_rank numbering for ordered execution
- Validate parallel execution eligibility:
  • No overlapping data requirements
  • No sequential dependencies
  • Distinct para...

**Atomic Action Formation**
• For each validated sub-intent:
  - Create self-contained decision unit, action must implement full Decision Logic Flow
  - Maintain state separation between parallel processes
  - Focus analysis scope per sub-intent
  - Each action’s analysis focuses only on its own intent
  - Each action analysis only solves one intent
  - Must execute ...

**Common Sense Judgment Phase**
- Input question -> Knowledge base matching
  Belongs to common sense -> action=direct_answer
  Requires external data -> Proceed to Phase 2

**API Matching Phase**

If candidate_apis is empty -> action=retrieve_api

Match intent with API list:
API prioritization:
- Complete parameters from user input
- Minimal missing parameters
- Shortest dependency chain
API matching success:
- Validate Observation in user input to confirm target API success:
  -> If successful -> action=direct_answer
  -> No explicit success indication:
     a) Complete parameters -> action=call_api (execu...

**Parameter Completion Phase**
- Check required parameter set:
  All parameters ready -> action=call_api
  The target API does not require parameters -> action=call_api
  Missing parameters exist:
    a) Can be completed via dependent APIs -> Execute Rule 3.1
    b) Use retrieval APIs to resolve parameter deficiencies in API dependencies -> action=retrieve_api
    c) Requires...
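The phases above amount to an ordered series of checks. A minimal sketch of that ordering follows, not the paper's prompt logic itself. The function name and its boolean inputs are hypothetical. Falling back to `clarify_intent` when missing parameters cannot be resolved through another API is an assumption, suggested by the clarification case in Appendix C, because the corresponding rule is truncated in this text. The "Execute Rule 3.1" branch is omitted because that rule is not shown.

```python
def choose_action(
    is_common_sense: bool,
    candidate_apis: list[str],
    target_already_succeeded: bool,
    missing_params: list[str],
    resolvable_by_other_api: bool,
) -> str:
    """Pick one of the four actions by walking the phases in order."""
    # Common Sense Judgment Phase
    if is_common_sense:
        return "direct_answer"
    # API Matching Phase: nothing to match against yet
    if not candidate_apis:
        return "retrieve_api"
    # The observation already confirms the target API succeeded
    if target_already_succeeded:
        return "direct_answer"
    # Parameter Completion Phase
    if not missing_params:
        return "call_api"
    if resolvable_by_other_api:
        return "retrieve_api"
    # Assumption: ask the user when nothing else can supply the value
    return "clarify_intent"


# Hypothetical weather query with no location given and no API able to supply it.
print(choose_action(False, ["get_weather"], False, ["location"], False))
```

Keeping the checks in a fixed order means an earlier phase always takes precedence, which matches the "proceed to the next phase" wording of the prompt.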

<extract data segments directly related to the subtask from user input>", "analysis": "<Four-level reasoning: 1.Explicitly state the specific decision-making sub-intent addressed by this action 2.Common sense judgment basis 3.API matching logic (if applicable) 4.Parameter completeness verification>", "action": "call_api|direct_answer|retrieve_api|clarify_...

Parameter names must strictly match API documentation

The ’answer’ field for clarify_intent must contain question words

Prioritize calling parent node APIs

When action in [retrieve_api]:
- The recall_description field serves exclusively as an API retrieval identifier from predefined repositories.
- Parameter descriptions must distinguish between input and output parameters, retaining only essential parameters.
- Each recall_description can only recall one API; multiple APIs require multiple actions.

APIs absent from Candidate APIs MUST NOT be invented

action=call_api is permitted only when candidate APIs exist and the target_api is present in the candidate APIs

The "action" field must be strictly limited to one of the following four predefined operation types: call_api, direct_answer, retrieve_api, or clarify_intent
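Constraints like these are easy to enforce on the model's JSON output after generation. The sketch below is illustrative only: it assumes the decision is a dictionary with `action` and `target_api` keys, as the prompt text suggests, and the function name `check_action` and the example API names are hypothetical.

```python
ALLOWED_ACTIONS = {"call_api", "direct_answer", "retrieve_api", "clarify_intent"}


def check_action(decision: dict, candidate_apis: list[str]) -> list[str]:
    """Return a list of constraint violations (empty if the decision is valid)."""
    errors: list[str] = []
    action = decision.get("action")
    if action not in ALLOWED_ACTIONS:
        errors.append(f"unknown action: {action!r}")
    if action == "call_api":
        if not candidate_apis:
            errors.append("call_api requires a non-empty candidate API list")
        elif decision.get("target_api") not in candidate_apis:
            # Guards against invented APIs that are absent from the candidates.
            errors.append("target_api is not among the candidate APIs")
    return errors


# Hypothetical decision that names an API outside the candidate list.
print(check_action({"action": "call_api", "target_api": "api-Unknown"},
                   ["api-RecordHealthData"]))
```

A caller could re-prompt the model or fall back to `retrieve_api` whenever the returned list is non-empty.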

Use retrieve_api only when:
- Required parameters unavailable in call_api action

Use call_api only when:
- The target_api is not in the list of successfully executed APIs
---------
# Candidate API Information:

E.3.2 Input Generation Prompts

Input generation prompts: Integrate current queries with observational data to formulate the final input, ensuring informational completeness.

User input:{user_input}\nPlease generate the final response based on the following data: {observation} : Requirements:

Integrate all available data

Indicate data limitations (if any failed APIs exist)

Use natural and fluent English

E.3.3 API Simulator Prompts

API simulator prompts are based on historical data reuse (Case1) and intelligent simulation generation (Case2/3). They achieve automated emulation of API chains through standardized JSON responses. The priority strategy is as follows: historical matching > structural cloning > contextual simulat...
