arxiv: 2506.19500 · v2 · pith:A3QYWFLZ · submitted 2025-06-24 · 💻 cs.AI · cs.CL · cs.LG

NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration

Pith reviewed 2026-05-19 08:08 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords LLM agents · tool orchestration · bilevel planning · tool navigation graph · function calling · scalability · agent architecture · tool ecosystem

The pith

NaviAgent uses bilevel planning on a tool navigation graph to orchestrate thousands of interdependent tools while limiting the error accumulation of step-by-step tool calling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models that call tools one step at a time run into accumulating errors once tools start depending on one another and the total count reaches thousands. The paper introduces NaviAgent, which splits the work into two levels: an upper level where the model chooses broad actions such as invoking a whole toolchain or asking for clarification, and a lower level that uses a graph of tool relations to pick the actual sequence. A Tool World Navigation Model updates itself from real execution feedback to keep those relations accurate. If the separation works, agents could manage far larger tool collections while staying reliable, turning ad-hoc function calling into systematic navigation of complex ecosystems.

Core claim

The paper claims that modeling the tool set as a navigation graph and maintaining a continuously evolving Tool World Navigation Model that encodes structural and behavioral relations among tools allows the agent to generate scalable invocation sequences. At the planning level the model decides among direct answers, clarification, toolchain use, or output execution; at the execution level the navigation model guides concrete calls. Experiments show this architecture attains the highest success rates across models and tasks, with the navigation model adding gains of up to 17 points on complex tasks.
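The planning-level choice described above can be sketched as a small dispatcher. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Action` labels, the `Decision` record, and the `decide`, `navigate`, and `execute` callables are all hypothetical stand-ins. In NaviAgent the decision would come from an LLM and the toolchain from the navigation model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Action(Enum):
    # Hypothetical labels for the four planning-level choices.
    DIRECT_ANSWER = "direct_answer"
    CLARIFY = "clarify"
    INVOKE_TOOLCHAIN = "invoke_toolchain"
    EXECUTE_OUTPUT = "execute_output"


@dataclass
class Decision:
    action: Action
    payload: str  # answer text, clarifying question, tool goal, or tool name


def run_turn(
    query: str,
    decide: Callable[[str], Decision],      # stand-in for the LLM planner
    navigate: Callable[[str], List[str]],   # stand-in for the navigation model
    execute: Callable[[List[str]], str],    # stand-in for the tool runtime
) -> str:
    """Route one user turn according to the planner's chosen action."""
    decision = decide(query)
    if decision.action in (Action.DIRECT_ANSWER, Action.CLARIFY):
        # No tools involved: return the answer or the clarifying question.
        return decision.payload
    if decision.action is Action.INVOKE_TOOLCHAIN:
        # Ask the lower level for a whole sequence, then run it in one go.
        return execute(navigate(decision.payload))
    # EXECUTE_OUTPUT: run a single, already selected tool.
    return execute([decision.payload])
```

The point of the split is visible in the signature: the planner never enumerates tools itself, it only picks an action and delegates sequence construction to `navigate`.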

What carries the argument

The Tool World Navigation Model (TWNM), a continuously updated graph encoding that captures how tools relate to one another structurally and behaviorally so the agent can plan sequences without stepping through calls one at a time.
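To make "plan sequences without stepping through calls one at a time" concrete, here is a minimal sketch that derives a full invocation order from a dependency graph in one pass. It assumes a plain dictionary of prerequisites and a depth-first ordering; the tool names and the `plan_chain` helper are hypothetical, and the actual TWNM is described as a learned, continuously updated model rather than a static table.

```python
from typing import Dict, List, Set

# Hypothetical dependency graph: each tool maps to the tools whose outputs it needs.
DEPENDS_ON: Dict[str, List[str]] = {
    "suggest_outfit": ["get_forecast"],
    "get_forecast": ["geocode_city"],
    "geocode_city": [],
}


def plan_chain(target: str, graph: Dict[str, List[str]]) -> List[str]:
    """Return an invocation order in which every prerequisite precedes its consumer."""
    order: List[str] = []
    done: Set[str] = set()
    in_progress: Set[str] = set()

    def visit(tool: str) -> None:
        if tool in done:
            return
        if tool in in_progress:
            raise ValueError(f"cyclic dependency at {tool!r}")
        in_progress.add(tool)
        for dep in graph.get(tool, []):
            visit(dep)  # schedule prerequisites first
        in_progress.discard(tool)
        done.add(tool)
        order.append(tool)

    visit(target)
    return order


chain = plan_chain("suggest_outfit", DEPENDS_ON)
# chain == ["geocode_city", "get_forecast", "suggest_outfit"]
```

Because the whole chain is produced before any call is made, a missing or cyclic dependency surfaces at planning time instead of midway through execution.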

If this is right

  • NaviAgent attains the highest task success rates across the language models and task difficulties tested.
  • Complex multi-tool workflows show measurable gains once the navigation model is active.
  • Closed-loop updates from real executions improve both planning and execution over time.
  • Agent behavior shifts from isolated tool calls to adaptive navigation of an entire tool ecosystem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bilevel split could be applied to other settings where many components must be composed, such as library selection in code generation or service chaining in cloud workflows.
  • Maintaining an explicit relation graph may reduce the cognitive load placed on the language model itself during long-horizon planning.
  • If the model can be kept accurate at scale, the approach suggests a route toward agents that treat tool use as graph search rather than sequential guessing.

Load-bearing premise

Feedback from actual tool runs can keep updating the navigation model so that it continues to represent relations among thousands of tools accurately and without adding new sources of error or hitting scaling limits.

What would settle it

A controlled test that increases the tool count from hundreds to several thousand while tracking whether success rates stay above step-by-step baselines or begin to fall once the navigation model has received the same volume of real feedback.

Figures

Figures reproduced from arXiv: 2506.19500 by Ai Han, Hao Zhou, LiZhong GU, Tianlong Li, Yan Jiang.

Figure 1: The overview of NaviAgent.
Figure 2: Case Study of the Collaboration Workflow.
Figure 3: Evaluation of Frameworks on ToolBench Across Task Complexity.
Figure 4: Effect of SFT on TSR. Adaptability through Fine-tuning. Notably, with supervised fine-tuning, the smaller Qwen2.5-14B model achieves performance comparable to the larger 32B model (TCR 81.8% vs 83.7%, TSR 49.5% vs 44.9%, see …
Figure 5: Comparison of TSR Distribution Between Multi-Path Decider and Baselines.
Figure 6: Alpha-Beta Backward Pruning Heuristic Graph Search with Dynamic Pruning. The algorithm is parameterized by a sextuple $(T_0, \eta, P, d_{\max}, M_\theta, F_\omega)$ (see Algorithm 2), where $T_0 = 300$ is the initial temperature that determines the probability of accepting suboptimal solutions and balances exploration and exploitation, $\eta = 0.7$ is the cooling rate that controls the annealing schedule $T_{k+1} = \eta^{1+k/5} T_k$, $P = 40$ is the …
Figure 7: Hybrid Heuristic Pruning Algorithm. C Cases: The following three cases exemplify the Decider-Navigator collaborative mechanism through four core actions executed by the Decider: 1) Direct Response: resolves user queries using pre-trained knowledge. 2) Intent Clarification: initiates interactive dialogue to disambiguate vague requests. 3) Tool Retrieval: collaborates with the Navigator module to generate a pr…
Figure 8: Pruned Tool Dependency Subgraph of Case1.
Figure 9: Pruned Tool Dependency Subgraph of Case2.
Figure 10: Pruned Tool Dependency Subgraph of Case3.
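The annealing parameters quoted in the Figure 6 caption can be explored with a short worked example. The sketch below reads the flattened schedule in that caption as $T_{k+1} = \eta^{1+k/5} T_k$, which is an assumption about where the exponent belongs, and it pairs the schedule with the standard Metropolis acceptance rule, which the caption does not spell out. The function names are hypothetical.

```python
import math
from typing import List

T0 = 300.0  # initial temperature, as quoted in the Figure 6 caption
ETA = 0.7   # cooling rate, as quoted in the Figure 6 caption


def temperatures(steps: int, t0: float = T0, eta: float = ETA) -> List[float]:
    """Return [T_0, ..., T_steps] under the assumed rule T_{k+1} = eta**(1 + k/5) * T_k."""
    temps = [t0]
    for k in range(steps):
        temps.append(eta ** (1 + k / 5) * temps[-1])
    return temps


def accept_probability(delta_cost: float, temperature: float) -> float:
    """Standard Metropolis rule (assumed here, not taken from the paper):
    improvements are always accepted, worse candidates with probability exp(-delta / T)."""
    if delta_cost <= 0:
        return 1.0
    return math.exp(-delta_cost / temperature)


schedule = temperatures(5)
for k, t in enumerate(schedule):
    # Probability of accepting a candidate that is worse by 50 cost units.
    print(k, round(t, 2), round(accept_probability(50.0, t), 3))
```

Under this reading the first step cools from 300 to about 210, and because the exponent grows with k each later step cools faster than the one before, so the search tolerates worse candidates early and becomes greedy quickly.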
read the original abstract

Large language models (LLMs) have recently demonstrated the ability to act as function call agents by invoking external tools, enabling them to solve tasks beyond their static knowledge. However, existing agents typically call tools step by step at a time without a global view of task structure. As tools depend on each other, this leads to error accumulation and limited scalability, particularly when scaling to thousands of tools. To address these limitations, we propose NaviAgent, a novel bilevel architecture that decouples task planning from tool execution through graph-based modeling of the tool ecosystem. At the task-planning level, the LLM-based agent decides whether to respond directly, clarify user intent, invoke a toolchain, or execute tool outputs, ensuring broad coverage of interaction scenarios independent of inter-tool complexity. At the execution level, a continuously evolving Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, guiding the agent to generate scalable and robust invocation sequences. By incorporating feedback from real tool interactions, NaviAgent supports closed-loop optimization of planning and execution, moving beyond tool calling toward adaptive navigation of large-scale tool ecosystems. Experiments show that NaviAgent achieves the best task success rates across models and tasks, and integrating TWMN further boosts performance by up to 17 points on complex tasks, underscoring its key role in toolchain orchestration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes NaviAgent, a bilevel architecture for LLM tool agents that decouples high-level task planning (deciding to respond, clarify, invoke a toolchain, or execute outputs) from low-level execution. A continuously evolving Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools via real-interaction feedback to enable scalable, robust invocation sequences on large tool graphs. Experiments are claimed to show that NaviAgent achieves the best task success rates across models and tasks, with TWNM integration yielding up to 17-point gains on complex tasks.

Significance. If the reported performance improvements hold under rigorous evaluation, the bilevel graph-navigation approach could meaningfully advance scalable tool orchestration for LLM agents by mitigating error accumulation and providing closed-loop adaptation, addressing a recognized bottleneck when tool counts reach thousands.

major comments (2)
  1. [Experiments] Experiments section: the central claim of 'best task success rates' and 'up to 17-point boosts' is load-bearing yet unsupported by any reported baselines, metrics, task definitions, number of runs, or statistical tests in the provided text, preventing verification that the data actually supports superiority over prior agents.
  2. [§3] §3 (TWNM description): the claim that feedback from real tool interactions allows the model to 'accurately encode structural and behavioral relations among thousands of interdependent tools without introducing new error sources' lacks a concrete update rule, graph construction algorithm, or scalability analysis, making the weakest assumption untestable from the manuscript.
minor comments (2)
  1. [Abstract] Abstract, final sentence: 'TWMN' appears to be a typo for 'TWNM'.
  2. [§2] Notation: the distinction between 'toolchain' and 'tool invocation sequence' is used without a formal definition or diagram, which could be clarified for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of the bilevel graph-navigation approach to address scalability in large tool ecosystems. We address each major comment below and will perform a major revision to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim of 'best task success rates' and 'up to 17-point boosts' is load-bearing yet unsupported by any reported baselines, metrics, task definitions, number of runs, or statistical tests in the provided text, preventing verification that the data actually supports superiority over prior agents.

    Authors: We agree that the Experiments section in the current manuscript lacks sufficient detail to allow full verification of the claims. In the revised version, we will expand this section to include: explicit descriptions of all baselines (ReAct, Plan-and-Execute, Toolformer, and other relevant agents), precise definitions of metrics (task success rate as primary, with secondary metrics such as average tool calls and error rate), task definitions and datasets (ToolBench, API-Bank, and our custom large-scale tool graph with 1000+ tools), number of runs (5 independent runs with different random seeds), and statistical analysis (paired t-tests with p-values and confidence intervals). We will also include tables reporting raw success rates with standard deviations to substantiate the up to 17-point gains from TWNM integration. revision: yes

  2. Referee: [§3] §3 (TWNM description): the claim that feedback from real tool interactions allows the model to 'accurately encode structural and behavioral relations among thousands of interdependent tools without introducing new error sources' lacks a concrete update rule, graph construction algorithm, or scalability analysis, making the weakest assumption untestable from the manuscript.

    Authors: We acknowledge that Section 3 would benefit from greater concreteness. In the revision, we will add: (1) the precise update rule for real-interaction feedback (an incremental edge-weight update formula based on observed success/failure and co-invocation frequency), (2) the graph construction algorithm (nodes as tools with feature vectors, directed edges initialized from API documentation and refined via execution traces using a thresholded dependency score), and (3) a scalability analysis (O(n log n) update complexity per interaction batch with empirical curves for tool counts from 100 to 5000, plus memory footprint measurements). These additions will make the claim that the closed-loop mechanism avoids new error sources directly testable. revision: yes
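The rebuttal promises an incremental edge-weight update and a thresholded dependency score without stating either formula. As one possible shape for such a rule, the sketch below uses an exponential moving average over observed success or failure and then drops edges whose weight falls below a cutoff. The smoothing factor, prior, threshold, tool names, and both function names are hypothetical choices for illustration, not values from the paper.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (upstream tool, downstream tool)


def update_edge(
    weights: Dict[Edge, float],
    edge: Edge,
    success: bool,
    alpha: float = 0.2,   # hypothetical smoothing factor
    prior: float = 0.5,   # hypothetical weight for an edge never seen before
) -> float:
    """Blend the latest execution outcome into the stored edge weight."""
    old = weights.get(edge, prior)
    outcome = 1.0 if success else 0.0
    weights[edge] = (1.0 - alpha) * old + alpha * outcome
    return weights[edge]


def prune(weights: Dict[Edge, float], threshold: float = 0.3) -> Dict[Edge, float]:
    """Keep only edges whose weight is at least the (hypothetical) threshold."""
    return {edge: w for edge, w in weights.items() if w >= threshold}


weights: Dict[Edge, float] = {}
update_edge(weights, ("geocode_city", "get_forecast"), success=True)    # 0.5 -> 0.6
update_edge(weights, ("geocode_city", "get_forecast"), success=False)   # 0.6 -> 0.48
for _ in range(4):
    update_edge(weights, ("get_forecast", "send_fax"), success=False)   # decays toward 0
kept = prune(weights)
```

A rule of this kind costs constant time per observed call, so any scalability concern would sit in how many edges are tracked and re-pruned rather than in the update itself.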

Circularity Check

0 steps flagged

No significant circularity; architecture and experiments are self-contained

full rationale

The paper proposes NaviAgent as a bilevel architecture decoupling task planning from tool execution via a graph-based Tool World Navigation Model (TWNM) that evolves from real-interaction feedback. Central claims rest on this design choice and reported experimental success rates (including up to 17-point gains on complex tasks), without any equations, fitted parameters renamed as predictions, self-citations invoked for uniqueness theorems, or ansatzes smuggled in. No derivation step reduces by construction to its own inputs; the work is an architectural proposal validated externally by experiments rather than a closed mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on standard assumptions that LLMs can perform high-level planning and that tool dependencies can be modeled as a navigable graph updated via interaction feedback; no free parameters or invented entities beyond the TWNM are explicitly quantified in the abstract.

axioms (2)
  • [domain assumption] LLMs can reliably decide among broad interaction scenarios (respond, clarify, invoke toolchain, execute outputs) independent of inter-tool complexity.
    Invoked in the task-planning level description.
  • [domain assumption] Structural and behavioral relations among tools can be encoded in a continuously evolving graph model that guides scalable invocation sequences.
    Central to the execution-level TWNM.
invented entities (1)
  • Tool World Navigation Model (TWNM) [no independent evidence]
    purpose: Encodes structural and behavioral relations among tools to guide invocation sequences and support closed-loop optimization.
    Newly introduced component that decouples execution from planning.

pith-pipeline@v0.9.0 · 5780 in / 1336 out tokens · 20673 ms · 2026-05-19T08:08:29.660110+00:00 · methodology



Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · 7 internal anchors

  1. [1]

    Talm: Tool augmented language models

    Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022

  2. [2]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

  3. [3]

    Towards tool use alignment of large language models

    Zhi-Yuan Chen, Shiqi Shen, Guangyao Shen, Gong Zhi, Xu Chen, and Yankai Lin. Towards tool use alignment of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1382–1400, 2024

  4. [4]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023

  5. [5]

    Gpt4tools: Teaching large language model to use tools via self-instruction

    Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. Gpt4tools: Teaching large language model to use tools via self-instruction. Advances in Neural Information Processing Systems, 36:71995–72007, 2023

  6. [6]

    Tool learning with large language models: A survey

    Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. Tool learning with large language models: A survey. Frontiers of Computer Science, 19(8):198343, 2025

  7. [7]

    Chameleon: Plug-and-play compositional reasoning with large language models

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36:43447–43478, 2023

  8. [8]

    Toolverifier: Generalization to new tools via self-verification

    Dheeraj Mekala, Jason Weston, Jack Lanchantin, Roberta Raileanu, Maria Lomeli, Jingbo Shang, and Jane Dwivedi-Yu. Toolverifier: Generalization to new tools via self-verification. arXiv preprint arXiv:2402.14158, 2024

  9. [9]

    Confucius: Iterative tool learning from introspection feedback by easy-to- difficult curriculum

    Shen Gao, Zhengliang Shi, Minghang Zhu, Bowen Fang, Xin Xin, Pengjie Ren, Zhumin Chen, Jun Ma, and Zhaochun Ren. Confucius: Iterative tool learning from introspection feedback by easy-to- difficult curriculum. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18030–18038, 2024

  10. [10]

    Multi-agent systems: A survey about its components, framework and workflow

    Diego Maldonado, Edison Cruz, Jackeline Abad Torres, Patricio J Cruz, and Silvana Gamboa. Multi-agent systems: A survey about its components, framework and workflow. IEEE Access, 2024

  11. [11]

    Ai agents: Evolution, architecture, and real-world applications

    Naveen Krishnan. Ai agents: Evolution, architecture, and real-world applications. arXiv preprint arXiv:2503.12687, 2025

  12. [12]

    Multi-Agent Collaboration Mechanisms: A Survey of LLMs

    Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of llms. arXiv preprint arXiv:2501.06322, 2025

  13. [13]

    Appagentx: Evolving gui agents as proficient smartphone users,

    Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. Appagentx: Evolving gui agents as proficient smartphone users. arXiv preprint arXiv:2503.02268, 2025. 10

  14. [14]

    Exploring autonomous agents through the lens of large language models: A review

    Saikat Barua. Exploring autonomous agents through the lens of large language models: A review. arXiv preprint arXiv:2404.04442, 2024

  15. [15]

    Chain of tools: Large language model is an automatic multi-tool learner

    Zhengliang Shi, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen, Suzan Verberne, and Zhaochun Ren. Chain of tools: Large language model is an automatic multi-tool learner. arXiv preprint arXiv:2405.16533, 2024

  16. [16]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023

  17. [17]

    Toolchain*: Efficient action space navigation in large language models with a* search

    Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor Bursztyn, Ryan A Rossi, Somdeb Sarkhel, and Chao Zhang. Toolchain*: Efficient action space navigation in large language models with a* search. arXiv preprint arXiv:2310.13227, 2023

  18. [18]

    Toolnet: Connecting large language models with massive tools via tool graph

    Xukun Liu, Zhiyuan Peng, Xiaoyuan Yi, Xing Xie, Lirong Xiang, Yuchen Liu, and Dongkuan Xu. Toolnet: Connecting large language models with massive tools via tool graph. arXiv preprint arXiv:2403.00839, 2024

  19. [19]

    Cold-start recommendation towards the era of large language models (llms): A comprehensive survey and roadmap

    Weizhi Zhang, Yuanchen Bei, Liangwei Yang, Henry Peng Zou, Peilin Zhou, Aiwei Liu, Yinghui Li, Hao Chen, Jianling Wang, Yu Wang, et al. Cold-start recommendation towards the era of large language models (llms): A comprehensive survey and roadmap. arXiv preprint arXiv:2501.01945, 2025

  20. [20]

    Llmtreerec: Unleashing the power of large language models for cold-start recommendations

    Wenlin Zhang, Chuhan Wu, Xiangyang Li, Yuhao Wang, Kuicai Dong, Yichao Wang, Xinyi Dai, Xiangyu Zhao, Huifeng Guo, and Ruiming Tang. Llmtreerec: Unleashing the power of large language models for cold-start recommendations. arXiv preprint arXiv:2404.00702, 2024

  21. [21]

    Can graph learning improve planning in llm-based agents? InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

    Xixi Wu, Yifei Shen, Caihua Shan, Kaitao Song, Siwei Wang, Bohang Zhang, Jiarui Feng, Hong Cheng, Wei Chen, Yun Xiong, et al. Can graph learning improve planning in llm-based agents? InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  22. [22]

    arXiv preprint arXiv:2401.06201

    Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Ren Kan, Dongsheng Li, and De- qing Yang. Easytool: Enhancing llm-based agents with concise tool instruction. arXiv preprint arXiv:2401.06201, 2024

  23. [23]

    Concise and precise context compression for tool-using language models

    Yang Xu, Yunlong Feng, Honglin Mu, Yutai Hou, Yitong Li, Xinghao Wang, Wanjun Zhong, Zhongyang Li, Dandan Tu, Qingfu Zhu, et al. Concise and precise context compression for tool-using language models. arXiv preprint arXiv:2407.02043, 2024

  24. [24]

    Small llms are weak tool learners: A multi-llm agent, 2024

    Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. Small llms are weak tool learners: A multi-llm agent. arXiv preprint arXiv:2401.07324, 2024

  25. [25]

    Making language models better tool learners with execution feedback

    Shuofei Qiao, Honghao Gui, Chengfei Lv, Qianghuai Jia, Huajun Chen, and Ningyu Zhang. Making language models better tool learners with execution feedback. arXiv preprint arXiv:2305.13068, 2023

  26. [26]

    Toolfactory: Automating tool generation by leveraging llm to understand rest api documentations

    Xinyi Ni, Qiuyang Wang, Yukun Zhang, and Pengyu Hong. Toolfactory: Automating tool generation by leveraging llm to understand rest api documentations. arXiv preprint arXiv:2501.16945, 2025

  27. [27]

    Re- act: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. Re- act: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  28. [28]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  29. [29]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023

  30. [30]

    Controlllm: Augment language models with tools by searching on graphs

    Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, et al. Controlllm: Augment language models with tools by searching on graphs. In European Conference on Computer Vision, pages 89–105. Springer, 2024

  31. [31]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024. 11

  32. [32]

    Inductive representation learning on large graphs

    Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017

  33. [33]

    Link prediction based on graph neural networks

    Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. Advances in neural information processing systems, 31, 2018

  34. [34]

    Graph neural networks: A review of methods and applications

    Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI open, 1:57–81, 2020

  35. [35]

    An analysis of alpha-beta pruning

    Donald E Knuth and Ronald W Moore. An analysis of alpha-beta pruning. Artificial intelligence, 6(4):293–326, 1975

  36. [36]

    Optimization by simulated annealing

    Scott Kirkpatrick, C Daniel Gelatt Jr, and Mario P Vecchi. Optimization by simulated annealing. science, 220(4598):671–680, 1983

  37. [37]

    Genetic algorithms in machine learning

    Jonathan Shapiro. Genetic algorithms in machine learning. In Advanced course on artificial intelligence, pages 146–168. Springer, 1999

  38. [38]

    API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244, 2023

  39. [39]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024

  40. [40]

    Saad Tahmid and Sourav Sarker. Qwen2. 5-32b: Leveraging self-consistent tool-integrated reasoning for bengali mathematical olympiad problem solving. arXiv preprint arXiv:2411.05934, 2024

  41. [41]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  42. [42]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  43. [43]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  44. [44]

    Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE transactions on medical imaging, 35(5):1299–1312, 2016

    Nima Tajbakhsh, Jae Y Shin, Suryakanth R Gurudu, R Todd Hurst, Christopher B Kendall, Michael B Gotway, and Jianming Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE transactions on medical imaging, 35(5):1299–1312, 2016

  45. [45]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020. 12 A Graph Construction Original API Original ParameterParameter...

  46. [46]

    analysis

    Gastroenteritis Inquiry ACTION: Direct Response CONTEXT: { "analysis": "1. Common sense judgment: Gastroenteritis is a medical condition within general knowledge. 2. No API required for definition explanation. 3. Parameter verification not applicable", "action": "direct_answer", "intent": "1. Provide medical definition of gastroenteritis 2. User’s explici...

  47. [47]

    analysis

    Health Data Logging ACTION: Tool Retrieval CONTEXT: { "analysis": "1. Health data recording requires specialized API 2. No matching API in candidate list 3. Critical parameters provided: user_id=12345, blood_pressure=130/80, heart_rate=90, timestamp=2025-03-05", "action": "retrieve_api", "intent": "1. Store user health metrics 2. Provided parameters: bloo...

  48. [48]

    , "action

    Successful data storage confirmation", "action": "direct_answer", "answer": "Your health data has been successfully recorded..." } Last Round: Response Synthesis ACTION: Direct Response CONTEXT: 1.Inflammation of gastrointestinal tract Common causes: Viral/bacterial infections Key symptoms: Diarrhea, vomiting, abdominal cramps Management: Hydration and rest

  49. [49]

    No data integrity issues detected

    Health Data Record Confirmation: All parameters validated successfully through api-RecordHealthData. No data integrity issues detected. 18 C.2 Case 2 Query • Is Beijing in China? What is the current weather forecast for the location at latitude 40.7128 and longitude -74.0060 in metric units? First Round: Intent Analysis

  50. [50]

    The specific problem addressed by this action is to determine if Beijing is located within China

    Common Sense ACTION: Direct Response CONTEXT: {’dependency_rank’: 1, ’intent’: ’Is Beijing in China?’, ’analysis’: ’1. The specific problem addressed by this action is to determine if Beijing is located within China. 2. The user input directly asks about the geographical location of Beijing. 3. This is a common sense question and does not require an API c...

  51. [51]

    action":

    Get Weather ACTION: Tool Retrieval CONTEXT: {’dependency_rank’: 2, ’intent’: ’What is the current weather forecast for the location at latitude 40.7128 and longitude -74.0060 in metric units?’, ’analysis’: ’1. The specific problem addressed by this action is to retrieve the current weather forecast for a specific location using latitude and longitude coor...

  52. [52]

    The target API 'api-current_weather_data_of_a_location_for_weather' has already been successfully called and its result is available. 4. The required parameters (lat and lon) are complete and the API call has been confirmed successful.", 'action': 'direct_answer', 'answer': 'The current weather in New York is clear sky with a temperature of 11.0°C, feels ...

  53. [53]

    Intent Clarification
    ACTION: Intent Clarification
    CONTEXT: { "analysis": "1. Weather data needs to be queried in real time -> not common sense\n2. Required parameter (location) is missing", "action": "clarify_intent", "recall_description": "", "answer": "Which city do you want to query tomorrow’s weather?" }

  54. [54]

    Second Round: Intent Analysis

    User Answer
    User Answer: I’m in BeiJing.
    Second Round: Intent Analysis

  55. [55]

    Get Weather
    ACTION: Tool Retrieval
    CONTEXT: {'dependency_rank': 1, 'intent': '1. Query the weather forecast for tomorrow in Beijing 2. Extract location: Beijing and time: tomorrow from user input', 'analysis': '1. The specific problem addressed by this action is to retrieve the weather forecast for tomorrow in Beijing. 2. The user input directly provides ...

  56. [56]

    Core Requirements:
    - Generate a natural-language question where:
      • Must explicitly contain initial parameters for leaf-node APIs
      • Implicitly requires chained API calls from leaf to root node
      • Root node API’s output directly resolves the user’s problem

  57. [57]

    • All input values must originate from either: Explicitly stated in the question context Generated by previous API outputs (no synthetic values)

    Dependency Chain Rules:
    - Build parameter-passing paths where:
      • Parent API outputs must exactly match child API inputs (same parameter names & data types)
      • Root node API must be called last in the chain
      • The output of every leaf-node API must be utilized in downstream APIs or final results.
      • All input values must originate from either: Explicitly stat...

  58. [58]

    Parameter Constraints:
    - Enforce strict value inheritance:
      • Path/query parameters must use verbatim values from:
        - User’s question text
        - Preceding API response.data fields
      • Prohibit value transformation/format conversion
    - Root API output must contain realistic values matching its schema

  59. [59]

    Validation Requirements:
    - Reject generation if:
      • Missing parameter dependency between APIs
      • Input sources can’t be traced to question/prior responses
      • Output fields don’t fulfill next API’s input requirements
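The traceability rule above (every input value must come verbatim from the question text or from an earlier API output) can be checked mechanically. The following is a minimal Python sketch, not the authors' validator: the function names `flatten_values` and `untraceable_inputs` and the example API names are hypothetical, and the record layout (`api_name`, `input`, `output.data`) is assumed to follow the response structure given in the source.

```python
from typing import Any, Dict, List, Set, Tuple


def flatten_values(obj: Any) -> Set[str]:
    """Collect every leaf value of a nested dict/list as a string."""
    if isinstance(obj, dict):
        return set().union(*(flatten_values(v) for v in obj.values()))
    if isinstance(obj, list):
        return set().union(*(flatten_values(v) for v in obj))
    return {str(obj)}


def untraceable_inputs(query: str, call_chains: List[Dict]) -> List[Tuple[str, str]]:
    """Return (api_name, param) pairs whose value appears neither verbatim
    in the query text nor in the output data of an earlier call."""
    seen: Set[str] = set()  # values produced by preceding API outputs
    problems: List[Tuple[str, str]] = []
    for call in call_chains:
        for param, value in call.get("input", {}).items():
            text = str(value)
            if text not in query and text not in seen:
                problems.append((call["api_name"], param))
        seen |= flatten_values(call.get("output", {}).get("data", {}))
    return problems


# Hypothetical two-step chain: the second call reuses the first call's output.
chain = [
    {"api_name": "get_city_id", "input": {"city": "Paris"},
     "output": {"status": "success", "data": {"city_id": "C42"}}},
    {"api_name": "get_weather", "input": {"city_id": "C42", "units": "metric"},
     "output": {"status": "success", "data": {"temp": 11.0}}},
]
print(untraceable_inputs("What is the weather in Paris?", chain))
```

In this example the `units` value is flagged because "metric" is neither in the question nor in a prior output; a generator following the rules above would reject such a sample.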

  60. [60]

    Response Structure: { "query": "<Real-world scenario requiring sequential API calls>", "answer": "<Solution derived from root API output>", "call_chains": [ { "api_name": "<Leaf-node API>", "input": { "<param>": "<value explicitly stated in user query or previous API output>" }, "output": { "status": "success", "data": {"<field>": "<output used by next...

  61. [61]

    **Intent Analysis**
    - Decompose compound requests into independent ordered sub-intents
      • Sequential dependencies first, Must execute in declared order
      • Parallelizable sub-intents last
      • Dependency_rank numbering for ordered execution
    - Validate parallel execution eligibility:
      • No overlapping data requirements
      • No sequential dependencies
      • Distinct para...

  62. [62]

    **Atomic Action Formation**
    • For each validated sub-intent:
      - Create self-contained decision unit, action must implement full Decision Logic Flow
      - Maintain state separation between parallel processes
      - Focus analysis scope per sub-intent
      - Each action’s analysis focuses only on its own intent
      - Each action analysis only solves one intent
      - Must execute ...

  63. [63]

    **Common Sense Judgment Phase**
    - Input question -> Knowledge base matching
      Belongs to common sense -> action=direct_answer
      Requires external data -> Proceed to Phase 2

  64. [64]

    **API Matching Phase**

  65. [65]

    If candidate_apis is empty -> action=retrieve_api

  66. [66]

    Match intent with API list:
    API prioritization:
    - Complete parameters from user input
    - Minimal missing parameters
    - Shortest dependency chain
    API matching success:
    - Validate Observation in user input to confirm target API success:
      -> If successful -> action=direct_answer
      -> No explicit success indication:
         a) Complete parameters -> action=call_api (execu...

  67. [67]

    **Parameter Completion Phase**
    - Check required parameter set:
      All parameters ready -> action=call_api
      The target API does not require parameters -> action=call_api
      Missing parameters exist:
        a) Can be completed via dependent APIs -> Execute Rule 3.1
        b) Use Retrieval APIs resolve parameter deficiencies in API dependencies -> action=retrieve_api
        c) Requires...

  68. [68]

    <extract data segments directly related to the subtask from user input>", "analysis": "<Four-level reasoning: 1.Explicitly state the specific decision-making sub-intent addressed by this action 2.Common sense judgment basis 3.API matching logic (if applicable) 4.Parameter completeness verification>", "action": "call_api|direct_answer|retrieve_api|clarify_...

  69. [69]

    Parameter names must strictly match API documentation

  70. [70]

    The ’answer’ field for clarify_intent must contain question words

  71. [71]

    Prioritize calling parent node APIs

  72. [72]

    When action in [retrieve_api]:
    - The recall_description field serves exclusively as an API retrieval identifier from predefined repositories.
    - parameter descriptions must distinguish between input and output parameters, retaining only essential parameters
    - Each recall_description can only recall one api, multiple APIs require multiple actions

  73. [73]

    APIs absent from Candidate APIs MUST NOT be invented

  74. [74]

    action=call_api is permitted only when candidate APIs exist and the target_api is present in the candidate APIs

  75. [75]

    The "action" field must be strictly limited to one of the following four predefined operation types: call_api, direct_answer, retrieve_api or clarify_intent

  76. [76]

    Use retrieve_api only when: - Required parameters unavailable in call_api action

  77. [77]

    User input:{user_input}\nPlease generate the final response based on the following data: {observation} : Requirements:

    Use call_api only when: - The target_api is not in the list of successfully executed APIs
    ---------
    # Candidate API Information:

    E.3.2 Input Generation Prompts
    Input generation prompts: Integrate current queries with observational data to formulate the final input, ensuring informational completeness.
    User input:{user_input}\nPlease generate the final res...

  78. [78]

    Integrate all available data

  79. [79]

    Indicate data limitations (if any failed APIs exist)

  80. [80]

    They achieve automated emulation of API chains through standardized JSON responses

    Use natural and fluent English

    E.3.3 API Simulator Prompts
    API simulator prompts are based on historical data reuse (Case1) and intelligent simulation generation (Case2/3). They achieve automated emulation of API chains through standardized JSON responses. The priority strategy is as follows: historical matching > structural cloning > contextual simulat...
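The priority ordering in the simulator description can be illustrated with a three-tier fallback. This Python sketch is an interpretation under stated assumptions, not the paper's simulator: the source text is truncated before it defines each case, so the meaning given here to "structural cloning" (reuse the shape of a past response for the same API with values blanked) and to the contextual tier (a caller-supplied generator, which in the paper's setting would presumably be an LLM) is assumed, and `simulate` and `blank_like` are hypothetical names.

```python
import copy
from typing import Any, Callable, Dict, List, Optional, Tuple


def blank_like(value: Any) -> Any:
    """Copy the structure of a response, replacing every leaf with None."""
    if isinstance(value, dict):
        return {k: blank_like(v) for k, v in value.items()}
    if isinstance(value, list):
        return [blank_like(v) for v in value]
    return None


def simulate(
    api_name: str,
    params: Dict[str, Any],
    history: List[Dict[str, Any]],
    generate: Optional[Callable[[str, Dict[str, Any]], Dict[str, Any]]] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return (strategy, response), trying the tiers in priority order."""
    same_api = [h for h in history if h["api_name"] == api_name]

    # Tier 1 (historical matching): same API and same input seen before.
    for record in same_api:
        if record["input"] == params:
            return "historical", copy.deepcopy(record["output"])

    # Tier 2 (structural cloning, assumed meaning): same API, new input,
    # so reuse the response schema of the most recent call.
    if same_api:
        return "structural", blank_like(same_api[-1]["output"])

    # Tier 3 (contextual, assumed meaning): no history for this API,
    # so defer to a generator; the default is a placeholder stub.
    fallback = generate or (lambda name, p: {"status": "success", "data": {}})
    return "contextual", fallback(api_name, params)


history = [{
    "api_name": "weather",
    "input": {"city": "Beijing"},
    "output": {"status": "success", "data": {"temp": 11.0}},
}]
print(simulate("weather", {"city": "Beijing"}, history)[0])
print(simulate("weather", {"city": "Paris"}, history)[0])
print(simulate("stocks", {}, history)[0])
```

Returning the strategy name alongside the response makes it easy to audit how often a simulated chain relied on real historical data versus generated output.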

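One possible reading of the decision flow in the prompt excerpts above (common sense judgment, then API matching, then parameter completion) is sketched below in Python. The `SubIntent` class, its fields, and `decide_action` are hypothetical names introduced for illustration; the source states these rules only in prompt form. The final branch, which asks the user when a missing parameter cannot be obtained from another API, is inferred from the Intent Clarification case rather than from the truncated rule text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# The four operation types the prompt allows for the "action" field.
ACTIONS = {"call_api", "direct_answer", "retrieve_api", "clarify_intent"}


@dataclass
class SubIntent:
    """Hypothetical container for one decomposed sub-intent."""
    is_common_sense: bool
    target_api: Optional[str] = None
    required_params: Set[str] = field(default_factory=set)
    provided_params: Set[str] = field(default_factory=set)
    # Assumption: missing parameters that some other API could supply.
    obtainable_params: Set[str] = field(default_factory=set)


def decide_action(intent: SubIntent, candidate_apis: List[str],
                  succeeded_apis: Set[str]) -> str:
    # Phase 1: common sense questions are answered directly.
    if intent.is_common_sense:
        return "direct_answer"
    # Phase 2: never invent an API; retrieve when no candidate matches.
    if not candidate_apis or intent.target_api not in candidate_apis:
        return "retrieve_api"
    # A target API that already succeeded is not called again.
    if intent.target_api in succeeded_apis:
        return "direct_answer"
    # Phase 3: parameter completion.
    missing = intent.required_params - intent.provided_params
    if not missing:
        return "call_api"
    if missing <= intent.obtainable_params:
        return "retrieve_api"
    return "clarify_intent"


weather = SubIntent(False, "weather", {"location"}, set())
action = decide_action(weather, ["weather"], set())
assert action in ACTIONS
print(action)
```

The last lines mirror the clarification case in the source, where a weather query without a location leads the agent to ask the user which city is meant.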