pith. machine review for the scientific record. sign in

arxiv: 2512.17052 · v4 · submitted 2025-12-18 · 💻 cs.LG

Recognition: no theorem link

Dynamic Tool Dependency Retrieval for Lightweight Function Calling

Authors on Pith no claims yet

Pith reviewed 2026-05-16 21:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords function callingtool retrievalLLM agentsdynamic retrievaltool dependenciesadaptive selectionon-device agentslightweight retrieval
0
0 comments X

The pith

Dynamic Tool Dependency Retrieval improves function calling success rates by 23 to 104 percent over static retrievers by adapting to the evolving plan.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Dynamic Tool Dependency Retrieval to help LLM agents select external tools for complex tasks without bloating context. Existing static methods rely only on the initial query and miss how one tool call affects which tools become relevant next. DTDR instead learns dependencies from demonstration data and conditions retrieval on both the query and the current plan as it unfolds. This adaptive approach reduces irrelevant tools that mislead the agent. Benchmarks across datasets and model backbones show clear gains in retrieval precision, task success, and efficiency.

Core claim

DTDR is a lightweight retrieval method that models tool dependencies from function calling demonstrations. It conditions retrieval on the initial query together with the evolving tool calling plan, enabling the system to fetch relevant tools adaptively rather than from a fixed set. The approach improves both retrieval precision and downstream function calling accuracy while remaining computationally light.

What carries the argument

Dynamic Tool Dependency Retrieval (DTDR), a module that learns conditional tool dependencies from demonstration sequences and uses them to adapt retrieval as the plan changes.

If this is right

  • Higher success rates on function calling tasks that require sequencing multiple tools.
  • Shorter effective context lengths for on-device agents because fewer irrelevant tools enter the prompt.
  • Improved robustness when the same task is executed with different LLM backbones.
  • Practical strategies for deciding how retrieved tools are inserted into the agent's prompt.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dependency-modeling idea could be applied to other sequential agent behaviors such as planning or multi-turn dialogue.
  • Static tool lists may create systematic errors precisely on the longer-horizon tasks where agents are most useful.
  • One could measure whether the learned dependency graph transfers to entirely new tool libraries without additional demonstrations.

Load-bearing premise

Tool dependencies extracted from demonstration data will generalize reliably to new queries and support accurate adaptive retrieval as plans evolve.

What would settle it

A controlled test on a held-out set of multi-step queries where DTDR's dynamic selections produce lower end-to-end success rates than a strong static top-k retriever.

Figures

Figures reproduced from arXiv: 2512.17052 by Aleksandr Ermolov, Amir Jalalirad, Bence Major, Bhrij Patel, Davide Belli, Maximilian Arnold.

Figure 1
Figure 1. Figure 1: Dynamic Tool Dependency Retrieval. Given demonstration data for a set of tools, previous work retrieve tools based on either a) the natural language query (highlighted in orange) or b) the latest executed tool call in the plan (the red triangle, highlighted in teal). We instead propose a retrieval method which is dynamically conditioned on both the query and the current history of tool calls (blue square, … view at source ↗
Figure 2
Figure 2. Figure 2: System diagram for DTDR. On the left, the user query and tool history are input to DTDR to retrieve the most likely next tools. The LLM π selects the next tool among this set. On the right, we show the two alternative instantiations for the retriever: a) DTDR-C, based on a clustering step to retrieve an explicit graph of tool dependencies; and b) DTDR-L, based on a learned linear classifier implicitly mode… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of efficient ICL methods against ICL with Raw Demonstrations and the baseline without ICL. All results [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Prompt length across different methods and datasets. Our method reduces the prompt length by: 1) efficiently encoding [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ablations on: a) history length, b) # of k-means clusters, and c) # of demonstrations. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Function calling agents powered by Large Language Models (LLMs) select external tools to automate complex tasks. On-device agents typically use a retrieval module to select relevant tools, improving performance and reducing context length. However, existing retrieval methods rely on static and limited inputs, failing to capture multi-step tool dependencies and evolving task context. This limitation often introduces irrelevant tools that mislead the agent, degrading efficiency and accuracy. We propose Dynamic Tool Dependency Retrieval (DTDR), a lightweight retrieval method that conditions on both the initial query and the evolving tool calling plan. DTDR models tool dependencies from function calling demonstrations, enabling adaptive retrieval as plans unfold. We benchmark DTDR against state-of-the-art retrieval methods across multiple datasets and LLM backbones, evaluating retrieval precision, downstream task accuracy, and computational efficiency. Additionally, we explore strategies to integrate retrieved tools into prompts. Our results show that DTDR improves function calling success rates between $23\%$ and $104\%$ compared to state-of-the-art static retrievers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes Dynamic Tool Dependency Retrieval (DTDR), a lightweight retrieval method for LLM-powered function calling agents. DTDR conditions tool retrieval on both the initial user query and the evolving multi-step tool-calling plan, modeling dependencies directly from function-calling demonstrations to enable adaptive retrieval. The authors evaluate DTDR against state-of-the-art static retrievers across multiple datasets and LLM backbones, reporting gains in function-calling success rates of 23% to 104%, together with measurements of retrieval precision, downstream task accuracy, computational efficiency, and prompt-integration strategies.

Significance. If the reported gains hold under the described experimental conditions, the work is significant for on-device agent design: it directly targets the failure mode of static retrievers that ignore plan evolution and tool dependencies, while remaining lightweight. The breadth of the evaluation (multiple datasets, backbones, and integration strategies) supplies concrete empirical support for the central claim and offers practical guidance on prompt construction.

minor comments (3)
  1. [§4] §4 (Experimental Setup): the description of how tool-dependency graphs are extracted from demonstrations should include the exact prompting template and any filtering rules applied to the demonstration traces, as these choices directly affect reproducibility of the 23–104% range.
  2. [Table 2, Figure 3] Table 2 and Figure 3: error bars or standard deviations across the N runs are not reported for the success-rate columns; adding them would strengthen the claim that DTDR consistently outperforms the static baselines.
  3. [§5.2] §5.2 (Integration Strategies): the ablation that isolates the contribution of dynamic conditioning versus static top-k retrieval should be presented with the same metric suite used in the main results to allow direct comparison.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work on Dynamic Tool Dependency Retrieval (DTDR) and for recognizing its significance for on-device agent design. We are grateful for the recommendation of minor revision.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces DTDR as an empirical retrieval method that models tool dependencies from function-calling demonstrations and evaluates it via direct benchmarking on retrieval precision, task accuracy, and efficiency across datasets and LLM backbones. No equations, derivations, or parameter-fitting steps are present that could reduce any claimed prediction to its inputs by construction. Performance gains (23-104%) are reported as measured outcomes rather than self-referential outputs, and no load-bearing self-citations or uniqueness theorems appear in the abstract or described method. The central claims rest on external empirical comparisons, making the work self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The method implicitly assumes that demonstration data contains learnable tool dependencies that transfer to new tasks.

pith-pipeline@v0.9.0 · 5482 in / 1161 out tokens · 42834 ms · 2026-05-16T21:11:35.055420+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    , " * write output.state after.block = add.period write newline

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Anthropic. 2025. Claude 4 System Card: Opus and Sonnet Models. https://www.anthropic.com/research. Comprehensive technical and safety report for Claude 4 models

  4. [4]

    Barres, V.; Dong, H.; Ray, S.; Si, X.; and Narasimhan, K. 2025. ^ 2 -Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv preprint arXiv:2506.07982

  5. [5]

    Braunschweiler, N.; Doddipatla, R.; and Zorila, T.-C. 2025. ToolReAGt: Tool Retrieval for LLM-based Complex Task Solution via Retrieval Augmented Generation. In Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM), 75--83

  6. [6]

    Chen, W.; Li, W.; Yao, D.; Meng, X.; Gong, C.; and Bi, J. 2025. GTool: Graph Enhanced Tool Planning with Large Language Model. arXiv preprint arXiv:2508.12725

  7. [7]

    P.; Tianjun, Z.; Ion, S.; and Joseph, E

    Cheng-Jie Ji, C.; Huanzhi, M.; Fanjia, Y.; Shishir, G. P.; Tianjun, Z.; Ion, S.; and Joseph, E. G. 2024. Gorilla OpenFunctions v2

  8. [8]

    Dang, H.; Liu, T.; Wu, Z.; Yang, J.; Jiang, H.; Yang, T.; Chen, P.; Wang, Z.; Wang, H.; Li, H.; et al. 2025. Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates. arXiv preprint arXiv:2509.18076

  9. [9]

    Ding, K.; Yu, J.; Huang, J.; Yang, Y.; Zhang, Q.; and Chen, H. 2025. SciToolAgent: a knowledge-graph-driven scientific agent for multitool integration. Nature Computational Science, 1--11

  10. [10]

    E.; Lee, N.; Jha, S.; Kim, S.; Tabrizi, R.; Moon, S.; Hooper, C.; Anumanchipalli, G.; Keutzer, K.; and Gholami, A

    Erdogan, L. E.; Lee, N.; Jha, S.; Kim, S.; Tabrizi, R.; Moon, S.; Hooper, C.; Anumanchipalli, G.; Keutzer, K.; and Gholami, A. 2024. Tinyagent: Function calling at the edge. arXiv preprint arXiv:2409.00608

  11. [11]

    Federici, M.; Belli, D.; Van Baalen, M.; Jalalirad, A.; Skliar, A.; Major, B.; Nagel, M.; and Whatmough, P. 2025. Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking. In Eighth Conference on Machine Learning and Systems

  12. [12]

    Gao, L.; Wang, Y.; Peng, M.; Tang, J.; Shang, Y.; Sun, M.; and Su, J. 2025. Tool Graph Retriever: Exploring Dependency Graph-based Tool Retrieval for Large Language Models. arXiv preprint arXiv:2508.05152

  13. [13]

    GPT-4o System Card

    Hurst, A.; Lerer, A.; Goucher, A. P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

  14. [14]

    Lin, Q.; Wen, M.; Peng, Q.; Nie, G.; Liao, J.; Wang, J.; Mo, X.; Zhou, J.; Cheng, C.; Zhao, Y.; et al. 2024. Hammer: Robust function-calling for on-device language models via function masking. arXiv preprint arXiv:2410.04587

  15. [15]

    Liu, X.; Peng, Z.; Yi, X.; Xie, X.; Xiang, L.; Liu, Y.; and Xu, D. 2024 a . Toolnet: Connecting large language models with massive tools via tool graph. arXiv preprint arXiv:2403.00839

  16. [16]

    Liu, Z.; Lai, Z.; Gao, Z.; Cui, E.; Li, Z.; Zhu, X.; Lu, L.; Chen, Q.; Qiao, Y.; Dai, J.; et al. 2024 b . Controlllm: Augment language models with tools by searching on graphs. In European Conference on Computer Vision, 89--105. Springer

  17. [17]

    OpenAI. 2025. GPT-5 System Card. https://openai.com/research/index/publication/. Describes the architecture, safety measures, and capabilities of GPT-5

  18. [18]

    Paramanayakam, V.; Karatzas, A.; Anagnostopoulos, I.; and Stamoulis, D. 2025. Less is more: Optimizing function calling for llm execution on edge devices. In 2025 Design, Automation & Test in Europe Conference (DATE), 1--7. IEEE

  19. [19]

    Paranjape, B.; Lundberg, S.; Singh, S.; Hajishirzi, H.; Zettlemoyer, L.; and Ribeiro, M. T. 2023. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014

  20. [20]

    Patel, B.; Jagmohan, A.; and Vempaty, A. 2025. Learning API Functionality from In-Context Demonstrations for Tool-based Agents. Empirical Methods of Natural Language Processing

  21. [21]

    G.; Mao, H.; Yan, F.; Ji, C

    Patil, S. G.; Mao, H.; Yan, F.; Ji, C. C.-J.; Suresh, V.; Stoica, I.; and Gonzalez, J. E. 2025. The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models. In Forty-second International Conference on Machine Learning

  22. [22]

    G.; Zhang, T.; Wang, X.; and Gonzalez, J

    Patil, S. G.; Zhang, T.; Wang, X.; and Gonzalez, J. E. 2024. Gorilla: Large language model connected with massive apis. Advances in Neural Information Processing Systems, 37: 126544--126565

  23. [23]

    Qiao, S.; Gui, H.; Lv, C.; Jia, Q.; Chen, H.; and Zhang, N. 2023. Making language models better tool learners with execution feedback. arXiv preprint arXiv:2305.13068

  24. [24]

    Qin, Y.; Liang, S.; Ye, Y.; Zhu, K.; Yan, L.; Lu, Y.; Lin, Y.; Cong, X.; Tang, X.; Qian, B.; et al. 2023. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789

  25. [25]

    Rabinovich, E.; and Anaby-Tavor, A. 2025. On the robustness of agentic function calling. arXiv preprint arXiv:2504.00914

  26. [26]

    Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics

  27. [27]

    Robertson, S.; Zaragoza, H.; et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval , 3(4): 333--389

  28. [28]

    Sarukkai, V.; Xie, Z.; and Fatahalian, K. 2025. Self-generated in-context examples improve llm agents for sequential decision-making tasks. arXiv preprint arXiv:2505.00234

  29. [29]

    Schick, T.; Dwivedi-Yu, J.; Dess \` , R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; and Scialom, T. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36: 68539--68551

  30. [30]

    Shen, Y.; Song, K.; Tan, X.; Li, D.; Lu, W.; and Zhuang, Y. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36: 38154--38180

  31. [31]

    Shen, Y.; Song, K.; Tan, X.; Zhang, W.; Ren, K.; Yuan, S.; Lu, W.; Li, D.; and Zhuang, Y. 2024. Taskbench: Benchmarking large language models for task automation. Advances in Neural Information Processing Systems, 37: 4540--4574

  32. [32]

    Song, Q.; Liao, P.; Zhao, W.; Wang, Y.; Hu, S.; Zhen, H.-L.; Jiang, N.; and Yuan, M. 2025. Harnessing On-Device Large Language Model: Empirical Results and Implications for AI PC. arXiv preprint arXiv:2505.15030

  33. [33]

    Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

  34. [34]

    Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; and Cao, Y. 2023. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR)

  35. [35]

    Zeng, A.; Lv, X.; Zheng, Q.; Hou, Z.; Chen, B.; Xie, C.; Wang, C.; Yin, D.; Zeng, H.; Zhang, J.; et al. 2025. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471