arxiv: 2512.17052 · v4 · submitted 2025-12-18 · 💻 cs.LG

Recognition: no theorem link

Dynamic Tool Dependency Retrieval for Lightweight Function Calling

Bhrij Patel , Davide Belli , Amir Jalalirad , Maximilian Arnold , Aleksandr Ermolov , Bence Major

Authors on Pith no claims yet

Pith reviewed 2026-05-16 21:11 UTC · model grok-4.3

classification 💻 cs.LG

keywords function callingtool retrievalLLM agentsdynamic retrievaltool dependenciesadaptive selectionon-device agentslightweight retrieval

0 comments

The pith

Dynamic Tool Dependency Retrieval improves function calling success rates by 23 to 104 percent over static retrievers by adapting to the evolving plan.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Dynamic Tool Dependency Retrieval to help LLM agents select external tools for complex tasks without bloating context. Existing static methods rely only on the initial query and miss how one tool call affects which tools become relevant next. DTDR instead learns dependencies from demonstration data and conditions retrieval on both the query and the current plan as it unfolds. This adaptive approach reduces irrelevant tools that mislead the agent. Benchmarks across datasets and model backbones show clear gains in retrieval precision, task success, and efficiency.

Core claim

DTDR is a lightweight retrieval method that models tool dependencies from function calling demonstrations. It conditions retrieval on the initial query together with the evolving tool calling plan, enabling the system to fetch relevant tools adaptively rather than from a fixed set. The approach improves both retrieval precision and downstream function calling accuracy while remaining computationally light.

What carries the argument

Dynamic Tool Dependency Retrieval (DTDR), a module that learns conditional tool dependencies from demonstration sequences and uses them to adapt retrieval as the plan changes.

If this is right

Higher success rates on function calling tasks that require sequencing multiple tools.
Shorter effective context lengths for on-device agents because fewer irrelevant tools enter the prompt.
Improved robustness when the same task is executed with different LLM backbones.
Practical strategies for deciding how retrieved tools are inserted into the agent's prompt.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dependency-modeling idea could be applied to other sequential agent behaviors such as planning or multi-turn dialogue.
Static tool lists may create systematic errors precisely on the longer-horizon tasks where agents are most useful.
One could measure whether the learned dependency graph transfers to entirely new tool libraries without additional demonstrations.

Load-bearing premise

Tool dependencies extracted from demonstration data will generalize reliably to new queries and support accurate adaptive retrieval as plans evolve.

What would settle it

A controlled test on a held-out set of multi-step queries where DTDR's dynamic selections produce lower end-to-end success rates than a strong static top-k retriever.

Figures

Figures reproduced from arXiv: 2512.17052 by Aleksandr Ermolov, Amir Jalalirad, Bence Major, Bhrij Patel, Davide Belli, Maximilian Arnold.

**Figure 1.** Figure 1: Dynamic Tool Dependency Retrieval. Given demonstration data for a set of tools, previous work retrieve tools based on either a) the natural language query (highlighted in orange) or b) the latest executed tool call in the plan (the red triangle, highlighted in teal). We instead propose a retrieval method which is dynamically conditioned on both the query and the current history of tool calls (blue square, … view at source ↗

**Figure 2.** Figure 2: System diagram for DTDR. On the left, the user query and tool history are input to DTDR to retrieve the most likely next tools. The LLM π selects the next tool among this set. On the right, we show the two alternative instantiations for the retriever: a) DTDR-C, based on a clustering step to retrieve an explicit graph of tool dependencies; and b) DTDR-L, based on a learned linear classifier implicitly mode… view at source ↗

**Figure 3.** Figure 3: Comparison of efficient ICL methods against ICL with Raw Demonstrations and the baseline without ICL. All results [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Prompt length across different methods and datasets. Our method reduces the prompt length by: 1) efficiently encoding [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Ablations on: a) history length, b) # of k-means clusters, and c) # of demonstrations. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Function calling agents powered by Large Language Models (LLMs) select external tools to automate complex tasks. On-device agents typically use a retrieval module to select relevant tools, improving performance and reducing context length. However, existing retrieval methods rely on static and limited inputs, failing to capture multi-step tool dependencies and evolving task context. This limitation often introduces irrelevant tools that mislead the agent, degrading efficiency and accuracy. We propose Dynamic Tool Dependency Retrieval (DTDR), a lightweight retrieval method that conditions on both the initial query and the evolving tool calling plan. DTDR models tool dependencies from function calling demonstrations, enabling adaptive retrieval as plans unfold. We benchmark DTDR against state-of-the-art retrieval methods across multiple datasets and LLM backbones, evaluating retrieval precision, downstream task accuracy, and computational efficiency. Additionally, we explore strategies to integrate retrieved tools into prompts. Our results show that DTDR improves function calling success rates between $23\%$ and $104\%$ compared to state-of-the-art static retrievers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DTDR adapts retrieval to evolving plans from demonstrations and delivers clear gains over static baselines in the reported benchmarks.

read the letter

The core advance is conditioning retrieval on the initial query plus the tool-calling plan as it develops, rather than treating each step as an isolated lookup. This directly targets the multi-step dependency problem that static retrievers miss in on-device agents. The authors model those dependencies from function-calling demonstrations and then retrieve adaptively, which is a straightforward but useful shift from prior work. They back it with comparisons across several datasets, multiple LLM backbones, and different prompt-integration strategies, tracking retrieval precision, downstream accuracy, and efficiency. The 23-104% success-rate improvements over state-of-the-art static methods are the main empirical result, and the stress-test confirms the benchmarking setup supplies direct support without internal contradictions or obvious fitting artifacts. No equations or derivations appear, so circularity is not an issue. The softer spot is how well the learned dependencies generalize when test queries diverge from the demonstration distribution; the paper shows gains in the tested regimes, but the upper-end improvements may reflect easier cases. Plan representation details and any error bars or significance tests would strengthen the claims further, though nothing load-bearing seems missing. This work is aimed at people building lightweight agents who need better tool selection without cloud calls. A reader focused on practical retrieval for function calling will find usable comparisons and a clear adaptive method. It deserves a serious referee because the idea is grounded, the experiments address the stated problem, and the results are falsifiable.

Referee Report

0 major / 3 minor

Summary. The paper proposes Dynamic Tool Dependency Retrieval (DTDR), a lightweight retrieval method for LLM-powered function calling agents. DTDR conditions tool retrieval on both the initial user query and the evolving multi-step tool-calling plan, modeling dependencies directly from function-calling demonstrations to enable adaptive retrieval. The authors evaluate DTDR against state-of-the-art static retrievers across multiple datasets and LLM backbones, reporting gains in function-calling success rates of 23% to 104%, together with measurements of retrieval precision, downstream task accuracy, computational efficiency, and prompt-integration strategies.

Significance. If the reported gains hold under the described experimental conditions, the work is significant for on-device agent design: it directly targets the failure mode of static retrievers that ignore plan evolution and tool dependencies, while remaining lightweight. The breadth of the evaluation (multiple datasets, backbones, and integration strategies) supplies concrete empirical support for the central claim and offers practical guidance on prompt construction.

minor comments (3)

[§4] §4 (Experimental Setup): the description of how tool-dependency graphs are extracted from demonstrations should include the exact prompting template and any filtering rules applied to the demonstration traces, as these choices directly affect reproducibility of the 23–104% range.
[Table 2, Figure 3] Table 2 and Figure 3: error bars or standard deviations across the N runs are not reported for the success-rate columns; adding them would strengthen the claim that DTDR consistently outperforms the static baselines.
[§5.2] §5.2 (Integration Strategies): the ablation that isolates the contribution of dynamic conditioning versus static top-k retrieval should be presented with the same metric suite used in the main results to allow direct comparison.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work on Dynamic Tool Dependency Retrieval (DTDR) and for recognizing its significance for on-device agent design. We are grateful for the recommendation of minor revision.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces DTDR as an empirical retrieval method that models tool dependencies from function-calling demonstrations and evaluates it via direct benchmarking on retrieval precision, task accuracy, and efficiency across datasets and LLM backbones. No equations, derivations, or parameter-fitting steps are present that could reduce any claimed prediction to its inputs by construction. Performance gains (23-104%) are reported as measured outcomes rather than self-referential outputs, and no load-bearing self-citations or uniqueness theorems appear in the abstract or described method. The central claims rest on external empirical comparisons, making the work self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The method implicitly assumes that demonstration data contains learnable tool dependencies that transfer to new tasks.

pith-pipeline@v0.9.0 · 5482 in / 1161 out tokens · 42834 ms · 2026-05-16T21:11:35.055420+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
cs.SE 2026-04 accept novelty 5.0

LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page
[3]

Anthropic. 2025. Claude 4 System Card: Opus and Sonnet Models. https://www.anthropic.com/research. Comprehensive technical and safety report for Claude 4 models

work page 2025
[4]

Barres, V.; Dong, H.; Ray, S.; Si, X.; and Narasimhan, K. 2025. ^ 2 -Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv preprint arXiv:2506.07982

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Braunschweiler, N.; Doddipatla, R.; and Zorila, T.-C. 2025. ToolReAGt: Tool Retrieval for LLM-based Complex Task Solution via Retrieval Augmented Generation. In Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM), 75--83

work page 2025
[6]

Chen, W.; Li, W.; Yao, D.; Meng, X.; Gong, C.; and Bi, J. 2025. GTool: Graph Enhanced Tool Planning with Large Language Model. arXiv preprint arXiv:2508.12725

work page arXiv 2025
[7]

P.; Tianjun, Z.; Ion, S.; and Joseph, E

Cheng-Jie Ji, C.; Huanzhi, M.; Fanjia, Y.; Shishir, G. P.; Tianjun, Z.; Ion, S.; and Joseph, E. G. 2024. Gorilla OpenFunctions v2

work page 2024
[8]

Dang, H.; Liu, T.; Wu, Z.; Yang, J.; Jiang, H.; Yang, T.; Chen, P.; Wang, Z.; Wang, H.; Li, H.; et al. 2025. Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates. arXiv preprint arXiv:2509.18076

work page arXiv 2025
[9]

Ding, K.; Yu, J.; Huang, J.; Yang, Y.; Zhang, Q.; and Chen, H. 2025. SciToolAgent: a knowledge-graph-driven scientific agent for multitool integration. Nature Computational Science, 1--11

work page 2025
[10]

E.; Lee, N.; Jha, S.; Kim, S.; Tabrizi, R.; Moon, S.; Hooper, C.; Anumanchipalli, G.; Keutzer, K.; and Gholami, A

Erdogan, L. E.; Lee, N.; Jha, S.; Kim, S.; Tabrizi, R.; Moon, S.; Hooper, C.; Anumanchipalli, G.; Keutzer, K.; and Gholami, A. 2024. Tinyagent: Function calling at the edge. arXiv preprint arXiv:2409.00608

work page arXiv 2024
[11]

Federici, M.; Belli, D.; Van Baalen, M.; Jalalirad, A.; Skliar, A.; Major, B.; Nagel, M.; and Whatmough, P. 2025. Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking. In Eighth Conference on Machine Learning and Systems

work page 2025
[12]

Gao, L.; Wang, Y.; Peng, M.; Tang, J.; Shang, Y.; Sun, M.; and Su, J. 2025. Tool Graph Retriever: Exploring Dependency Graph-based Tool Retrieval for Large Language Models. arXiv preprint arXiv:2508.05152

work page arXiv 2025
[13]

GPT-4o System Card

Hurst, A.; Lerer, A.; Goucher, A. P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Lin, Q.; Wen, M.; Peng, Q.; Nie, G.; Liao, J.; Wang, J.; Mo, X.; Zhou, J.; Cheng, C.; Zhao, Y.; et al. 2024. Hammer: Robust function-calling for on-device language models via function masking. arXiv preprint arXiv:2410.04587

work page arXiv 2024
[15]

Liu, X.; Peng, Z.; Yi, X.; Xie, X.; Xiang, L.; Liu, Y.; and Xu, D. 2024 a . Toolnet: Connecting large language models with massive tools via tool graph. arXiv preprint arXiv:2403.00839

work page arXiv 2024
[16]

Liu, Z.; Lai, Z.; Gao, Z.; Cui, E.; Li, Z.; Zhu, X.; Lu, L.; Chen, Q.; Qiao, Y.; Dai, J.; et al. 2024 b . Controlllm: Augment language models with tools by searching on graphs. In European Conference on Computer Vision, 89--105. Springer

work page 2024
[17]

OpenAI. 2025. GPT-5 System Card. https://openai.com/research/index/publication/. Describes the architecture, safety measures, and capabilities of GPT-5

work page 2025
[18]

Paramanayakam, V.; Karatzas, A.; Anagnostopoulos, I.; and Stamoulis, D. 2025. Less is more: Optimizing function calling for llm execution on edge devices. In 2025 Design, Automation & Test in Europe Conference (DATE), 1--7. IEEE

work page 2025
[19]

Paranjape, B.; Lundberg, S.; Singh, S.; Hajishirzi, H.; Zettlemoyer, L.; and Ribeiro, M. T. 2023. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Patel, B.; Jagmohan, A.; and Vempaty, A. 2025. Learning API Functionality from In-Context Demonstrations for Tool-based Agents. Empirical Methods of Natural Language Processing

work page 2025
[21]

G.; Mao, H.; Yan, F.; Ji, C

Patil, S. G.; Mao, H.; Yan, F.; Ji, C. C.-J.; Suresh, V.; Stoica, I.; and Gonzalez, J. E. 2025. The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models. In Forty-second International Conference on Machine Learning

work page 2025
[22]

G.; Zhang, T.; Wang, X.; and Gonzalez, J

Patil, S. G.; Zhang, T.; Wang, X.; and Gonzalez, J. E. 2024. Gorilla: Large language model connected with massive apis. Advances in Neural Information Processing Systems, 37: 126544--126565

work page 2024
[23]

Qiao, S.; Gui, H.; Lv, C.; Jia, Q.; Chen, H.; and Zhang, N. 2023. Making language models better tool learners with execution feedback. arXiv preprint arXiv:2305.13068

work page arXiv 2023
[24]

Qin, Y.; Liang, S.; Ye, Y.; Zhu, K.; Yan, L.; Lu, Y.; Lin, Y.; Cong, X.; Tang, X.; Qian, B.; et al. 2023. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Rabinovich, E.; and Anaby-Tavor, A. 2025. On the robustness of agentic function calling. arXiv preprint arXiv:2504.00914

work page arXiv 2025
[26]

Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics

work page 2019
[27]

Robertson, S.; Zaragoza, H.; et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval , 3(4): 333--389

work page 2009
[28]

Sarukkai, V.; Xie, Z.; and Fatahalian, K. 2025. Self-generated in-context examples improve llm agents for sequential decision-making tasks. arXiv preprint arXiv:2505.00234

work page arXiv 2025
[29]

Schick, T.; Dwivedi-Yu, J.; Dess \` , R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; and Scialom, T. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36: 68539--68551

work page 2023
[30]

Shen, Y.; Song, K.; Tan, X.; Li, D.; Lu, W.; and Zhuang, Y. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36: 38154--38180

work page 2023
[31]

Shen, Y.; Song, K.; Tan, X.; Zhang, W.; Ren, K.; Yuan, S.; Lu, W.; Li, D.; and Zhuang, Y. 2024. Taskbench: Benchmarking large language models for task automation. Advances in Neural Information Processing Systems, 37: 4540--4574

work page 2024
[32]

Song, Q.; Liao, P.; Zhao, W.; Wang, Y.; Hu, S.; Zhen, H.-L.; Jiang, N.; and Yuan, M. 2025. Harnessing On-Device Large Language Model: Empirical Results and Implications for AI PC. arXiv preprint arXiv:2505.15030

work page arXiv 2025
[33]

Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; and Cao, Y. 2023. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR)

work page 2023
[35]

Zeng, A.; Lv, X.; Zheng, Q.; Hou, Z.; Chen, B.; Xie, C.; Wang, C.; Yin, D.; Zeng, H.; Zhang, J.; et al. 2025. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471

work page internal anchor Pith review Pith/arXiv arXiv 2025