FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
Pith reviewed 2026-05-08 18:59 UTC · model grok-4.3
The pith
Embedding evolutionary retrieval inside an agent's reasoning loop allows dynamic adaptation to evolving task needs in large tool ecosystems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FitText is a training-free framework that integrates dynamic retrieval into the agent's reasoning loop: it generates pseudo-tool descriptions, refines them iteratively using retrieval feedback, and applies evolutionary selection through Memetic Retrieval guided by a tool memory. It achieves an average retrieval rank of 2.78 on ToolRet (43k tools) and a 0.73 pass rate on StableToolBench (16k APIs), a 24-point gain over static retrieval.
What carries the argument
Memetic Retrieval: evolutionary selection over candidate pseudo-tool descriptions guided by tool memory to apply selection pressure without redundant search.
Load-bearing premise
The base model must be capable of acting as a competent semantic operator for generating and refining pseudo-tool descriptions.
What would settle it
Running the method with a weaker base model on the same ToolRet and StableToolBench datasets and observing if retrieval ranks worsen and task pass rates fall below the static baseline.
Original abstract
A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText generates natural-language pseudo-tool descriptions as retrieval probes, refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (43k tools, 4 domains), FitText improves average retrieval rank from 8.81 to 2.78; on StableToolBench (16,464 APIs), it achieves a 0.73 average pass rate--a 24-point absolute gain over static query retrieval. The gains transfer across base models capable of acting as competent semantic operators; under weaker base models, Memetic's evolutionary search inverts--amplifying noise rather than refining signal--surfacing model capacity as a prerequisite for evolutionary tool exploration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FitText, a training-free framework that embeds dynamic retrieval into an agent's reasoning loop. It generates natural-language pseudo-tool descriptions as retrieval probes, refines them iteratively via retrieval feedback and stochastic generation, and applies memetic retrieval with evolutionary selection pressure over candidates, guided by a tool memory to avoid redundancy. On ToolRet (43k tools across 4 domains) it reports improving average retrieval rank from 8.81 to 2.78; on StableToolBench (16,464 APIs) it reports a 0.73 average pass rate, a 24-point absolute gain over static query retrieval. Gains are stated to transfer across sufficiently capable base models but invert under weaker models that amplify noise.
Significance. If the empirical claims are reproducible, FitText would provide a practical, training-free route to adaptive tool use in large-scale API ecosystems, directly addressing the semantic gap between evolving agent intent and static tool documentation. The explicit conditioning on base-model semantic competence is a strength that clarifies scope.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experimental Evaluation): the headline quantitative results (rank drop from 8.81 to 2.78 on ToolRet; 24-point pass-rate gain on StableToolBench) are presented without reported variance, number of runs, statistical significance tests, or precise re-implementations of the static-retrieval baselines. This absence prevents assessment of whether the observed deltas are robust or sensitive to implementation details.
- [§3.2] §3.2 (Memetic Retrieval): the evolutionary selection mechanism is described at a high level, but no explicit fitness function, population size, or termination criterion is given; without these it is difficult to determine whether the reported gains arise from the claimed memetic dynamics or from other uncontrolled factors in the generation loop.
- [§5] §5 (Discussion of model capacity): the claim that weaker models cause the loop to amplify noise is acknowledged, yet no quantitative threshold, ablation across model sizes, or diagnostic metric for “competent semantic operator” is supplied. This leaves the central prerequisite untested and the scope of the method unclear.
minor comments (2)
- [Figure 1 and §3.1] Figure 1 and §3.1: the diagram of the iterative refinement loop would benefit from explicit annotation of which steps are stochastic versus deterministic and how the tool memory is updated.
- [Table 1] Table 1: column headers for the two benchmarks should include the exact number of tools/APIs and domains to allow immediate comparison with the abstract numbers.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity, reproducibility, and empirical rigor.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experimental Evaluation): the headline quantitative results (rank drop from 8.81 to 2.78 on ToolRet; 24-point pass-rate gain on StableToolBench) are presented without reported variance, number of runs, statistical significance tests, or precise re-implementations of the static-retrieval baselines. This absence prevents assessment of whether the observed deltas are robust or sensitive to implementation details.
Authors: We agree that additional statistical details are necessary for robust evaluation. In the revised manuscript we will report averages and standard deviations over 5 independent runs, include paired t-test p-values against the static baselines, and provide exact implementation details (including prompt templates and retrieval parameters) for the static query retrieval baselines. revision: yes
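The statistical protocol promised here (means over 5 independent runs plus paired t-tests against the static baseline) can be sketched in plain Python. The run scores below are hypothetical placeholders, not the paper's data:

```python
import math

def paired_t_statistic(method_scores, baseline_scores):
    """t statistic for paired samples: t = mean(d) / (s_d / sqrt(n))."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample standard deviation of the differences (Bessel's correction).
    s_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    return mean_d / (s_d / math.sqrt(n))

# Hypothetical pass rates over 5 runs (illustrative only).
fittext = [0.74, 0.72, 0.73, 0.71, 0.75]
static = [0.50, 0.49, 0.48, 0.51, 0.47]
t = paired_t_statistic(fittext, static)
# Compare |t| against the t-distribution with n - 1 = 4 degrees of
# freedom (e.g. scipy.stats.t.sf) to obtain the p-value.
```

With only 5 paired runs, the t statistic is the easy part; the degrees of freedom are small, so reporting the raw per-run scores alongside the p-value would be the more informative revision.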
-
Referee: [§3.2] §3.2 (Memetic Retrieval): the evolutionary selection mechanism is described at a high level, but no explicit fitness function, population size, or termination criterion is given; without these it is difficult to determine whether the reported gains arise from the claimed memetic dynamics or from other uncontrolled factors in the generation loop.
Authors: We acknowledge the description in §3.2 is conceptual. The revised section will explicitly state the fitness function (retrieval rank improvement combined with downstream task success), population size (8 candidates), and termination criterion (maximum 5 iterations or no rank improvement for 2 consecutive steps), accompanied by pseudocode. revision: yes
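Under the parameters stated in this response (population of 8, at most 5 iterations, early stop after 2 non-improving rounds), the memetic loop can be sketched as below. `mutate` and `fitness` stand in for the paper's LLM rewriting step and retrieval-based scoring, and the selection scheme (keep the top half, refill by mutating survivors) is an assumption for illustration, not the paper's exact procedure:

```python
import random

POPULATION_SIZE = 8   # candidate pseudo-tool descriptions per generation
MAX_ITERATIONS = 5    # hard cap on refinement rounds
PATIENCE = 2          # stop after this many non-improving rounds

def memetic_retrieval(seed_query, mutate, fitness):
    """Evolutionary refinement of candidate pseudo-tool descriptions.

    mutate(d) stochastically rewrites a description (an LLM call in the
    paper); fitness(d) scores it against the retriever. Both are
    placeholders here.
    """
    population = [mutate(seed_query) for _ in range(POPULATION_SIZE)]
    best, best_score = None, float("-inf")
    stale = 0
    for _ in range(MAX_ITERATIONS):
        scored = sorted(population, key=fitness, reverse=True)
        top, top_score = scored[0], fitness(scored[0])
        if top_score > best_score:
            best, best_score, stale = top, top_score, 0
        else:
            stale += 1
            if stale >= PATIENCE:
                break  # no rank improvement for PATIENCE rounds
        # Selection pressure: keep the top half, refill by mutation.
        survivors = scored[: POPULATION_SIZE // 2]
        population = survivors + [
            mutate(random.choice(survivors))
            for _ in range(POPULATION_SIZE - len(survivors))
        ]
    return best, best_score
```

A toy run can use numbers as stand-ins for descriptions, e.g. `memetic_retrieval(0.0, lambda d: d + random.uniform(-1, 1), lambda d: -abs(d - 3.0))`, which climbs toward the target value 3.0 under the same population and termination settings.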
-
Referee: [§5] §5 (Discussion of model capacity): the claim that weaker models cause the loop to amplify noise is acknowledged, yet no quantitative threshold, ablation across model sizes, or diagnostic metric for “competent semantic operator” is supplied. This leaves the central prerequisite untested and the scope of the method unclear.
Authors: We agree that the scope requires clearer empirical grounding. The revision will add an ablation across model sizes (7B, 13B, 70B) in §5, report a diagnostic metric based on semantic coherence of generated pseudo-descriptions, and identify a quantitative threshold (e.g., coherence score > 0.75) below which the method is not recommended. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical, training-free framework (FitText with Memetic Retrieval) whose central claims consist of measured performance gains on external public benchmarks (ToolRet rank improvement from 8.81 to 2.78; StableToolBench pass-rate gain of 24 points). No equations, first-principles derivations, or parameter-fitting steps are present that would reduce these results to quantities defined by the method itself. The evolutionary search and pseudo-description generation are presented as procedural mechanisms whose effectiveness is conditioned on base-model competence and validated externally; no self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided description.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can generate natural-language pseudo-tool descriptions that serve as effective retrieval probes when refined iteratively.
Lean theorems connected to this paper
-
Cost.FunctionalEquation (J(x) = ½(x + x⁻¹) − 1) · washburn_uniqueness_aczel · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: fitness function f(d_i) = s_ret(d_i, R_i) − 0.5 · (max Jaccard penalty), where s_ret = 0.7σ₁ + 0.3σ̄₃ combines the top-1 and mean top-3 cosine similarities.
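The fitness function stated above admits a minimal reading in code, assuming cosine similarity over precomputed embeddings and token-level Jaccard overlap between candidate descriptions (the embedding model and tokenization are assumptions, not given in the passage):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def jaccard(a, b):
    """Token-level Jaccard similarity between two description strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def fitness(d_emb, retrieved_embs, d_text, other_texts):
    """f(d_i) = s_ret - 0.5 * max Jaccard overlap with other candidates,
    where s_ret = 0.7 * sigma_1 + 0.3 * mean(top-3 sigmas)."""
    sims = sorted((cosine(d_emb, r) for r in retrieved_embs), reverse=True)
    s_ret = 0.7 * sims[0] + 0.3 * (sum(sims[:3]) / len(sims[:3]))
    # Diversity penalty: worst-case lexical overlap with sibling candidates.
    penalty = max((jaccard(d_text, t) for t in other_texts), default=0.0)
    return s_ret - 0.5 * penalty
```

The 0.7/0.3 split rewards a strong top-1 hit while still crediting a coherent top-3 neighborhood, and the Jaccard term pushes sibling candidates apart so the population explores rather than converging on near-duplicates.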
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.