FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
Pith reviewed 2026-05-08 18:59 UTC · model grok-4.3
The pith
Embedding evolutionary retrieval inside an agent's reasoning loop allows dynamic adaptation to evolving task needs in large tool ecosystems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FitText is a training-free framework that integrates dynamic retrieval into the agent's reasoning loop: it generates pseudo-tool descriptions, refines them iteratively using retrieval feedback, and applies evolutionary selection through Memetic Retrieval guided by a tool memory. It achieves an average retrieval rank of 2.78 on ToolRet (43k tools) and a 0.73 pass rate on StableToolBench (16k APIs), a 24-point gain over static retrieval.
What carries the argument
Memetic Retrieval: evolutionary selection over candidate pseudo-tool descriptions guided by tool memory to apply selection pressure without redundant search.
Load-bearing premise
The base model must be capable of acting as a competent semantic operator for generating and refining pseudo-tool descriptions.
What would settle it
Running the method with a weaker base model on the same ToolRet and StableToolBench datasets and observing if retrieval ranks worsen and task pass rates fall below the static baseline.
Original abstract
A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText generates natural-language pseudo-tool descriptions as retrieval probes, refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (43k tools, 4 domains), FitText improves average retrieval rank from 8.81 to 2.78; on StableToolBench (16,464 APIs), it achieves a 0.73 average pass rate--a 24-point absolute gain over static query retrieval. The gains transfer across base models capable of acting as competent semantic operators; under weaker base models, Memetic's evolutionary search inverts--amplifying noise rather than refining signal--surfacing model capacity as a prerequisite for evolutionary tool exploration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FitText, a training-free framework that embeds dynamic retrieval into an agent's reasoning loop. It generates natural-language pseudo-tool descriptions as retrieval probes, refines them iteratively via retrieval feedback and stochastic generation, and applies memetic retrieval with evolutionary selection pressure over candidates, guided by a tool memory to avoid redundancy. On ToolRet (43k tools across 4 domains) it reports improving average retrieval rank from 8.81 to 2.78; on StableToolBench (16,464 APIs) it reports a 0.73 average pass rate, a 24-point absolute gain over static query retrieval. Gains are stated to transfer across sufficiently capable base models but invert under weaker models that amplify noise.
Significance. If the empirical claims are reproducible, FitText would provide a practical, training-free route to adaptive tool use in large-scale API ecosystems, directly addressing the semantic gap between evolving agent intent and static tool documentation. The explicit conditioning on base-model semantic competence is a strength that clarifies scope.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experimental Evaluation): the headline quantitative results (rank drop from 8.81 to 2.78 on ToolRet; 24-point pass-rate gain on StableToolBench) are presented without reported variance, number of runs, statistical significance tests, or precise re-implementations of the static-retrieval baselines. This absence prevents assessment of whether the observed deltas are robust or sensitive to implementation details.
- [§3.2] §3.2 (Memetic Retrieval): the evolutionary selection mechanism is described at a high level, but no explicit fitness function, population size, or termination criterion is given; without these it is difficult to determine whether the reported gains arise from the claimed memetic dynamics or from other uncontrolled factors in the generation loop.
- [§5] §5 (Discussion of model capacity): the claim that weaker models cause the loop to amplify noise is acknowledged, yet no quantitative threshold, ablation across model sizes, or diagnostic metric for “competent semantic operator” is supplied. This leaves the central prerequisite untested and the scope of the method unclear.
minor comments (2)
- [Figure 1 and §3.1] Figure 1 and §3.1: the diagram of the iterative refinement loop would benefit from explicit annotation of which steps are stochastic versus deterministic and how the tool memory is updated.
- [Table 1] Table 1: column headers for the two benchmarks should include the exact number of tools/APIs and domains to allow immediate comparison with the abstract numbers.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity, reproducibility, and empirical rigor.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experimental Evaluation): the headline quantitative results (rank drop from 8.81 to 2.78 on ToolRet; 24-point pass-rate gain on StableToolBench) are presented without reported variance, number of runs, statistical significance tests, or precise re-implementations of the static-retrieval baselines. This absence prevents assessment of whether the observed deltas are robust or sensitive to implementation details.
Authors: We agree that additional statistical details are necessary for robust evaluation. In the revised manuscript we will report averages and standard deviations over 5 independent runs, include paired t-test p-values against the static baselines, and provide exact implementation details (including prompt templates and retrieval parameters) for the static query retrieval baselines. revision: yes
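The statistical protocol promised here (means over 5 independent runs plus paired t-tests against the static baseline) can be sketched in plain Python. The run scores below are hypothetical placeholders, not the paper's data:

```python
import math

def paired_t_statistic(method_scores, baseline_scores):
    """t statistic for paired samples: t = mean(d) / (s_d / sqrt(n))."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample standard deviation of the differences (Bessel's correction).
    s_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    return mean_d / (s_d / math.sqrt(n))

# Hypothetical pass rates over 5 runs (illustrative only).
fittext = [0.74, 0.72, 0.73, 0.71, 0.75]
static = [0.50, 0.49, 0.48, 0.51, 0.47]
t = paired_t_statistic(fittext, static)
# Compare |t| against the t-distribution with n - 1 = 4 degrees of
# freedom (e.g. scipy.stats.t.sf) to obtain the p-value.
```

With only 5 paired runs, the t statistic is the easy part; the degrees of freedom are small, so reporting the raw per-run scores alongside the p-value would be the more informative revision.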
-
Referee: [§3.2] §3.2 (Memetic Retrieval): the evolutionary selection mechanism is described at a high level, but no explicit fitness function, population size, or termination criterion is given; without these it is difficult to determine whether the reported gains arise from the claimed memetic dynamics or from other uncontrolled factors in the generation loop.
Authors: We acknowledge the description in §3.2 is conceptual. The revised section will explicitly state the fitness function (retrieval rank improvement combined with downstream task success), population size (8 candidates), and termination criterion (maximum 5 iterations or no rank improvement for 2 consecutive steps), accompanied by pseudocode. revision: yes
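Under the parameters stated in this response (population of 8, at most 5 iterations, early stop after 2 non-improving rounds), the memetic loop can be sketched as below. `mutate` and `fitness` stand in for the paper's LLM rewriting step and retrieval-based scoring, and the selection scheme (keep the top half, refill by mutating survivors) is an assumption for illustration, not the paper's exact procedure:

```python
import random

POPULATION_SIZE = 8   # candidate pseudo-tool descriptions per generation
MAX_ITERATIONS = 5    # hard cap on refinement rounds
PATIENCE = 2          # stop after this many non-improving rounds

def memetic_retrieval(seed_query, mutate, fitness):
    """Evolutionary refinement of candidate pseudo-tool descriptions.

    mutate(d) stochastically rewrites a description (an LLM call in the
    paper); fitness(d) scores it against the retriever. Both are
    placeholders here.
    """
    population = [mutate(seed_query) for _ in range(POPULATION_SIZE)]
    best, best_score = None, float("-inf")
    stale = 0
    for _ in range(MAX_ITERATIONS):
        scored = sorted(population, key=fitness, reverse=True)
        top, top_score = scored[0], fitness(scored[0])
        if top_score > best_score:
            best, best_score, stale = top, top_score, 0
        else:
            stale += 1
            if stale >= PATIENCE:
                break  # no rank improvement for PATIENCE rounds
        # Selection pressure: keep the top half, refill by mutation.
        survivors = scored[: POPULATION_SIZE // 2]
        population = survivors + [
            mutate(random.choice(survivors))
            for _ in range(POPULATION_SIZE - len(survivors))
        ]
    return best, best_score
```

A toy run can use numbers as stand-ins for descriptions, e.g. `memetic_retrieval(0.0, lambda d: d + random.uniform(-1, 1), lambda d: -abs(d - 3.0))`, which climbs toward the target value 3.0 under the same population and termination settings.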
-
Referee: [§5] §5 (Discussion of model capacity): the claim that weaker models cause the loop to amplify noise is acknowledged, yet no quantitative threshold, ablation across model sizes, or diagnostic metric for “competent semantic operator” is supplied. This leaves the central prerequisite untested and the scope of the method unclear.
Authors: We agree that the scope requires clearer empirical grounding. The revision will add an ablation across model sizes (7B, 13B, 70B) in §5, report a diagnostic metric based on semantic coherence of generated pseudo-descriptions, and identify a quantitative threshold (e.g., coherence score > 0.75) below which the method is not recommended. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical, training-free framework (FitText with Memetic Retrieval) whose central claims consist of measured performance gains on external public benchmarks (ToolRet rank improvement from 8.81 to 2.78; StableToolBench pass-rate gain of 24 points). No equations, first-principles derivations, or parameter-fitting steps are present that would reduce these results to quantities defined by the method itself. The evolutionary search and pseudo-description generation are presented as procedural mechanisms whose effectiveness is conditioned on base-model competence and validated externally; no self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided description.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can generate natural-language pseudo-tool descriptions that serve as effective retrieval probes when refined iteratively.
Lean theorems connected to this paper
-
Cost.FunctionalEquation (J(x) = ½(x + x⁻¹) − 1) · washburn_uniqueness_aczel · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: fitness function f(d_i) = s_ret(d_i, R_i) − 0.5 · (max Jaccard penalty), where s_ret = 0.7σ₁ + 0.3σ̄₃ combines the top-1 and mean top-3 cosine similarities.
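The fitness function stated above admits a minimal reading in code, assuming cosine similarity over precomputed embeddings and token-level Jaccard overlap between candidate descriptions (the embedding model and tokenization are assumptions, not given in the passage):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def jaccard(a, b):
    """Token-level Jaccard similarity between two description strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def fitness(d_emb, retrieved_embs, d_text, other_texts):
    """f(d_i) = s_ret - 0.5 * max Jaccard overlap with other candidates,
    where s_ret = 0.7 * sigma_1 + 0.3 * mean(top-3 sigmas)."""
    sims = sorted((cosine(d_emb, r) for r in retrieved_embs), reverse=True)
    s_ret = 0.7 * sims[0] + 0.3 * (sum(sims[:3]) / len(sims[:3]))
    # Diversity penalty: worst-case lexical overlap with sibling candidates.
    penalty = max((jaccard(d_text, t) for t in other_texts), default=0.0)
    return s_ret - 0.5 * penalty
```

The 0.7/0.3 split rewards a strong top-1 hit while still crediting a coherent top-3 neighborhood, and the Jaccard term pushes sibling candidates apart so the population explores rather than converging on near-duplicates.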
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.