How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

Ameya Gawde; Harshvardhan Singh; Joey Blackwell II; Vyzantinos Repantis

arXiv: 2605.24660 · v2 · pith:T7ED5ZEXnew · submitted 2026-05-23 · cs.IR · cs.AI · cs.LG

Pith reviewed 2026-06-30 12:19 UTC · model grok-4.3

keywords: tool selection · LLM agents · shortlist evaluation · chance-corrected metric · reinforcement learning · retrieval systems · adaptive depth

The pith

A chance-corrected metric shows adaptive tool shortlists outperform fixed sizes for LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Bits-over-Random (BoR), a metric that evaluates tool shortlist sizes by measuring whether success at a given depth exceeds what random selection would achieve at that same depth. It argues that fixed shortlists are suboptimal because they either omit the correct tool on hard queries or overwhelm the model with too many options on easy ones. Applying BoR as a reinforcement learning reward produces per-query adaptive depths that nearly match the coverage of large fixed lists while using far fewer tools on average. Downstream tests with an LLM confirm higher rates of correct tool selection, with the largest gains on medium-difficulty queries where the correct tool is present but not top-ranked. A reader would care because tool retrieval is the first step for any LLM agent that must invoke external functions.

Core claim

The paper claims that Bits-over-Random (BoR) supplies a standard metric for deciding tool shortlist depth, and that an RL policy trained on BoR rewards yields adaptive per-query depths. On BFCL, that policy's coverage nearly matches showing 50 tools while averaging only 7. On ToolBench, it locates tools on hard queries where a fixed list of 5 fails. The shorter adaptive lists also raise downstream LLM correct selection from 87.1% to 93.1% overall and from 60.9% to 76.8% on medium queries.

What carries the argument

The Bits-over-Random (BoR) metric, which quantifies success at a given shortlist depth relative to random chance at that depth and serves as both an evaluation standard and an RL reward.
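This summary does not reproduce the paper's formula for BoR. As a minimal sketch of the idea only, assume the score is the base-2 logarithm of the observed success rate at depth K divided by K/N, the chance that a uniformly random K-subset of N tools contains a single correct tool. The function names and this exact log-ratio form are assumptions made for illustration, not the authors' definition.

```python
import math


def random_hit_rate(k: int, n: int) -> float:
    """Chance that a uniformly random k-subset of n tools contains one fixed correct tool."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return k / n


def bits_over_random(success_rate: float, k: int, n: int) -> float:
    """Hypothetical form: log2 of observed success over chance success at the same depth."""
    if success_rate <= 0.0:
        # No hits at all: unboundedly worse than chance on a log scale.
        return float("-inf")
    return math.log2(success_rate / random_hit_rate(k, n))
```

Under this assumed form, a score of 0 means the shortlist does no better than random selection at that depth, and each additional bit means success is twice as likely as chance.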

If this is right

On BFCL with 370 tools, the adaptive policy reaches 90.3% coverage with an average of 7 tools versus 90.8% for a fixed list of 50.
On ToolBench with 3,251 tools, the adaptive policy finds tools on hard queries where a fixed list of 5 finds none.
Downstream LLM selection accuracy improves with adaptive lists, with the gap widening when the correct tool is not ranked first.
The reward decreases automatically as lists grow, removing the need for an engineered depth penalty.
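The last point can be illustrated with simple arithmetic. Assuming a hypothetical log-ratio score, log2(success rate / (K/N)), a shortlist that always contains the correct tool is worth at most log2(N/K) bits, and that ceiling falls as K grows. The registry size of 370 is the BFCL figure quoted above; the function name, the score form, and the chosen depths are assumptions for illustration.

```python
import math


def max_bits_for_hit(k: int, n: int) -> float:
    """Ceiling of the assumed log-ratio score when the shortlist always succeeds."""
    return math.log2(n / k)


N_TOOLS = 370  # BFCL registry size quoted above

# Illustrative depths only; the ceiling shrinks as the shortlist gets longer.
ceilings = {k: max_bits_for_hit(k, N_TOOLS) for k in (1, 5, 7, 20, 50)}
for k, bits in ceilings.items():
    print(f"K={k:>2}: a guaranteed hit is worth at most {bits:.2f} bits over random")
```

Because the ceiling itself drops with depth, a policy rewarded this way gains from going deeper only when the extra depth raises the hit rate faster than it raises the chance baseline.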

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

BoR could be tested as an evaluation standard for shortlist sizes in other retrieval settings such as document or API suggestion systems.
The adaptive depths learned on one benchmark might transfer to new registries if the correlation between BoR and downstream accuracy holds.
Larger-scale experiments could check whether optimal average depths scale predictably with registry size.

Load-bearing premise

The deliberately simple RL agent serves as a valid, unbiased probe of the BoR metric, and its policy-learning process introduces no confounding bias.

What would settle it

A head-to-head experiment in which fixed shortlist sizes achieve equal or higher downstream LLM tool selection accuracy than the BoR-adaptive policy on the same benchmarks would falsify the advantage of adaptive lists.

Figures

Figures reproduced from arXiv: 2605.24660 by Ameya Gawde, Harshvardhan Singh, Joey Blackwell II, Vyzantinos Repantis.

**Figure 1.** ToolBench difficulty buckets at N=50. Left: chosen K per difficulty bucket for BoR, the F1 baseline, and FK=5. Right: found rate per bucket. The BoR agent adapts K from 2.5 (easy) to 6.9 (very hard), while the F1 baseline stays near 1.5 and FK=5 stays fixed at 5.

**Figure 2.** Scorer ablation on MetaTool. Three scorers (BM25, MiniLM, BGE) on the same data.

Original abstract

Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose. Show too few and the correct tool may not appear. Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. We treat the number of tools shown to an LLM agent as the object of evaluation and we apply Bits-over-Random (BoR), a chance-corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth. We evaluate BoR across three tool-selection benchmarks, multiple scorers, and registries ranging from 20 to 3,251 tools. We then turn the same principle into a reinforcement learning (RL) reward for choosing tool shortlist depth per query. The RL agent is deliberately simple, serving as a probe of the metric rather than a proposed system. As the shortlist grows, random chance of including the correct tool rises, so the reward naturally decreases, reducing the need for an engineered depth penalty. On BFCL (370 tools), the learned policy nearly matches the coverage of showing 50 tools (90.3% vs 90.8%) while presenting only 7 on average. On ToolBench (3,251 tools), a fixed shortlist of 5 tools achieves higher aggregate coverage (64.7% vs 61.9%) but finds nothing on hard queries (correct tool ranked 6th-20th). The BoR agent finds 16.7% on those same queries by searching deeper. Downstream validation with Claude Sonnet 4.6 indicates that shorter adaptive lists also improve the LLM's ability to select the right tool: 93.1% versus 87.1% when always shown 5 tools, widening to 76.8% vs 60.9% on medium-difficulty queries where the correct tool is present but not ranked first.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BoR gives a clean chance-corrected way to score tool shortlist depth and the adaptive RL probe delivers measurable gains in coverage and LLM selection accuracy, though the validation loop needs checking for dependence on the metric itself.

BoR is the main new piece: a chance-corrected metric that asks how much better than random a shortlist is at including the right tool at that exact depth. They turn it into an RL reward so the policy naturally avoids extra depth as random inclusion probability rises. On BFCL the learned depths hit 90.3% coverage with average length 7 versus fixed-50 at 90.8%. On ToolBench the adaptive policy pulls in 16.7% of the hard queries that fixed-5 misses entirely. The Claude Sonnet downstream test is the clearest result, lifting tool selection from 87.1% to 93.1% overall and from 60.9% to 76.8% on medium queries.

The experiments cover multiple registries and scorers, which is useful. The framing of the RL agent as a simple probe rather than a system is honest and keeps the focus on the metric.

The soft spot is the one the stress-test note flags. Because depth is chosen by directly maximizing BoR on the same benchmarks, the reported accuracy lift could partly be an artifact of that optimization rather than independent confirmation that BoR is the right signal. The probe language helps, but it still leaves the validation tied to the metric under test. The abstract also omits the exact BoR formula and any error bars, so the derivation and variability would need to be checked in the full text.

This is aimed at people working on tool retrieval for agents. The central idea is coherent and the empirical pattern is worth referee time, so it should go to review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper proposes Bits-over-Random (BoR), a chance-corrected metric for evaluating whether a given tool shortlist depth is appropriate for an LLM agent. It evaluates BoR across three benchmarks with registries of 20–3,251 tools, then uses BoR-derived rewards in a deliberately simple RL agent (positioned as a probe, not a proposed system) to select adaptive per-query depths. The adaptive policy nearly matches fixed-50 coverage (90.3% vs 90.8%) at average depth 7 on BFCL, retrieves 16.7% on hard ToolBench queries where fixed-5 fails, and yields downstream gains with Claude Sonnet 4.6 (93.1% vs 87.1% tool-selection accuracy overall; 76.8% vs 60.9% on medium-difficulty queries).

Significance. If the results hold, the work supplies a principled, chance-corrected alternative to fixed shortlist sizes in tool retrieval for agents. The explicit probe framing of the RL agent and the inclusion of downstream LLM validation experiments are strengths. The approach could encourage adaptive retrieval strategies and improve efficiency-accuracy trade-offs in agent systems.

major comments (3)

Abstract and methods: the abstract presents benchmark results and downstream validation but provides no exact definition or derivation of BoR, no error bars, and no full experimental protocol; without these the central claims cannot be verified or reproduced.
RL probe section: the claim that the RL agent serves as an independent probe of BoR is load-bearing for the validation, yet the reward is defined directly from BoR and the policy is learned to maximize it on the same benchmarks; the reported accuracy gains (93.1% vs 87.1%; 76.8% vs 60.9%) may therefore be partly artifacts of the optimization rather than independent evidence for the metric.
Results on BFCL/ToolBench: coverage numbers (e.g., 90.3% vs 90.8% on BFCL; 16.7% on hard queries) and accuracy figures lack error bars, confidence intervals, or statistical tests, weakening the reliability of the comparisons that support the adaptive-depth claim.
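The third comment asks for confidence intervals. One common way to attach an interval to a coverage or accuracy figure is a percentile bootstrap over per-query 0/1 outcomes. The sketch below is a generic illustration rather than the paper's protocol; the function name `bootstrap_ci`, the resample count, and the toy outcome vector are all assumptions.

```python
import random


def bootstrap_ci(hits: list[int], n_resamples: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 per-query outcomes."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(hits)
    # Resample queries with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data (hypothetical): 100 queries, correct tool found on 90 of them.
toy_hits = [1] * 90 + [0] * 10
low, high = bootstrap_ci(toy_hits)
print(f"coverage 0.90, 95% bootstrap interval [{low:.2f}, {high:.2f}]")
```

Comparing two policies on the same queries would call for resampling the paired outcomes together, so that the interval reflects the difference rather than each rate separately.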

minor comments (2)

The BoR formula should be stated explicitly as an equation in the main text to allow readers to confirm the chance-correction property.
Figure and table captions should report the number of runs, random seeds, and any hyper-parameter settings used for the RL experiments.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating planned revisions where appropriate.

Referee: Abstract and methods: the abstract presents benchmark results and downstream validation but provides no exact definition or derivation of BoR, no error bars, and no full experimental protocol; without these the central claims cannot be verified or reproduced.

Authors: The provided abstract does contain a concise definition of BoR ('a chance-corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth'), but we agree it lacks a formal statement or pointer to the derivation. We will revise the abstract to include a brief mathematical definition of BoR and a reference to its derivation in Section 3. The full experimental protocol, including benchmark details, scorers, registries, and evaluation procedures, is already described in Sections 4 and 5; we will ensure these sections are explicitly cross-referenced from the abstract and methods. For error bars, we will add them in the revision for all reported metrics.
revision: yes
Referee: RL probe section: the claim that the RL agent serves as an independent probe of BoR is load-bearing for the validation, yet the reward is defined directly from BoR and the policy is learned to maximize it on the same benchmarks; the reported accuracy gains (93.1% vs 87.1%; 76.8% vs 60.9%) may therefore be partly artifacts of the optimization rather than independent evidence for the metric.

Authors: This is a fair critique. The RL agent is positioned as a simple probe whose reward is BoR-derived, so the downstream gains with Claude Sonnet necessarily reflect optimization toward the metric rather than fully independent corroboration. The use of an unseen LLM provides partial separation, but we will revise the text to explicitly acknowledge this limitation, reframe the experiment as demonstrating the practical utility of BoR for adaptive selection (rather than independent validation), and avoid any implication of full independence.
revision: partial
Referee: Results on BFCL/ToolBench: coverage numbers (e.g., 90.3% vs 90.8% on BFCL; 16.7% on hard queries) and accuracy figures lack error bars, confidence intervals, or statistical tests, weakening the reliability of the comparisons that support the adaptive-depth claim.

Authors: We agree that the absence of error bars, confidence intervals, and statistical tests weakens the presented comparisons. In the revised manuscript we will add error bars (standard deviation across multiple random seeds or bootstrap resampling) to all coverage and accuracy figures and include appropriate statistical tests (e.g., McNemar's test or paired t-tests) for the fixed-vs-adaptive comparisons on BFCL and ToolBench.
revision: yes
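The rebuttal names McNemar's test as one option for the paired fixed-versus-adaptive comparison. A minimal sketch of the exact (binomial) version follows, using only the discordant pairs, that is, queries where exactly one of the two shortlists led to a correct pick. The function name and the example counts are hypothetical; the paper's per-query outcomes are not available in this summary.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: queries where only the fixed shortlist led to a correct pick.
    c: queries where only the adaptive shortlist led to a correct pick.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null, each discordant pair falls on either side with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts for illustration only.
p_value = mcnemar_exact(b=4, c=15)
print(f"exact McNemar p-value: {p_value:.4f}")
```

The test ignores queries where both shortlists succeed or both fail, which suits this setting because the two policies are evaluated on the same queries.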

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines BoR explicitly against an external random baseline at each depth and applies it for both static evaluation and as an RL reward. The RL component is framed as a deliberately simple probe of the metric, not a proposed system, and the central downstream claims (e.g., 93.1% vs 87.1% tool-selection accuracy) are measured on an independent external LLM (Claude Sonnet 4.6) using separate accuracy metrics on the same benchmarks. No equations reduce a reported result to a fitted parameter or self-citation by construction; no self-citations appear load-bearing; and no ansatz or uniqueness claim is imported from prior author work. The derivation chain is self-contained against external benchmarks and random baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, invented entities, or non-standard axioms are stated. The core domain assumption is that success can be compared to random chance at fixed depth.

axioms (1)

domain assumption: Success rate at a given shortlist depth can be meaningfully compared against the probability of random inclusion at that same depth.
This is the defining premise of the Bits-over-Random metric.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents
cs.AI 2026-06 unverdicted novelty 7.0

Attention analysis shows that LLM tool selection failures occur at the readout/decision stage, not because the model fails to attend to the correct tool definition.

ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents
cs.AI 2026-06 unverdicted novelty 6.0

CMTF is a causal filtering method that reduces tool exposure to one per step while matching baseline success rates and cutting token use by ~90% in 102-task benchmarks with 100 tools.


[1] [1]

Where to Stop Reading a Ranked List?: Threshold Optimization Using Truncated Score Distributions

Avi Arampatzis, Jaap Kamps, and Stephen Robertson. Where to Stop Reading a Ranked List?: Threshold Optimization Using Truncated Score Distributions. InProceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 524–531, 2009

2009

[2] [2]

Self-RAG: Learn- ing to Retrieve, Generate, and Critique through Self-Reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learn- ing to Retrieve, Generate, and Critique through Self-Reflection. InThe Twelfth International Conference on Learning Representations, 2024

2024

[3] [3]

Choppy: Cut Trans- former for Ranked List Truncation

Dara Bahri, Yi Tay, Che Zheng, Donald Metzler, and Andrew Tomkins. Choppy: Cut Trans- former for Ranked List Truncation. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1721–1724, 2020

2020

[4] [4]

SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction.arXiv preprint arXiv:2503.01478, 2025

Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, and Hui Xiong. SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction.arXiv preprint arXiv:2503.01478, 2025

work page arXiv 2025

[5] [5]

RAG-MCP: Mitigating prompt bloat in LLM tool selection via retrieval-augmented generation.arXiv preprint arXiv:2505.03275,

Tiantian Gan and Qiyao Sun. RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation.arXiv preprint arXiv:2505.03275, 2025

work page arXiv 2025

[6] [6]

Smar- tRAG: Jointly Learn RAG-Related Tasks From the Environment Feedback

Jingsheng Gao, Linxu Li, Ke Ji, Weiyuan Li, Yixin Lian, Yuzhuo Fu, and Bin Dai. Smar- tRAG: Jointly Learn RAG-Related Tasks From the Environment Feedback. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[7] [7]

StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 11143–11156, 2024

[8]

Niu Huang, Brian K. Shoichet, and John J. Irwin. Benchmarking Sets for Molecular Docking. Journal of Medicinal Chemistry, 49(23):6789–6801, 2006

[9]

Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, and Lichao Sun. MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use. In The Twelfth International Conference on Learning Representations, 2024

[10]

Mohammed Iratni, Mohand Boughanem, and Taoufiq Dkaki. Dynamic Context Selection for Retrieval-Augmented Generation: Mitigating Distractors and Positional Bias. arXiv preprint arXiv:2512.14313, 2025

[11]

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 7036–7050, 2024

[12]

Yucheng Ji, Rui Meng, Zhiliang Li, and Daqing He. Curriculum Guided Reinforcement Learning for Efficient Multi-Hop Retrieval-Augmented Generation. arXiv preprint arXiv:2505.17391, 2025

[13]

Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active Retrieval Augmented Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969–7992, 2023

[14]

Kiran Kate, Tejaswini Pedapati, Kinjal Basu, Yara Rizk, Vijil Chenthamarakshan, Subhajit Chaudhury, Mayank Agarwal, and Ibrahim Abdelaziz. LongFuncEval: Measuring the Effectiveness of Long Context Models for Function Calling. arXiv preprint arXiv:2505.10570, 2025

[15]

Bernhard Kratzwald and Stefan Feuerriegel. Adaptive Document Retrieval for Deep Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 576–581, 2018

[16]

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, 2023

[17]

Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, and Dacheng Tao. InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling. In Advances in Neural Information Processing Systems, volume 37, pages 134387–134429, 2024

[18]

Jiarui Lu, Thomas Holleis, et al. ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1160–1183, 2025

[19]

Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, and James A. Burke. MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations. arXiv preprint arXiv:2507.21428, 2025

[20]

Chuan Meng, Negar Arabzadeh, Arian Askari, Mohammad Aliannejadi, and Maarten de Rijke. Ranked List Truncation for Large Language Model-based Re-Ranking. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024

[21]

Shuai Meng, Yue Liu, Ding Wang, et al. From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation. arXiv preprint arXiv:2508.09497, 2025

[22]

Varatheepan Paramanayakam, Andreas Karatzas, Iraklis Anagnostopoulos, and Dimitrios Stamoulis. Less is More: Optimizing Function Calling for LLM Execution on Edge Devices. arXiv preprint arXiv:2411.15399, 2025

[23]

Bhavin Patel, Davide Belli, Amir Jalalirad, Michael Arnold, Artem Ermovol, and Brendan Major. Dynamic Tool Dependency Retrieval for Efficient Function Calling. arXiv preprint arXiv:2512.17052, 2025

[24]

Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models. In Proceedings of the 42nd International Conference on Machine Learning. PMLR, 2025

[25]

Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. ToolRL: Reward is All Tool Learning Needs. arXiv preprint arXiv:2504.13958, 2025

[26]

Hongjin Qian and Zheng Liu. Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging. arXiv preprint arXiv:2505.09316, 2025

[27]

Yujia Qin et al. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. In The Twelfth International Conference on Learning Representations, 2024

[28]

Ella Rabinovich and Ateret Anaby-Tavor. On the Robustness of Agentic Function Calling. In Proceedings of the 5th Workshop on Trustworthy Natural Language Processing (TrustNLP), pages 298–304, 2025

[29]

Wenlin Zhang, Xiangyang Li, Kuicai Dong, Yichao Wang, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Derong Xu, Zhaocheng Du, Huifeng Guo, Ruiming Tang, and Xiangyu Zhao. Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning. arXiv preprint arXiv:2505.14069, 2025

[30]

Vyzantinos Repantis, Harshvardhan Singh, Tony Joseph, Cien Zhang, Akash Vishwakarma, Svetlana Karslioglu, Michael Wyatt Thot, and Ameya Gawde. The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection. In ICLR Blogposts 2026, 2026

[31]

Zhengliang Shi, Yue Wang, Langlin Yan, Peiyu Ren, Shuo Wang, Dawei Yin, and Zhaochun Ren. Retrieval Models Aren’t Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, 2025

[32]

Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. ARTIST: Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning. arXiv preprint arXiv:2505.01441, 2025

[33]

Zhipeng Song, Yizhi Zhou, Xiangyu Kong, Jiulong Jiao, Xinrui Bao, Xu You, Xueqing Shi, Yuhang Zhou, and Heng Qi. Less is More for RAG: Information Gain Pruning for Generator-Aligned Reranking and Evidence Selection. arXiv preprint arXiv:2601.17532, 2026

[34]

Jiashuo Sun, Xianrui Zhong, Sizhe Zhou, and Jiawei Han. DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation. In Advances in Neural Information Processing Systems, 2025

[35]

Chihiro Taguchi, Seiya Maekawa, and Nikita Bhutani. Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-k. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

[36]

Jean-François Truchon and Christopher I. Bayly. Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. Journal of Chemical Information and Modeling, 47(2):488–508, 2007

[37]

Zhao Wang, Ziliang Zhao, and Zhicheng Dou. ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation. arXiv preprint arXiv:2601.21912, 2026

[38]

Yifan Xu, Vipul Gupta, Rohit Aggarwal, Varsha Mahadevan, and Bhaskar Krishnamachari. Cluster-based Adaptive Retrieval: Dynamic Context Selection for RAG Applications. arXiv preprint arXiv:2511.14769, 2025

[39]

Shunyu Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045, 2024

[40]

Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert, Igor Melnyk, Senthil Kumar, and C. Bayan Bruss. MIGRASCOPE: Revisiting RAG Retrievers: An Information Theoretic Benchmark. arXiv preprint arXiv:2602.21553, 2026

[41]

Yuanhang Zheng, Peng Li, Wei Liu, Yang Liu, Jian Luan, and Bin Wang. ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16263–16273, 2024

A Retrieval Validation

We validate BoR on three d...