pith. machine review for the scientific record. sign in

arxiv: 2605.00737 · v1 · submitted 2026-05-01 · 💻 cs.AI

Recognition: unknown

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:29 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM tool calling · decision framework · hidden states · necessity and utility · agentic systems · web search tools · model controllers
0 comments

The pith

LLMs often misjudge when calling tools like web search is truly necessary or useful, but estimators built from their internal hidden states can make better calls and raise task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a decision-theoretic framework that separates three factors in tool-use choices: necessity, utility, and affordability. It contrasts a normative view, which derives true need and value from an optimal allocation of calls, with a descriptive view drawn from the model's own observed behavior. Across experiments the authors show frequent misalignment between these two views. They then extract lightweight predictors of need and utility directly from the model's hidden states and use those predictors to build simple controllers. These controllers outperform the model's unaided self-perception across three tasks and six models.

Core claim

The central claim is that an LLM's self-perceived need and utility for tool calls frequently diverge from the true need and utility that would be obtained by optimally allocating calls; a framework that measures both perspectives reveals this gap, and estimators trained on hidden-state activations can produce controllers that close the gap and improve end-task results.

What carries the argument

A three-factor decision framework (necessity, utility, affordability) that contrasts a normative lens (optimal allocation as ground truth) with a descriptive lens (model's observed behavior), plus lightweight estimators of need and utility extracted from hidden states to drive controllers.
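The normative lens above can be made concrete with a small sketch of how "true" need and utility labels might be derived from paired NO TOOL vs. ALWAYS TOOL evaluations. The function names, the 0.9 threshold, and the scoring convention are illustrative assumptions, not the authors' exact procedure:

```python
# Hypothetical sketch: deriving normative "true need" and "true utility"
# labels from paired outcome scores, in the spirit of the paper's framework.
# Threshold and names are illustrative, not the paper's exact procedure.

def true_need(score_no_tool: float, threshold: float = 0.9) -> bool:
    """An instance 'truly needs' the tool if the model fails without it."""
    return score_no_tool < threshold

def true_utility(score_no_tool: float, score_always_tool: float) -> int:
    """Sign of the performance change when the tool call is forced.
    +1: tool helps, 0: neutral, -1: tool hurts (e.g., noisy retrieval)."""
    delta = score_always_tool - score_no_tool
    if delta > 0:
        return 1
    if delta < 0:
        return -1
    return 0

# A query the model answers poorly alone but well with search:
assert true_utility(0.1, 0.9) == 1
# A query the model already knows; forced search injects noise:
assert true_utility(1.0, 0.6) == -1
```

Note that need and utility come apart by construction: a model may truly need help (low no-tool score) and still get negative utility if the tool response is noisy.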

If this is right

  • Task performance rises when tool-call decisions are routed through the hidden-state estimators rather than left to the model's own judgment.
  • The same lightweight estimators can be attached to multiple base models without retraining the base model itself.
  • The framework supplies concrete metrics that let researchers quantify how often a given model over- or under-calls tools.
  • Simple rule-based or threshold-based controllers suffice once the estimators are trained.
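A threshold controller of this kind can be sketched in a few lines. The logistic-regression probe, the synthetic stand-in activations, and the 0.5 threshold are assumptions for illustration; they are not the paper's LNE/LUE estimators:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch of a hidden-state probe plus threshold controller.
# The probe design, layer choice, and threshold are illustrative assumptions.

rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 64))      # stand-in for hidden-state activations
labels = (hidden[:, 0] > 0).astype(int)  # stand-in for true-need labels

# Lightweight linear probe trained on the (synthetic) labels.
probe = LogisticRegression(max_iter=1000).fit(hidden, labels)

def controller(h: np.ndarray, threshold: float = 0.5) -> bool:
    """Call the tool iff the probe's predicted need exceeds the threshold."""
    return probe.predict_proba(h.reshape(1, -1))[0, 1] > threshold

print(controller(hidden[0]))
```

The base model is never retrained: the probe reads activations and the controller overrides only the call/no-call decision.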

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could be tested on tool sets other than web search to check whether hidden-state signals remain informative when the external tool is less noisy.
  • If the estimators prove stable across tasks, they might allow a single lightweight controller to serve many different agent workflows without task-specific retraining.
  • The misalignment finding raises the possibility that current agent scaffolds over-rely on the LLM's verbal self-assessment of its own knowledge gaps.

Load-bearing premise

True need and utility for tool calls can be inferred reliably enough from an optimal allocation of calls to serve as training labels that generalize across tasks.
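Under that premise, the optimal allocation reduces to a simple oracle: spend the call budget on the instances with the largest positive utility gains. A hypothetical sketch (variable names are illustrative, not the paper's notation):

```python
# Hypothetical sketch of the normative "optimal allocation": given each
# instance's utility gain (always-tool score minus no-tool score) and a
# call budget k, an oracle calls the tool on the k instances with the
# largest strictly positive gains.

def optimal_allocation(gains: list[float], budget: int) -> list[int]:
    """Indices of instances that receive a tool call under the budget."""
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    return [i for i in ranked[:budget] if gains[i] > 0]

gains = [0.4, -0.2, 0.1, 0.0, 0.3]
print(optimal_allocation(gains, budget=2))  # -> [0, 4]
```

Instances selected by this oracle would be labeled "truly useful"; the premise is that such labels transfer to held-out tasks.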

What would settle it

On held-out tasks or new models, controllers built from the hidden-state estimators produce no measurable gain in task accuracy or decision quality compared with the model's unaided self-perception.

Figures

Figures reproduced from arXiv: 2605.00737 by Abhilasha Ravichander, Arijit Nag, Krishna P. Gummadi, Mahsa Amani, Muhammad Bilal Zafar, Qinyuan Wu, Seungeon Lee, Soumi Das.

Figure 1
Figure 1: Given input x, the model M decides π(x) ∈ {0, 1} to call a tool (response r) or not, producing y = M(x, r) or y = M(x). We compare NO TOOL, ALWAYS TOOL, and SELF-DECISION, and evaluate decisions via need (requires help), utility (performance gain), and affordability (cost vs. gain), distinguishing perceived vs. true quantities. view at source ↗
Figure 2
Figure 2: True need and true positive utility are correlated, but not perfectly aligned. Rows are grouped by the model's (GPT-OSS-120B) factuality scores under NO TOOL (parametric knowledge), while columns show scores under ALWAYS TOOL. Scores are bucketed into low (0–0.1), mid (0.1–0.9), and high (0.9–1). Cells above the diagonal indicate positive utility, while those below indicate negative utility. The bracket … view at source ↗
Figure 3
Figure 3: Perceived need is only partially aligned with tool call (perceived utility). The x-axis shows the model's perceived need, and the y-axis shows perceived utility / tool-call decisions. view at source ↗
Figure 5
Figure 5: The perceived need and utility are not aligned with the true need and utility. Entity Task; Top: true vs. perceived need. Bottom: true vs. perceived utility across models. Most models are self-consistent: perceived utility (tool calling) correlates with perceived need, with two models showing substantial deviations. In the absence of tool descriptions, model behavior is driven solely by perceived need, i.e.… view at source ↗
Figure 4
Figure 4: Perceived signals only partially align with true utility. Venn diagrams of True Positive Utility, Perceived Need, and Perceived Utility for GPT-OSS-120B on the entity task. Ideally, Perceived Utility ⊆ Perceived Need ⊆ True Positive Utility, but this nesting is violated, explaining the suboptimal performance of SELF-DECISION. Even as models' perceptions are self-consistent, their perceptions do not alig… view at source ↗
Figure 6
Figure 6: Utility gain over NO TOOL under varying cost constraints. Solid lines show optimal allocation (optimal top-k), dashed lines show model performance with cost information, squares denote no cost-awareness, and dotted lines indicate always-calling. Cost information has uneven effects. Without cost descriptions, most models (except Mistral and Llama) achieve slightly higher utility gains than in the cos… view at source ↗
Figure 7
Figure 7: LNE improves true-need prediction across models, with larger gains for smaller models. Confusion matrices appear in … view at source ↗
Figure 8
Figure 8: Distribution of the number of Google search results across all entities in the entity dataset. view at source ↗
Figure 9
Figure 9: Entity Task: factuality distribution across different models. Example response: "The term 'Sky Blue' primarily refers to Coventry City Football Club, an English professional football team based in Coventry, which plays in the EFL Championship. The club, originally founded in 1883 as Singers F.C., adopted the nickname 'The Sky Blues' due to their distinctive sky blue kits. They joined the Football League in 1898 and ha… view at source ↗
Figure 10
Figure 10: Entity task: No-Tool vs. Force-Tool performance. Rows group entities by the model's factuality score without a tool (reflecting parametric knowledge), while columns group scores when tool use is forced. Each cell reports the count and the column percentage. Off-diagonal cells indicate performance shifts due to tool use: cells above the diagonal show cases where the tool has positive utility, while cells b… view at source ↗
Figure 11
Figure 11: [Entity Task] Venn diagrams of three sets: … view at source ↗
Figure 12
Figure 12: Entity Task: perceived need is only partially aligned with tool use. The x-axis shows perceived utility (number of entities predicted to need or not need external information), and the y-axis shows actual tool-use decisions. Percentages indicate how often the model follows its own prediction (call vs. not call). Results are shown for two prompt variants (v1 and v2). Some responses are excluded due to pars… view at source ↗
Figure 13
Figure 13: [Entity Task] Actual tool calls without budget enforcement. Models do not reliably reduce … view at source ↗
Figure 14
Figure 14: [Entity Task] The NDCG rank correlation under different budgets across different models. view at source ↗
Figure 15
Figure 15: Cost-aware tool use on the Entity Task. Left: Utility gain over the no-tool baseline under varying cost constraints. Solid lines show oracle allocation (optimal top-k), dashed lines show model performance with cost information, squares denote no cost-awareness, and dotted lines indicate always-calling. Right: Actual tool calls without budget enforcement. Models do not reliably reduce or stop calls as cost… view at source ↗
Figure 16
Figure 16: [Entity Task] The NDCG rank correlation under different budgets across different models. view at source ↗
Figure 17
Figure 17: [Entity Task] Confusion matrix for the LNE estimator. view at source ↗
Figure 18
Figure 18: The LUEs can predict the True Utility more accurately across most models, especially for small and weaker models. We show the confusion matrices of the two predictors in … view at source ↗
Figure 19
Figure 19: [Entity Task] Confusion matrix for the LUEx estimator. view at source ↗
Figure 20
Figure 20: [Entity Task] Confusion matrix for the LUEx,d estimator. view at source ↗
Figure 21
Figure 21: [Entity Task] Tool-call decisions are guided by the latent need estimator's predicted … view at source ↗
Figure 22
Figure 22: [InvivoQuery Task] Factuality distribution across different models. view at source ↗
Figure 23
Figure 23: [InvivoQuery Task] NO TOOL vs. ALWAYS TOOL performance. Rows group entities by the model's factuality score without a tool (reflecting parametric knowledge), while columns group scores when tool use is forced. Each cell reports the count and the column percentage. Off-diagonal cells indicate performance shifts due to tool use: cells above the diagonal show cases where the tool has positive utility, while … view at source ↗
Figure 24
Figure 24: [InvivoQuery Task] Venn diagrams of three sets: … view at source ↗
Figure 25
Figure 25: [InvivoQuery Task] Perceived need is only partially aligned with tool use. The x-axis shows perceived utility (number of entities predicted to need or not need external information), and the y-axis shows actual tool-use decisions. Percentages indicate how often the model follows its own prediction (call vs. not call). Results are shown for two prompt variants (v2 and v3). Some responses are excluded due… view at source ↗
Figure 26
Figure 26: The perceived need and utility are not aligned with the true need and utility. InvivoQuery Task; Left: perceived need matrices. Right: true vs. perceived utility across models. view at source ↗
Figure 27
Figure 27: Cost-aware tool use on the Invivo Task with implicit budget notification. Left: Utility gain over the no-tool baseline under varying cost constraints. Solid lines show oracle allocation (optimal top-k), dashed lines show model performance with cost information, squares denote no cost-awareness, and dotted lines indicate always-calling. Right: Actual tool calls without budget enforcement. Models do not rel… view at source ↗
Figure 28
Figure 28: [InvivoQuery Task] The NDCG rank correlation under different budgets across different … view at source ↗
Figure 29
Figure 29: Cost-aware tool use on the Invivo Task with explicit budget notification. Left: Utility gain over the no-tool baseline under varying cost constraints. Solid lines show oracle allocation (optimal top-k), dashed lines show model performance with cost information, squares denote no cost-awareness, and dotted lines indicate always-calling. Right: Actual tool calls without budget enforcement. Models do not rel… view at source ↗
Figure 30
Figure 30: [InvivoQuery Task] The NDCG rank correlation under different budgets across different … view at source ↗
Figure 31
Figure 31: InvivoQuery task: The LNE can predict the True Need more accurately across most models, especially for small and weaker models. view at source ↗
Figure 32
Figure 32: [InvivoQuery Task] Confusion matrix for the … view at source ↗
Figure 33
Figure 33: The LUE can predict the True Utility more accurately across most models, especially for small and weaker models. In the … view at source ↗
Figure 34
Figure 34: [InvivoQuery Task] Confusion matrix for the … view at source ↗
Figure 35
Figure 35: [InvivoQuery Task] Confusion matrix for the … view at source ↗
Figure 36
Figure 36: [InvivoQuery Task] Tool-call decisions are guided by the latent need estimator's predicted … view at source ↗
Figure 37
Figure 37: [BFCL Task] Factuality distribution across different models. view at source ↗
Figure 38
Figure 38: [BFCL task] No-Tool vs. Force-Tool performance. Rows group entities by the model's factuality score without a tool (reflecting parametric knowledge), while columns group scores when tool use is forced. Each cell reports the count and the column percentage. Off-diagonal cells indicate performance shifts due to tool use: cells above the diagonal show cases where the tool has positive utility, while cells b… view at source ↗
Figure 39
Figure 39: [BFCL Task] Venn diagrams of three sets: … view at source ↗
Figure 40
Figure 40: [BFCL Task] Perceived need is only partially aligned with tool use. The x-axis shows perceived utility (number of entities predicted to need or not need external information), and the y-axis shows actual tool-use decisions. Percentages indicate how often the model follows its own prediction (call vs. not call). Results are shown for three prompt variants. Some responses are excluded due to parsing failur… view at source ↗
Figure 41
Figure 41: [BFCL Task] The perceived need and utility are not aligned with the true need and utility. Left: perceived need matrices. Right: true vs. perceived utility across models. view at source ↗
Figure 42
Figure 42: Cost-aware tool use on the BFCL Task with implicit budget notification. Left: Utility gain over the no-tool baseline under varying cost constraints. Solid lines show oracle allocation (optimal top-k), dashed lines show model performance with cost information, squares denote no cost-awareness, and dotted lines indicate always-calling. Right: Actual tool calls without budget enforcement. Models do not reli… view at source ↗
Figure 43
Figure 43: [BFCL Task] The NDCG rank correlation under different budgets across different models. view at source ↗
Figure 44
Figure 44: Cost-aware tool use on the BFCL Task with explicit budget notification. Left: Utility gain over the no-tool baseline under varying cost constraints. Solid lines show oracle allocation (optimal top-k), dashed lines show model performance with cost information, squares denote no cost-awareness, and dotted lines indicate always-calling. Right: Actual tool calls without budget enforcement. Models do not relia… view at source ↗
Figure 45
Figure 45: [BFCL Task] The NDCG rank correlation under different budgets across different models. view at source ↗
Figure 46
Figure 46: BFCL task: The LNE can predict the True Need more accurately across most models, especially for small and weaker models. view at source ↗
Figure 47
Figure 47: [BFCL Task] Confusion matrix for the LNE estimator. view at source ↗
Figure 48
Figure 48: The LUE can predict the True Utility more accurately across most models, especially for small and weaker models. view at source ↗
Figure 49
Figure 49: [BFCL Task] Confusion matrix for the LUEx estimator. view at source ↗
Figure 50
Figure 50: [BFCL Task] Confusion matrix for the LUEx,d estimator. view at source ↗
Figure 51
Figure 51: [BFCL Task] Tool-call decisions are guided by the latent need estimator's predicted … view at source ↗
Figure 52
Figure 52: [Entity Task, Perplexity Search] True need and positive utility are correlated, but not perfectly aligned. Rows group by the model's (GPT-OSS-120B) factuality scores under NO TOOL (parametric knowledge), while columns show scores under ALWAYS TOOL. Scores are bucketed into low (0–0.1), mid (0.1–0.9), and high (0.9–1). Cells above the diagonal indicate positive utility, while those below indicate negative … view at source ↗
Figure 53
Figure 53: [Entity Task; Perplexity Search] Perceived need is only partially aligned with tool call (perceived utility). Model: GPT-OSS-120B. The x-axis shows the model's perceived need, and the y-axis shows perceived utility / tool-call decisions. view at source ↗
Figure 54
Figure 54: [Entity Task; Perplexity Search] Perceived signals only partially align with true utility. Venn diagrams of True Positive Utility, Perceived Need, and Perceived Utility for GPT-OSS-120B on the entity task illustrate this misalignment. Ideally, Perceived Utility ⊆ Perceived Need ⊆ True Positive Utility. However, this nesting is violated, indicating misalignment with true utility and helping to explain the … view at source ↗
Figure 55
Figure 55: [Entity Task; Perplexity Search] The perceived need and utility are not aligned with the true need and utility. Entity Task; Top: perceived need matrices. Bottom: true vs. perceived utility across models. view at source ↗
Figure 56
Figure 56: [Entity Task; Perplexity Search] Cost-aware tool use with explicit budget notification. Left: Utility gain over the no-tool baseline under varying cost constraints. Solid lines show oracle allocation (optimal top-k), dashed lines show model performance with cost information, squares denote no cost-awareness, and dotted lines indicate always-calling. Right: Actual tool calls without budget enforcement. Mod… view at source ↗
Figure 57
Figure 57: [Entity Task; Perplexity Search] Cost-aware tool use with implicit budget notification. Left: Utility gain over the no-tool baseline under varying cost constraints. Solid lines show oracle allocation (optimal top-k), dashed lines show model performance with cost information, squares denote no cost-awareness, and dotted lines indicate always-calling. Right: Actual tool calls without budget enforcement. Mod… view at source ↗
read the original abstract

Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM decision: whether to call or not call a tool, when performing a task. This decision is particularly challenging for web search tools, where the benefits of external information depend on the model's internal knowledge and its ability to integrate potentially noisy tool responses. We introduce a principled framework inspired by decision-making theory to evaluate web search tool-use decisions along three key factors: necessity, utility, and affordability. Our analysis combines two complementary lenses: a normative perspective that infers true need and utility from an optimal allocation of tool calls, and a descriptive perspective that infers the model's self-perceived need and utility from their observed behaviors. We find that models' perceived need and utility of tool calls are often misaligned with their true need and utility. Building on this framework, we train lightweight estimators of need and utility based on models' hidden states. Our estimators enable simple controllers that can improve decision quality and lead to stronger task performance than the self-perceived set up across three tasks and six models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces a decision-theoretic framework for evaluating LLM tool-calling decisions (focusing on web search) along necessity, utility, and affordability. It contrasts a normative lens, which infers 'true' need/utility from an optimal allocation of tool calls, with a descriptive lens based on observed model behaviors, revealing frequent misalignments. Lightweight estimators trained on hidden states are then used to build controllers that improve decision quality and task performance over self-perceived baselines across three tasks and six models.

Significance. If the normative labels prove independent of evaluation data and the hidden-state estimators generalize without selection bias, the work offers a practical route to optimize tool use in agentic systems, reducing redundant or harmful calls while boosting accuracy. The empirical demonstration across multiple models and tasks, combined with the use of internal representations rather than external supervision, strengthens its potential impact on reliable LLM agents.

major comments (2)
  1. [Abstract and normative perspective] Abstract and normative perspective section: The central claim that models' perceived need/utility are misaligned with 'true' values, and that hidden-state estimators outperform self-perception, rests on optimal allocation serving as independent ground truth. However, the manuscript does not detail the exact procedure for computing this allocation (e.g., whether it relies on post-hoc accuracy gains on the same instances, any data exclusion rules, or cross-validation). This risks circularity or selection bias in the labels used both for misalignment analysis and estimator training, as flagged in the stress-test note.
  2. [Results and experimental setup] Results and experimental setup: Performance improvements from the controllers are reported across tasks and models, but without error bars, number of runs, statistical tests, or explicit confirmation that optimal-allocation labels were derived from held-out data, the robustness of the gains cannot be verified. This directly affects the claim of stronger task performance than the self-perceived setup.
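The statistical evidence requested in major comment 2 could take the form of a paired bootstrap over instances, comparing per-instance correctness of the controller against the self-decision baseline. The sketch below uses synthetic correctness vectors, not the paper's data:

```python
import random

# Illustrative paired bootstrap: resample instances with replacement and
# count how often the controller fails to beat the baseline, giving an
# approximate one-sided p-value for the observed gain. Data are synthetic.

def paired_bootstrap(ctrl: list[int], base: list[int],
                     n: int = 10_000, seed: int = 0) -> float:
    """Fraction of resamples where the controller is not strictly better."""
    rng = random.Random(seed)
    m = len(ctrl)
    worse = 0
    for _ in range(n):
        sample = [rng.randrange(m) for _ in range(m)]  # same indices for both
        if sum(ctrl[i] for i in sample) <= sum(base[i] for i in sample):
            worse += 1
    return worse / n

ctrl = [1] * 80 + [0] * 20   # controller correct on 80/100 instances
base = [1] * 60 + [0] * 40   # baseline correct on 60/100 instances
print(paired_bootstrap(ctrl, base))  # near 0: gain unlikely to be chance
```

Because the same resampled indices are applied to both systems, per-instance difficulty is held fixed, which is the appropriate comparison for two decision policies evaluated on identical inputs.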
minor comments (3)
  1. [Framework introduction] The affordability factor is mentioned in the framework but receives less elaboration than necessity and utility; a brief formal definition or example early in the paper would improve clarity.
  2. [Figures] Ensure all figures comparing normative vs. descriptive decisions include clear legends, axis labels, and sample sizes to aid interpretation of misalignment patterns.
  3. [Related work] Add a short discussion of related work on LLM calibration, uncertainty estimation, or tool-use optimization to better situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help improve the clarity and rigor of our work. We respond to each major comment below, indicating the revisions we plan to make.

read point-by-point responses
  1. Referee: [Abstract and normative perspective] Abstract and normative perspective section: The central claim that models' perceived need/utility are misaligned with 'true' values, and that hidden-state estimators outperform self-perception, rests on optimal allocation serving as independent ground truth. However, the manuscript does not detail the exact procedure for computing this allocation (e.g., whether it relies on post-hoc accuracy gains on the same instances, any data exclusion rules, or cross-validation). This risks circularity or selection bias in the labels used both for misalignment analysis and estimator training, as flagged in the stress-test note.

    Authors: We agree that a more explicit description of the optimal allocation procedure is necessary to establish its independence as ground truth. In the revised version, we will expand the relevant section to provide a detailed, step-by-step account of how the allocation is computed, including the use of data splits, any exclusion criteria, and cross-validation steps. This will clarify that the normative labels are derived independently of the instances used for misalignment analysis and estimator training, thereby addressing concerns about circularity and selection bias noted in the stress-test. revision: yes

  2. Referee: [Results and experimental setup] Results and experimental setup: Performance improvements from the controllers are reported across tasks and models, but without error bars, number of runs, statistical tests, or explicit confirmation that optimal-allocation labels were derived from held-out data, the robustness of the gains cannot be verified. This directly affects the claim of stronger task performance than the self-perceived setup.

    Authors: We concur that the experimental results would benefit from enhanced statistical reporting to better substantiate the performance improvements. We will revise the results and experimental setup sections to include error bars, specify the number of experimental runs, incorporate appropriate statistical tests, and provide clear confirmation along with details that the optimal-allocation labels were indeed derived from held-out data. These additions will allow readers to verify the robustness of the gains over the self-perceived baseline. revision: yes

Circularity Check

0 steps flagged

No circularity: normative optimal allocation is independent external ground truth

full rationale

The paper's core chain defines true need/utility from an optimal allocation benchmark (computed via task outcomes), contrasts it with self-perceived behavior from observed decisions, trains hidden-state estimators to predict the benchmark, and evaluates controllers on downstream task performance. This is standard probe training to an external label with no equations reducing the estimator output to the input by construction, no self-citation load-bearing the uniqueness of the framework, and no renaming or ansatz smuggling. The claimed misalignment and improvement are falsifiable against held-out task metrics and remain independent of the estimators themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Abstract introduces new framework concepts without detailing fitted parameters; relies on domain assumptions about tool use and ad-hoc definitions of true need.

axioms (2)
  • domain assumption Tool use is not always beneficial; some calls may be redundant or even harmful.
    Explicitly stated in abstract as motivation for the decision problem.
  • ad hoc to paper True need and utility can be inferred from an optimal allocation of tool calls.
    Foundational to the normative perspective used to define misalignment.
invented entities (1)
  • necessity, utility, and affordability factors no independent evidence
    purpose: To structure evaluation of tool-use decisions
    New organizing concepts introduced by the framework.

pith-pipeline@v0.9.0 · 5545 in / 1507 out tokens · 42573 ms · 2026-05-09T19:29:37.035402+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.

Reference graph

Works this paper leans on

56 extracted references · 20 canonical work pages · 5 internal anchors

  1. [1]

    Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home

    Moskvoretskii, Viktor and Marina, Maria and Salnikov, Mikhail and Ivanov, Nikolay and Pletenev, Sergey and Galimzianova, Daria and Krayko, Nikita and Konovalov, Vasily and Nikishina, Irina and Panchenko, Alexander. Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home. Proceedings of the 63rd Annual Meeting of the Association for Compu...

  2. [2]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

  3. [3]

    Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity

    Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

  4. [4]

    Rowen: Adaptive Retrieval-Augmented Generation for Hallucination Mitigation in LLMs

    Rowen: Adaptive Retrieval-Augmented Generation for Hallucination Mitigation in LLMs. Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region.

  5. [5]

    SeaKR: Self-Aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation

    SeaKR: Self-Aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

  6. [6]

    Language Models (Mostly) Know What They Know

    Language Models (Mostly) Know What They Know. arXiv preprint arXiv:2207.05221.

  7. [7]

    Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty with Large Language Models

    Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty with Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024.

  8. [8]

    Large Language Models Must Be Taught to Know What They Don't Know

    Large Language Models Must Be Taught to Know What They Don't Know. Advances in Neural Information Processing Systems.

  9. [9]

    Estimating Knowledge in Large Language Models Without Generating a Single Token

    Estimating Knowledge in Large Language Models Without Generating a Single Token. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

  10. [10]

    Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

    Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs. arXiv preprint arXiv:2406.15927.

  11. [11]

    Query-Level Uncertainty in Large Language Models

    Query-Level Uncertainty in Large Language Models. ICLR.

  12. [12]

    Bowling with ChatGPT: On the Evolving User Interactions with Conversational AI Systems

    Bowling with ChatGPT: On the Evolving User Interactions with Conversational AI Systems. ACM Web Conference.

  13. [13]

    The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models

    The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models. Forty-second International Conference on Machine Learning.

  14. [14]

    Orgad, Hadas, et al., 2025

    Orgad, Hadas and Toker, Michael and Gekhman, Zorik and Reichart, Roi and Szpektor, Idan and Kotek, Hadas and Belinkov, Yonatan. 2025.

  15. [15]

    VeriScore: Evaluating the Factuality of Verifiable Claims in Long-Form Text Generation

    Song, Yixiao and Kim, Yekyung and Iyyer, Mohit. VeriScore: Evaluating the factuality of verifiable claims in long-form text generation. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.552

  16. [16]

    An Introduction to Decision Theory

    An Introduction to Decision Theory. 2017.

  17. [17]

    Judgment under Uncertainty: Heuristics and Biases

    Judgment under Uncertainty: Heuristics and Biases: Biases in judgments reveal some heuristics of thinking under uncertainty. Science. 1974.

  18. [18]

    Yao, Shunyu, et al., 2023

    Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan. 2023.

  19. [19]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Schick, Timo and Dwivedi-Yu, Jane and Dessì. Toolformer: Language models can teach themselves to use tools. Proceedings of the 37th International Conference on Neural Information Processing Systems.

  20. [20]

    GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution

    Lu, Yining and Yu, Haoping and Khashabi, Daniel. GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.eacl-long.7

  21. [21]

    Making Language Models Better Tool Learners with Execution Feedback

    Qiao, Shuofei and Gui, Honghao and Lv, Chengfei and Jia, Qianghuai and Chen, Huajun and Zhang, Ningyu. Making Language Models Better Tool Learners with Execution Feedback. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). doi:10.1865...

  22. [22]

    Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

    Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning. arXiv preprint arXiv:2505.01441.

  23. [23]

    ToolRL: Reward Is All Tool Learning Needs

    ToolRL: Reward Is All Tool Learning Needs. arXiv preprint arXiv:2504.13958.

  24. [24]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs. 2025.

  25. [25]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. arXiv preprint arXiv:2503.09516.

  26. [26]

    ToRL: Scaling Tool-Integrated RL

    ToRL: Scaling Tool-Integrated RL. arXiv preprint arXiv:2503.23383.

  27. [27]

    OTC: Optimal Tool Calls via Reinforcement Learning

    OTC: Optimal Tool Calls via Reinforcement Learning. arXiv e-prints.

  28. [28]

    CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning

    CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  29. [29]

    Budget-Aware Tool-Use Enables Effective Agent Scaling

    Budget-Aware Tool-Use Enables Effective Agent Scaling. arXiv preprint arXiv:2511.17006.

  30. [30]

    Rational Decision-Making Agent with Learning Internal Utility Judgment

    Rational Decision-Making Agent with Learning Internal Utility Judgment. The Thirteenth International Conference on Learning Representations.

  31. [31]

    When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively

    When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively. Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era.

  32. [32]

    Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger

    Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

  33. [33]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems.

  34. [34]

    MINT: Evaluating LLMs in Multi-Turn Interaction with Tools and Language Feedback

    MINT: Evaluating LLMs in Multi-Turn Interaction with Tools and Language Feedback. arXiv preprint arXiv:2309.10691.

  35. [35]

    Tool Learning with Large Language Models: A Survey

    Tool Learning with Large Language Models: A Survey. Frontiers of Computer Science. 2025.

  36. [36]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs. arXiv preprint arXiv:2307.16789.

  37. [37]

    API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

  38. [38]

    Gorilla: Large Language Model Connected with Massive APIs

    Gorilla: Large Language Model Connected with Massive APIs. Advances in Neural Information Processing Systems.

  39. [39]

    MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use

    MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use. arXiv preprint arXiv:2310.03128.

  40. [40]

    Characterizing Web Search in the Age of Generative AI

    Characterizing Web Search in the Age of Generative AI. arXiv preprint arXiv:2510.11560.

  41. [41]

    An Illusion of Progress? Assessing the Current State of Web Agents

    An Illusion of Progress? Assessing the Current State of Web Agents. arXiv preprint arXiv:2504.01382.

  42. [42]

    When LLMs Go Online: The Emerging Threat of Web-Enabled LLMs

    When LLMs Go Online: The Emerging Threat of Web-Enabled LLMs. 34th USENIX Security Symposium (USENIX Security 25).

  43. [43]

    Don't Lie to Your Friends: Learning What You Know from Collaborative Self-Play

    Don't Lie to Your Friends: Learning What You Know from Collaborative Self-Play. arXiv preprint arXiv:2503.14481.

  44. [44]

    WebGLM: Towards an Efficient Web-Enhanced Question Answering System with Human Preferences

    WebGLM: Towards an Efficient Web-Enhanced Question Answering System with Human Preferences. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

  45. [45]

    Search-o1: Agentic Search-Enhanced Large Reasoning Models

    Search-o1: Agentic Search-Enhanced Large Reasoning Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.

  46. [46]

    Asai, Akari and Wu, Zeqiu and Wang, Yizhong and Sil, Avirup and Hajishirzi, Hannaneh. Self-. 2024.

  47. [47]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    Mallen, Alex and Asai, Akari and Zhong, Victor and Das, Rajarshi and Khashabi, Daniel and Hajishirzi, Hannaneh. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023...

  48. [48]

    Cumulated Gain-Based Evaluation of IR Techniques

    Cumulated Gain-Based Evaluation of IR Techniques. ACM Transactions on Information Systems (TOIS). 2002.

  49. [49]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems.

  50. [50]

    Active Retrieval Augmented Generation

    Active Retrieval Augmented Generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

  51. [51]

    DRAGIN: Dynamic Retrieval Augmented Generation Based on the Real-Time Information Needs of Large Language Models

    DRAGIN: Dynamic Retrieval Augmented Generation Based on the Real-Time Information Needs of Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

  52. [52]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. arXiv preprint arXiv:2305.11738.

  53. [53]

    On Early Detection of Hallucinations in Factual Question Answering

    On Early Detection of Hallucinations in Factual Question Answering. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

  54. [54]

    WildHallucinations: Evaluating Long-Form Factuality in LLMs with Real-World Entity Queries

    WildHallucinations: Evaluating Long-Form Factuality in LLMs with Real-World Entity Queries. arXiv preprint arXiv:2407.17468.

  55. [55]

    When2Call: When (Not) to Call Tools

    When2Call: When (Not) to Call Tools. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

  56. [56]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.