LLM Agents Already Know When to Call Tools -- Even Without Reasoning
Pith reviewed 2026-05-22 10:52 UTC · model grok-4.3
The pith
LLM agents already encode whether a tool is needed in their hidden states before generating any output.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation.
What carries the argument
A linear probe on pre-generation hidden states that classifies tool necessity; its prediction selects a steering sentence that is prefilled into the model's response.
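To make the probing step concrete, here is a minimal sketch of a linear probe and an AUROC computation in plain Python. It is not the authors' code: the "hidden states" are synthetic Gaussian vectors, and the names `train_linear_probe`, `probe_probability`, `auroc`, and `fake_hidden_state` are hypothetical. In a real setting the inputs would be pre-generation activations extracted from the model.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear_probe(xs, ys, lr=0.1, epochs=200):
    """Fit logistic-regression weights (w, b) with batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_probability(w, b, x):
    """Probe's estimated probability that a tool is needed."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Assumption: synthetic stand-in for hidden states, shifted by the label.
random.seed(0)
DIM = 8

def fake_hidden_state(label):
    shift = 1.0 if label == 1 else -1.0
    return [random.gauss(shift, 1.0) for _ in range(DIM)]

labels = [i % 2 for i in range(300)]          # 1 = tool needed, 0 = not needed
states = [fake_hidden_state(y) for y in labels]
w, b = train_linear_probe(states[:200], labels[:200])
held_out_scores = [probe_probability(w, b, x) for x in states[200:]]
held_out_auroc = auroc(held_out_scores, labels[200:])
print(f"held-out AUROC on synthetic data: {held_out_auroc:.3f}")
```

The synthetic classes are well separated by construction, so the number printed here says nothing about real models; the sketch only shows what "linearly decodable" and "AUROC" mean operationally.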
If this is right
- Probe&Prefill cuts tool calls by 48% with only 1.7% accuracy loss across tested models.
- At matched accuracy the best prompt-only or reason-then-act baseline reduces tool calls by just 6%.
- The internal signal works across multiple model families without any additional training.
- Steering via hidden-state readout avoids the accuracy penalty that appears when forcing explicit reasoning on hard tasks.
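The steering step described above can be sketched as a simple gate: the probe's output picks one of two sentences, and the chosen sentence is appended to the prompt so that generation continues from it. The sentence wording, the 0.5 default threshold, and the function names are all assumptions for illustration; the summary does not give the paper's actual steering text.

```python
# Hypothetical steering sentences; the paper's exact wording is not given here.
STEER_USE_TOOL = "This task needs a tool, so I will call one."
STEER_NO_TOOL = "I can answer this directly without calling a tool."

def choose_prefill(probe_probability, threshold=0.5):
    """Map the probe's tool-necessity probability to a steering sentence."""
    if not 0.0 <= probe_probability <= 1.0:
        raise ValueError("probe_probability must lie in [0, 1]")
    return STEER_USE_TOOL if probe_probability >= threshold else STEER_NO_TOOL

def build_prefilled_prompt(prompt, probe_probability, threshold=0.5):
    """Append the steering sentence so the model continues from it."""
    return prompt + choose_prefill(probe_probability, threshold)

# Assumed prompt layout; real chat templates differ by model.
example = build_prefilled_prompt("User: What is 2 + 2?\nAssistant: ", 0.08)
print(example)
```

Because the gate runs before any token is generated, it adds one probe evaluation per query rather than an extra reasoning pass.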
Where Pith is reading between the lines
- Similar hidden-state probes could steer other binary or low-cardinality decisions agents make, such as which tool to pick or when to stop iterating.
- The gap between internal knowledge and verbalized reasoning suggests many agent behaviors could be improved by reading representations rather than prompting for explanations.
- Deployed systems could use such probes at inference time to trade off latency and cost on a per-query basis without retraining.
Load-bearing premise
The linear probe trained on hidden states from the benchmark environments will continue to provide a reliable steering signal on new tasks without introducing systematic response biases or accuracy drops beyond the reported 1.7%.
What would settle it
Run Probe&Prefill on a fresh collection of tasks drawn from outside the 18 When2Tool environments and check whether accuracy falls more than 1.7% or tool-call reduction falls substantially below 48%.
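A replication check along these lines might look like the sketch below. It assumes the 1.7% figure is an absolute drop in percentage points and treats "substantially below 48%" as a tolerance `margin`, which is a hypothetical parameter; neither interpretation is confirmed by the summary.

```python
def replication_report(base_acc, steered_acc, base_calls, steered_calls,
                       max_acc_drop_pts=1.7, claimed_reduction=0.48, margin=0.10):
    """Compare new-task results with the reported accuracy loss and call reduction.

    Accuracies are fractions in [0, 1]; calls are raw tool-call counts.
    """
    if base_calls <= 0:
        raise ValueError("base_calls must be positive")
    acc_drop_pts = (base_acc - steered_acc) * 100.0      # absolute points (assumed)
    call_reduction = 1.0 - steered_calls / base_calls    # fraction of calls removed
    return {
        "acc_drop_pts": acc_drop_pts,
        "call_reduction": call_reduction,
        "accuracy_holds": acc_drop_pts <= max_acc_drop_pts,
        "reduction_holds": call_reduction >= claimed_reduction - margin,
    }

# Made-up numbers, purely to show the shape of the check.
report = replication_report(base_acc=0.80, steered_acc=0.79,
                            base_calls=1000, steered_calls=500)
print(report)
```

Both flags need to hold on out-of-benchmark tasks for the premise to survive; a failure of either one points to a different weakness (accuracy cost versus lost efficiency).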
Original abstract
Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5× higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the When2Tool benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning computational scale, knowledge boundaries, and execution reliability, with controlled difficulty levels to create clear tool-necessary vs. tool-unnecessary boundaries. It evaluates prompt-only and reason-then-act baselines, which either suppress necessary calls or incur high accuracy costs. Probing shows tool necessity is linearly decodable from pre-generation hidden states with AUROC 0.89-0.96 across six models, exceeding verbalized reasoning. It proposes Probe&Prefill, a linear probe plus steering sentence that reduces tool calls by 48% with 1.7% accuracy loss, outperforming baselines.
Significance. If the results hold, the work demonstrates that LLMs internally encode tool necessity in hidden states even without explicit reasoning, enabling a lightweight steering method for more efficient tool-augmented agents. Strengths include the new benchmark with difficulty controls, concrete empirical gains (48% reduction, 1.7% loss vs. baselines at 6% reduction or 5x accuracy cost), high AUROC across models, and public code for reproducibility. This could shift agent design toward extracting and acting on existing internal signals rather than relying on prompting.
major comments (2)
- [Probe training and evaluation sections] Both the linear probe training and the Probe&Prefill steering evaluation are conducted entirely within the fixed set of 18 When2Tool environments and difficulty controls. This leaves untested whether the decodable signal (AUROC 0.89-0.96) reflects a general model capability or exploits dataset-specific regularities such as lexical cues for scale or knowledge boundaries. The 48% tool-call reduction with 1.7% accuracy loss may therefore not generalize, which is load-bearing for the central claim that models 'already know' when to call tools.
- [Results on baseline comparisons] The claim that Probe&Prefill outperforms the best baseline at comparable accuracy (6% reduction) or similar reduction with 5x higher accuracy loss needs explicit per-environment and per-difficulty breakdowns to confirm robustness, as aggregate numbers could mask variance on hard multi-hop tasks.
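The per-environment and per-difficulty breakdown requested in the second comment amounts to a grouped average over episode logs. The sketch below assumes a hypothetical record layout (`env`, `difficulty`, `correct`, `tool_called`) and made-up environment names; the benchmark's real log format may differ.

```python
from collections import defaultdict

def breakdown(records):
    """Group episode records by (environment, difficulty) and average each metric."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["env"], r["difficulty"])].append(r)
    table = {}
    for key, rows in sorted(groups.items()):
        n = len(rows)
        table[key] = {
            "n": n,
            "accuracy": sum(r["correct"] for r in rows) / n,
            "tool_call_rate": sum(r["tool_called"] for r in rows) / n,
        }
    return table

# Hypothetical episode logs; field names and values are illustrative only.
records = [
    {"env": "arithmetic", "difficulty": "easy", "correct": True, "tool_called": False},
    {"env": "arithmetic", "difficulty": "easy", "correct": True, "tool_called": True},
    {"env": "arithmetic", "difficulty": "hard", "correct": False, "tool_called": False},
    {"env": "arithmetic", "difficulty": "hard", "correct": True, "tool_called": True},
]
table = breakdown(records)
for (env, difficulty), row in table.items():
    print(env, difficulty, row)
```

Running the same grouping for each method side by side would expose any cell, such as hard multi-hop tasks, where the aggregate figures hide a large accuracy drop.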
minor comments (2)
- [Abstract and Section 3] A short concrete example for each of the three tool-necessity categories would help readers quickly grasp the decision boundaries created by the difficulty controls.
- [Figure legends and table captions] Ensure all axes, metrics (e.g., exact definition of accuracy), and baseline variants are fully labeled without requiring cross-reference to the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment point by point below, with planned revisions noted where we agree changes are warranted.
- Referee: [Probe training and evaluation sections] Both the linear probe training and the Probe&Prefill steering evaluation are conducted entirely within the fixed set of 18 When2Tool environments and difficulty controls. This leaves untested whether the decodable signal (AUROC 0.89-0.96) reflects a general model capability or exploits dataset-specific regularities such as lexical cues for scale or knowledge boundaries. The 48% tool-call reduction with 1.7% accuracy loss may therefore not generalize, which is load-bearing for the central claim that models 'already know' when to call tools.
Authors: We appreciate the referee raising this point about scope and potential dataset-specific effects. The When2Tool benchmark was explicitly constructed to isolate tool necessity across three distinct categories (computational scale, knowledge boundaries, execution reliability), with 15 single-hop and 3 multi-hop environments and controlled difficulty levels that create unambiguous decision boundaries. The high AUROC values are consistent across six models with different architectures and training regimes, which provides some evidence against purely lexical or environment-specific artifacts. Nevertheless, we agree that the current results are benchmark-internal and that stronger claims of generality would benefit from out-of-distribution testing. In the revised manuscript we will add an explicit limitations paragraph in the discussion section acknowledging this scope, clarifying that the central claim is demonstrated under the controlled conditions of When2Tool, and outlining future work on broader generalization. This revision will be partial, as we cannot introduce new experiments at this stage but can improve transparency.
Revision: partial
- Referee: [Results on baseline comparisons] The claim that Probe&Prefill outperforms the best baseline at comparable accuracy (6% reduction) or similar reduction with 5x higher accuracy loss needs explicit per-environment and per-difficulty breakdowns to confirm robustness, as aggregate numbers could mask variance on hard multi-hop tasks.
Authors: We agree that aggregate metrics alone leave open the possibility of uneven performance across environments. While the reported numbers already separate single-hop from multi-hop settings in the main results, we recognize that finer-grained per-environment and per-difficulty breakdowns would allow readers to verify consistency, especially on the harder multi-hop tasks. In the revised version we will add supplementary tables (or an expanded results section) providing these breakdowns for both the baseline methods and Probe&Prefill, including accuracy and tool-call reduction stratified by environment and difficulty level. This will directly address the concern about potential masking of variance.
Revision: yes
Circularity Check
No significant circularity in the decodability or steering claims
The paper's core result—that tool necessity is linearly decodable from pre-generation hidden states with AUROC 0.89–0.96—is obtained by training a linear probe on labeled hidden-state vectors from the When2Tool benchmark and directly reporting its classification performance on held-out examples within the same controlled environments. This constitutes an independent empirical measurement of linear separability rather than any quantity that is fitted and then re-labeled as a prediction or derived by self-definition. The subsequent Probe&Prefill steering method applies the trained probe at inference time but does not alter or presuppose the reported AUROC; the accuracy and tool-call reduction figures are measured outcomes on the benchmark, not tautological restatements of the probe's training objective. No self-citation chain, uniqueness theorem, or ansatz imported from prior author work is invoked to justify the central claim, and the derivation remains self-contained against standard linear-probing benchmarks.
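The rationale above concerns held-out examples within the same environments. A stricter split, which holds out whole environments, can be sketched as follows. This is an illustration of the evaluation design only, not something the paper is reported to have run; the `env` field and the environment names are hypothetical.

```python
def leave_one_environment_out(examples):
    """Yield (held_out_env, train, test) with no environment shared across the split."""
    envs = sorted({ex["env"] for ex in examples})
    for held_out in envs:
        train = [ex for ex in examples if ex["env"] != held_out]
        test = [ex for ex in examples if ex["env"] == held_out]
        yield held_out, train, test

# Hypothetical examples: nine items spread over three placeholder environments.
examples = [{"env": f"env_{i % 3}", "id": i} for i in range(9)]
folds = list(leave_one_environment_out(examples))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

A probe whose AUROC stays high under this split is less likely to be keying on environment-specific lexical cues than one evaluated only on within-environment held-out examples.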
Axiom & Free-Parameter Ledger
free parameters (1)
- probe decision threshold
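Since the decision threshold is the one free parameter, its effect can be examined with a simple sweep. The sketch below assumes the probe outputs a tool-necessity probability and that a call is suppressed when the probability falls below the threshold; the function name, data, and threshold grid are made up for illustration.

```python
def sweep_threshold(probabilities, tool_needed, thresholds):
    """For each threshold, report unneeded calls saved and needed calls missed."""
    n_needed = sum(tool_needed)
    n_unneeded = len(tool_needed) - n_needed
    if n_needed == 0 or n_unneeded == 0:
        raise ValueError("need at least one example of each class")
    rows = []
    for t in thresholds:
        suppressed = [p < t for p in probabilities]   # probe says "no tool"
        saved = sum(1 for s, y in zip(suppressed, tool_needed) if s and y == 0)
        missed = sum(1 for s, y in zip(suppressed, tool_needed) if s and y == 1)
        rows.append({
            "threshold": t,
            "unneeded_calls_saved": saved / n_unneeded,
            "needed_calls_missed": missed / n_needed,
        })
    return rows

# Made-up probe outputs and ground-truth labels (1 = tool needed).
probs = [0.05, 0.20, 0.40, 0.60, 0.85, 0.95]
needed = [0, 0, 1, 0, 1, 1]
rows = sweep_threshold(probs, needed, [0.3, 0.5, 0.7])
for row in rows:
    print(row)
```

Raising the threshold saves more unnecessary calls but starts suppressing necessary ones, which is the trade-off behind any single reported operating point.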
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.