arxiv: 2605.09252 · v2 · pith:BQNNVBBJnew · submitted 2026-05-10 · 💻 cs.CL

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Pith reviewed 2026-05-22 10:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM agents · tool calling · hidden state probing · When2Tool benchmark · linear decodability · agent efficiency · steering methods

The pith

LLM agents already encode whether a tool is needed in their hidden states before generating any output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that models used as tool-augmented agents already contain reliable internal knowledge of tool necessity, detectable directly from pre-generation hidden states. A new benchmark called When2Tool creates controlled environments across computational scale, knowledge boundaries, and execution reliability to isolate when tools are actually required. Probing shows this internal signal achieves high accuracy at distinguishing necessary from unnecessary calls, outperforming the models' own prompted reasoning. A lightweight steering method built on the probe then reduces wasteful tool calls while preserving accuracy.

Core claim

Tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation.
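
For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen tool-necessary example receives a higher probe score than a randomly chosen tool-unnecessary one. The sketch below computes it by direct pair counting. The scores and labels are made-up toy values, not data from the paper, and the function name `auroc` is illustrative.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy probe scores (made up) and tool-necessity labels (1 = tool needed).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auroc = auroc(toy_scores, toy_labels)
```

Pair counting is quadratic in the number of examples, which is fine for a small check; a rank-based implementation would be the usual choice at scale.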

What carries the argument

Linear probe on pre-generation hidden states that classifies tool necessity and is used to prefill a steering sentence.
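
A minimal sketch of this mechanism, under stated assumptions: the "hidden states" are synthetic 4-dimensional vectors rather than activations from a real model, the probe is a plain logistic regression fit by gradient descent, and the two steering sentences are hypothetical placeholders (the paper's actual wording is not reproduced on this page). It is meant only to show the shape of a probe-then-prefill decision, not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(states, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights w and bias b by batch gradient descent."""
    dim = len(states[0])
    w, b = [0.0] * dim, 0.0
    n = len(states)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(states, labels):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            for i in range(dim):
                grad_w[i] += err * h[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_probability(w, b, h):
    """Probe's estimated probability that a tool is needed for hidden state h."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


# Hypothetical steering sentences (assumption, not the paper's text).
PREFILL_TOOL = "I need to use a tool for this."
PREFILL_DIRECT = "I can answer this directly."


def choose_prefill(w, b, h, threshold=0.5):
    """Pick the sentence to prefill before the model starts generating."""
    return PREFILL_TOOL if probe_probability(w, b, h) >= threshold else PREFILL_DIRECT


# Synthetic stand-in for hidden states: tool-needed examples are shifted along dimension 0.
rng = random.Random(0)
states, labels = [], []
for _ in range(200):
    y = rng.randint(0, 1)
    h = [rng.gauss(2.0 if y else -2.0, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(3)]
    states.append(h)
    labels.append(y)

w, b = train_linear_probe(states, labels)
train_acc = sum(
    (probe_probability(w, b, h) >= 0.5) == bool(y) for h, y in zip(states, labels)
) / len(states)
```

In a real setup the vectors would come from the model's representation at the last prompt token, and the chosen sentence would be placed at the start of the assistant turn before decoding continues.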

If this is right

  • Probe&Prefill cuts tool calls by 48% with only 1.7% accuracy loss across tested models.
  • At matched accuracy the best prompt-only or reason-then-act baseline reduces tool calls by just 6%.
  • The internal signal works across multiple model families without any additional training.
  • Steering via hidden-state readout avoids the accuracy penalty that appears when forcing explicit reasoning on hard tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar hidden-state probes could steer other binary or low-cardinality decisions agents make, such as which tool to pick or when to stop iterating.
  • The gap between internal knowledge and verbalized reasoning suggests many agent behaviors could be improved by reading representations rather than prompting for explanations.
  • Deployed systems could use such probes at inference time to trade off latency and cost on a per-query basis without retraining.

Load-bearing premise

The linear probe trained on hidden states from the benchmark environments will continue to provide a reliable steering signal on new tasks without introducing systematic response biases or accuracy drops beyond the reported 1.7%.
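
One way to stress this premise inside the benchmark itself is a leave-one-environment-out split, where the probe never sees the environment it is evaluated on. The sketch below only builds the splits. The record fields (`env`, `tool_needed`) and the environment names are hypothetical placeholders, not the When2Tool schema.

```python
def leave_one_environment_out(examples):
    """Yield (held_out_env, train, test) with each environment held out once."""
    envs = sorted({ex["env"] for ex in examples})
    for held_out in envs:
        train = [ex for ex in examples if ex["env"] != held_out]
        test = [ex for ex in examples if ex["env"] == held_out]
        yield held_out, train, test


# Hypothetical records; environment names are placeholders.
examples = [
    {"env": "env_a", "tool_needed": 1},
    {"env": "env_a", "tool_needed": 0},
    {"env": "env_b", "tool_needed": 1},
    {"env": "env_c", "tool_needed": 0},
]
splits = list(leave_one_environment_out(examples))
```

Each split would then be used to train a probe on `train` and score it on `test`, so that any drop relative to the in-distribution numbers is attributable to the unseen environment.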

What would settle it

Run Probe&Prefill on a fresh collection of tasks drawn from outside the 18 When2Tool environments and check whether accuracy falls more than 1.7% or tool-call reduction falls substantially below 48%.
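
The check described above reduces to two numbers per run. The sketch below computes them and compares against the reported figures. It assumes the 1.7% accuracy loss is an absolute difference in percentage points and uses an arbitrary `slack` of 10 points to stand in for "substantially below 48%"; both are assumptions, as are the dictionary keys and the example numbers.

```python
def summarize_run(baseline, steered):
    """Return (accuracy drop in percentage points, fractional tool-call reduction)."""
    acc_drop = 100.0 * (baseline["accuracy"] - steered["accuracy"])
    reduction = 1.0 - steered["tool_calls"] / baseline["tool_calls"]
    return acc_drop, reduction


def claim_holds(baseline, steered, max_acc_drop=1.7, min_reduction=0.48, slack=0.10):
    """True if the run stays within the reported accuracy loss and near the reported reduction."""
    acc_drop, reduction = summarize_run(baseline, steered)
    return acc_drop <= max_acc_drop and reduction >= min_reduction - slack


# Made-up numbers for illustration only.
baseline_run = {"accuracy": 0.80, "tool_calls": 1000}
good_run = {"accuracy": 0.79, "tool_calls": 520}
bad_run = {"accuracy": 0.75, "tool_calls": 520}

good_drop, good_reduction = summarize_run(baseline_run, good_run)
```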

Figures

Figures reproduced from arXiv: 2605.09252 by Chung-En Sun, Ge Yan, Linbo Liu, Tsui-Wei Weng, Zimo Wang.

Figure 1: Overview. Part 1: We design WHEN2TOOL for studying whether LLM agents know when they need tools, spanning 15 single and 3 multi-hop environments across three categories, each with three difficulty levels. Part 2: PROBE&PREFILL reads the model’s hidden state via a linear probe and prefills a steering sentence to guide the tool-call decision, achieving better tradeoffs. best baselines either reduce tool call…
Figure 2: Accuracy vs. Avg tool calls per difficulty for Qwen3-1.7B.
Figure 3: Accuracy vs. total tool calls for Qwen models.
Figure 4: Accuracy vs. total tool calls for Llama models. Soft prefill (green) is partially ignored; hard…
Figure 5: OOD generalization: in-distribution probe (green) vs. OOD probe trained on held-out…
Figure 6: Soft prefill (green) vs. hard prefill (purple). Hard prefill forces the output format, while soft…
Figure 7: Figure 7 visualizes the tradeoff curves. At T=1.0, the probe is sharp: low thresholds already reduce many tool calls, providing a wider operating range. At T=3.0, the probe is diffuse, offering finer control in the middle range. T=2.0 provides a good balance across models. The choice of temperature does not qualitatively change the finding that PROBE&PREFILL outperforms prompt baselines.
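
The sharp-versus-diffuse behaviour in the Figure 7 caption can be illustrated with a temperature-scaled sigmoid. This sketch assumes the probe emits a logit that is divided by T before the sigmoid, and that a tool call is allowed when the resulting probability reaches the threshold; the caption does not spell out either detail, so both are assumptions, and the logits are made-up values.

```python
import math


def probe_probability(logit, temperature):
    """Temperature-scaled sigmoid: a larger T pulls probabilities toward 0.5."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def tool_calls_allowed(logits, temperature, threshold):
    """Count examples whose scaled probability of tool necessity reaches the threshold."""
    return sum(probe_probability(z, temperature) >= threshold for z in logits)


toy_logits = [-4.0, -2.0, -0.5, 0.5, 2.0, 4.0]  # made-up probe logits
thresholds = [0.3, 0.5, 0.7, 0.9]
sweep = {
    T: [tool_calls_allowed(toy_logits, T, th) for th in thresholds]
    for T in (1.0, 2.0, 3.0)
}
```

At a higher temperature the probabilities bunch around 0.5, so thresholds near the middle of the range separate more examples while extreme thresholds separate fewer.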
Original abstract

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5$\times$ higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the When2Tool benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning computational scale, knowledge boundaries, and execution reliability, with controlled difficulty levels to create clear tool-necessary vs. tool-unnecessary boundaries. It evaluates prompt-only and reason-then-act baselines, which either suppress necessary calls or incur high accuracy costs. Probing shows tool necessity is linearly decodable from pre-generation hidden states with AUROC 0.89-0.96 across six models, exceeding verbalized reasoning. It proposes Probe&Prefill, a linear probe plus steering sentence that reduces tool calls by 48% with 1.7% accuracy loss, outperforming baselines.

Significance. If the results hold, the work demonstrates that LLMs internally encode tool necessity in hidden states even without explicit reasoning, enabling a lightweight steering method for more efficient tool-augmented agents. Strengths include the new benchmark with difficulty controls, concrete empirical gains (48% reduction, 1.7% loss vs. baselines at 6% reduction or 5x accuracy cost), high AUROC across models, and public code for reproducibility. This could shift agent design toward extracting and acting on existing internal signals rather than relying on prompting.

major comments (2)
  1. [Probe training and evaluation sections] Both the linear probe training and the Probe&Prefill steering evaluation are conducted entirely within the fixed set of 18 When2Tool environments and difficulty controls. This leaves untested whether the decodable signal (AUROC 0.89-0.96) reflects a general model capability or exploits dataset-specific regularities such as lexical cues for scale or knowledge boundaries. The 48% tool-call reduction with 1.7% accuracy loss may therefore not generalize, which is load-bearing for the central claim that models 'already know' when to call tools.
  2. [Results on baseline comparisons] The claim that Probe&Prefill outperforms the best baseline at comparable accuracy (6% reduction) or similar reduction with 5x higher accuracy loss needs explicit per-environment and per-difficulty breakdowns to confirm robustness, as aggregate numbers could mask variance on hard multi-hop tasks.
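
The breakdown requested in the second major comment amounts to a group-by over per-example results. A minimal sketch follows, assuming a hypothetical record layout with `env`, `difficulty`, `method`, `correct`, and `tool_calls` fields; none of these names come from the paper's code, and the rows are invented.

```python
from collections import defaultdict


def breakdown(records):
    """Aggregate accuracy and mean tool calls per (environment, difficulty, method)."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["env"], r["difficulty"], r["method"])].append(r)
    table = {}
    for key, rows in groups.items():
        table[key] = {
            "n": len(rows),
            "accuracy": sum(r["correct"] for r in rows) / len(rows),
            "avg_tool_calls": sum(r["tool_calls"] for r in rows) / len(rows),
        }
    return table


# Invented rows for illustration only.
records = [
    {"env": "env_a", "difficulty": "hard", "method": "baseline", "correct": 1, "tool_calls": 2},
    {"env": "env_a", "difficulty": "hard", "method": "baseline", "correct": 0, "tool_calls": 2},
    {"env": "env_a", "difficulty": "hard", "method": "probe_prefill", "correct": 1, "tool_calls": 1},
    {"env": "env_a", "difficulty": "easy", "method": "probe_prefill", "correct": 1, "tool_calls": 0},
]
table = breakdown(records)
```

Reporting `n` alongside each cell makes it easier to see when a large swing on a hard multi-hop cell rests on only a handful of examples.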
minor comments (2)
  1. [Abstract and Section 3] A short concrete example for each of the three tool-necessity categories would help readers quickly grasp the decision boundaries created by the difficulty controls.
  2. [Figure legends and table captions] Ensure all axes, metrics (e.g., exact definition of accuracy), and baseline variants are fully labeled without requiring cross-reference to the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment point by point below, with planned revisions noted where we agree changes are warranted.

  1. Referee: [Probe training and evaluation sections] Both the linear probe training and the Probe&Prefill steering evaluation are conducted entirely within the fixed set of 18 When2Tool environments and difficulty controls. This leaves untested whether the decodable signal (AUROC 0.89-0.96) reflects a general model capability or exploits dataset-specific regularities such as lexical cues for scale or knowledge boundaries. The 48% tool-call reduction with 1.7% accuracy loss may therefore not generalize, which is load-bearing for the central claim that models 'already know' when to call tools.

    Authors: We appreciate the referee raising this point about scope and potential dataset-specific effects. The When2Tool benchmark was explicitly constructed to isolate tool necessity across three distinct categories (computational scale, knowledge boundaries, execution reliability), with 15 single-hop and 3 multi-hop environments and controlled difficulty levels that create unambiguous decision boundaries. The high AUROC values are consistent across six models with different architectures and training regimes, which provides some evidence against purely lexical or environment-specific artifacts. Nevertheless, we agree that the current results are benchmark-internal and that stronger claims of generality would benefit from out-of-distribution testing. In the revised manuscript we will add an explicit limitations paragraph in the discussion section acknowledging this scope, clarifying that the central claim is demonstrated under the controlled conditions of When2Tool, and outlining future work on broader generalization. This revision will be partial, as we cannot introduce new experiments at this stage but can improve transparency.

    revision: partial

  2. Referee: [Results on baseline comparisons] The claim that Probe&Prefill outperforms the best baseline at comparable accuracy (6% reduction) or similar reduction with 5x higher accuracy loss needs explicit per-environment and per-difficulty breakdowns to confirm robustness, as aggregate numbers could mask variance on hard multi-hop tasks.

    Authors: We agree that aggregate metrics alone leave open the possibility of uneven performance across environments. While the reported numbers already separate single-hop from multi-hop settings in the main results, we recognize that finer-grained per-environment and per-difficulty breakdowns would allow readers to verify consistency, especially on the harder multi-hop tasks. In the revised version we will add supplementary tables (or an expanded results section) providing these breakdowns for both the baseline methods and Probe&Prefill, including accuracy and tool-call reduction stratified by environment and difficulty level. This will directly address the concern about potential masking of variance.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity in the decodability or steering claims

The paper's core result, that tool necessity is linearly decodable from pre-generation hidden states with AUROC 0.89–0.96, is obtained by training a linear probe on labeled hidden-state vectors from the When2Tool benchmark and reporting its classification performance on held-out examples within the same controlled environments. This is an independent empirical measurement of linear separability, not a quantity that is fitted and then re-labeled as a prediction or derived by self-definition. The subsequent Probe&Prefill steering method applies the trained probe at inference time but does not alter or presuppose the reported AUROC; the accuracy and tool-call reduction figures are measured outcomes on the benchmark, not tautological restatements of the probe's training objective. No self-citation chain, uniqueness theorem, or ansatz imported from prior author work is invoked to justify the central claim, and the derivation remains self-contained against standard linear-probing benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central method trains a linear probe on hidden-state activations; this introduces a small number of fitted weights whose values are determined from the benchmark data. No new physical entities or unstated mathematical axioms are introduced.

free parameters (1)
  • probe decision threshold
    The cutoff used to convert the probe output into a steering sentence is chosen to achieve the reported accuracy-tool-call trade-off.
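
How the cutoff is chosen is not described on this page. One plausible procedure, sketched here purely as an assumption, is to sweep candidate thresholds on validation data and keep the largest one whose rate of missed necessary calls stays within a budget. The function name, the budget parameter, and the toy values are all illustrative.

```python
def pick_threshold(probs, tool_needed, candidates, max_missed_rate=0.05):
    """Largest candidate threshold whose missed-necessary-call rate stays within budget.

    A call counts as missed when the tool was needed but the probe probability
    fell below the threshold, so the steering sentence would discourage the call.
    Returns None if no candidate satisfies the budget.
    """
    needed = [p for p, y in zip(probs, tool_needed) if y == 1]
    if not needed:
        raise ValueError("need at least one tool-necessary example")
    best = None
    for th in sorted(candidates):
        missed = sum(p < th for p in needed) / len(needed)
        if missed <= max_missed_rate:
            best = th
    return best


# Made-up validation probabilities and labels (1 = tool needed).
val_probs = [0.05, 0.2, 0.4, 0.6, 0.7, 0.9, 0.95]
val_labels = [0, 0, 0, 1, 1, 1, 1]
grid = [0.3, 0.5, 0.65, 0.8]

loose = pick_threshold(val_probs, val_labels, grid, max_missed_rate=0.25)
strict = pick_threshold(val_probs, val_labels, grid, max_missed_rate=0.0)
```

A higher threshold suppresses more calls, so the budget on missed necessary calls is what keeps the accuracy cost bounded in this sketch.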
