Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
Pith reviewed 2026-05-20 20:47 UTC · model grok-4.3
The pith
LLMs often recognize internally that a tool is needed but frequently fail to emit the actual tool call.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a model-adaptive definition of tool necessity based on each LLM's standalone performance, the work documents substantial gaps between when a tool is truly required and when the model actually calls one. Hidden-state probes isolate a cognition signal for tool need from an execution signal for producing the call; these signals remain linearly readable but become nearly orthogonal at the final layers and last token. Sample trajectories indicate that the dominant source of mismatch lies in the cognition-to-action transition rather than in the cognition stage.
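To make the trajectory argument concrete, the sketch below splits necessity/action mismatches by the stage at which they arise. It is an illustrative reading of the two-stage decomposition, not the authors' code: the function name `decompose_mismatches`, the three boolean inputs, and the toy labels are all assumptions introduced here.

```python
from collections import Counter

def decompose_mismatches(necessary, believed, called):
    """Split necessity/action mismatches by the stage where they arise.

    necessary: model-adaptive necessity labels (bool per sample)
    believed:  cognition-probe predictions of tool need (bool per sample)
    called:    whether the model actually emitted a tool call (bool per sample)
    All three inputs are hypothetical stand-ins for the paper's measurements.
    """
    counts = Counter()
    for need, belief, action in zip(necessary, believed, called):
        if action == need:
            counts["match"] += 1
        elif belief == need:
            # The internal signal was right, but the action disagreed with it.
            counts["transition_error"] += 1
        else:
            # The internal signal was already wrong before any action.
            counts["cognition_error"] += 1
    total = sum(counts.values())
    mismatches = total - counts["match"]
    return {
        "mismatch_rate": mismatches / total if total else 0.0,
        "transition_share": counts["transition_error"] / mismatches if mismatches else 0.0,
        "cognition_share": counts["cognition_error"] / mismatches if mismatches else 0.0,
    }

# Toy, made-up labels purely for illustration.
necessary = [True, True, True, False, False, True, False, True]
believed = [True, True, False, False, False, True, True, True]
called = [True, False, False, False, True, False, True, True]
result = decompose_mismatches(necessary, believed, called)
```

On this toy input, five of eight samples are mismatches, and three of those five are attributed to the transition step. A transition share above one half is the pattern the core claim describes.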
What carries the argument
Model-adaptive tool necessity combined with linear probes that separate cognition and execution stages, revealing nearly orthogonal directions in the late-layer last-token regime.
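A minimal sketch of the probing idea, under loud assumptions: the "hidden states" are synthetic Gaussian vectors rather than real late-layer last-token activations, the two labels are constructed to depend on different coordinates, and the probe is a plain logistic regression trained by gradient descent. The paper's actual probe architecture and training setup are not specified in this review.

```python
import math
import random

def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Fit a logistic-regression probe by batch gradient descent; return its weights."""
    dim = len(X[0])
    n = len(X)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w

def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

random.seed(0)
dim, n = 8, 400
# Synthetic stand-in for hidden states (assumption, not real activations).
X = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
# By construction, "necessity" depends on coordinate 0 and "tool call" on coordinate 1.
necessity = [1 if x[0] > 0 else 0 for x in X]
tool_call = [1 if x[1] > 0 else 0 for x in X]

w_cognition = train_linear_probe(X, necessity)
w_execution = train_linear_probe(X, tool_call)
similarity = cosine(w_cognition, w_execution)
```

Because the two labels are built from independent coordinates, both are linearly decodable while the learned directions have cosine similarity close to zero. That is the qualitative signature the review attributes to the late-layer, last-token regime.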
If this is right
- Improving tool-use reliability requires aligning the execution stage with internal recognition rather than improving recognition alone.
- The majority of errors concentrate in the transition step, so training objectives that target output alignment should reduce mismatches more effectively than recognition-focused objectives.
- The same two-stage structure and orthogonality pattern may appear in other agent decisions that require choosing between direct generation and external actions.
Where Pith is reading between the lines
- Training regimes that penalize divergence between the cognition and execution directions could close the observed gap without changing the underlying model size.
- The pattern suggests that scaling laws for tool use may plateau until the action-alignment problem is addressed directly.
- Similar knowing-doing separations could be probed in other domains such as code generation or multi-step planning where internal plans do not always reach the output.
Load-bearing premise
Linear probes on hidden states at the last token in late layers cleanly isolate and measure the model's internal recognition of tool need separately from its decision to generate a tool call.
What would settle it
An intervention that rotates or projects the hidden state along the cognition probe direction at generation time and measures whether tool-call accuracy rises or falls in line with the predicted direction.
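The vector arithmetic behind such an intervention can be sketched as follows. This is only the geometric core, assuming a hypothetical 4-dimensional hidden state and hand-picked orthogonal directions. A real experiment would apply the edit inside the model's forward pass and then measure the tool-call rate, which is not shown here.

```python
import math

def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction) (additive steering)."""
    d = unit(direction)
    return [h + alpha * di for h, di in zip(hidden, d)]

def ablate(hidden, direction):
    """Remove the component of hidden along direction (projection ablation)."""
    d = unit(direction)
    coeff = dot(hidden, d)
    return [h - coeff * di for h, di in zip(hidden, d)]

# Hypothetical hidden state and probe directions, orthogonal on purpose.
hidden = [0.5, -1.0, 2.0, 0.3]
w_cognition = [1.0, 0.0, 0.0, 0.0]
w_execution = [0.0, 1.0, 0.0, 0.0]

steered = steer(hidden, w_cognition, alpha=3.0)
cognition_shift = dot(steered, unit(w_cognition)) - dot(hidden, unit(w_cognition))
execution_shift = dot(steered, unit(w_execution)) - dot(hidden, unit(w_execution))
ablated = ablate(hidden, w_cognition)
```

In this toy geometry, steering along the cognition direction moves the cognition readout by `alpha` and leaves the execution readout unchanged. If the two directions are nearly orthogonal in a real model, a linear readout of execution would behave similarly, so the informative measurement is whether the downstream tool-call rate changes after the edit.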
Original abstract
Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judge, and mostly cover cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool-necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA dataset, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples in the two-stage process, we further discover that the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool-use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a model-adaptive definition of tool necessity grounded in each LLM's empirical performance rather than human or LLM-judge annotations. Across four models on arithmetic and factual QA tasks, it reports mismatches between this necessity and observed tool-calling behavior of 26.5-54.0% and 30.8-41.8%, respectively. The authors decompose tool use into an internal cognition stage (belief in necessity) and an execution stage (actual tool call), using linear probes on hidden states to show that both signals are often linearly decodable yet their directions are nearly orthogonal in the late-layer, last-token regime. Trajectory analysis indicates that the majority of mismatches concentrate in the cognition-to-action transition, revealing a knowing-doing gap.
Significance. If the central claims hold, the work identifies a concrete limitation in LLM agents—the gap between internal recognition of tool need and actual invocation—which has direct implications for agent reliability. The model-adaptive necessity definition is a methodological strength that avoids treating necessity as model-agnostic. The probing approach and trajectory tracing provide a useful diagnostic lens, though the overall significance depends on stronger validation of the probe-to-behavior link.
major comments (3)
- [Methods] Methods: The necessity labels are defined via a performance threshold on each model's empirical results, yet no explicit procedure, threshold value, or sensitivity analysis is provided; this free parameter directly determines the reported mismatch ranges of 26.5-54.0% and 30.8-41.8% and must be specified for the central mismatch claim to be reproducible.
- [Results] Results on hidden-state probes: The claim that mismatches concentrate in the cognition-to-action transition rests on near-orthogonality of necessity and tool-call probe directions in the late-layer last-token regime, but the manuscript reports decodability without causal interventions (activation patching or direction editing) to establish that these directions control the output distribution rather than capturing output-format correlations.
- [Results] Trajectory analysis: The finding that the majority of mismatch occurs in the cognition-to-action transition (rather than cognition itself) is load-bearing for the knowing-doing gap conclusion, yet it is presented without reporting exact orthogonality metrics (e.g., cosine similarity), statistical significance, or controls for probe training details.
minor comments (2)
- [Abstract] The abstract and results would benefit from a table summarizing mismatch rates per model and task for easier comparison.
- [Methods] Notation for the two stages (cognition vs. execution) should be introduced with explicit definitions early in the methods to improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped clarify several aspects of our work. We address each major comment below and have made revisions to improve reproducibility, add quantitative metrics, and acknowledge methodological limitations.
Point-by-point responses
- Referee: [Methods] The necessity labels are defined via a performance threshold on each model's empirical results, yet no explicit procedure, threshold value, or sensitivity analysis is provided; this free parameter directly determines the reported mismatch ranges of 26.5-54.0% and 30.8-41.8% and must be specified for the central mismatch claim to be reproducible.
  Authors: We agree that the original manuscript lacked sufficient detail on the threshold procedure. In the revised version, we explicitly describe the process: for each model, we compute accuracy with and without tools on a held-out validation set and set the necessity threshold at the point where no-tool accuracy falls below 60% of tool-augmented accuracy. We report the resulting per-model thresholds and include a sensitivity analysis showing mismatch rates vary by at most 7% across thresholds in the 50-70% range. These changes make the mismatch claims fully reproducible.
  Revision: yes
- Referee: [Results] The claim that mismatches concentrate in the cognition-to-action transition rests on near-orthogonality of necessity and tool-call probe directions in the late-layer last-token regime, but the manuscript reports decodability without causal interventions (activation patching or direction editing) to establish that these directions control the output distribution rather than capturing output-format correlations.
  Authors: We acknowledge the correlational nature of the probe analysis and the absence of causal interventions such as activation patching. While such experiments would strengthen causal claims, they require substantial additional compute and were outside the scope of this study. We have added an explicit limitations paragraph noting this gap and suggesting activation patching as valuable future work. Nevertheless, the combination of high linear decodability for both signals and their consistent near-orthogonality across models and layers provides substantive evidence for distinct cognition and action representations, as spurious output-format correlations alone would not produce this structured pattern.
  Revision: partial
- Referee: [Results] The finding that the majority of mismatch occurs in the cognition-to-action transition (rather than cognition itself) is load-bearing for the knowing-doing gap conclusion, yet it is presented without reporting exact orthogonality metrics (e.g., cosine similarity), statistical significance, or controls for probe training details.
  Authors: We have incorporated the requested details into the revised manuscript. The average cosine similarity between necessity and tool-call probe directions in late-layer last-token positions is 0.09 (standard deviation 0.06 across models). Permutation tests yield p < 0.001 for the observed orthogonality. As a control, we retrained probes on label-shuffled data and observed significantly higher average cosine similarities (0.41), confirming that the near-orthogonality is not an artifact of probe training procedure or dimensionality.
  Revision: yes
Circularity Check
No significant circularity; claims rest on independent empirical measurements
The paper defines tool necessity from each model's observed empirical performance on tasks, directly compares this to actual tool-call behavior to quantify mismatches, and then trains linear probes on hidden states to test decodability of the necessity and action signals. The reported orthogonality of probe directions and concentration of mismatches in the cognition-to-action transition are direct empirical observations from these measurements. No equations or steps reduce a claimed result to its own fitted inputs by construction, and no self-citation chain is invoked as the sole justification for a uniqueness or ansatz claim. The analysis is self-contained against external benchmarks of model performance and probe accuracy.
Axiom & Free-Parameter Ledger
free parameters (1)
- performance threshold defining necessity
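The threshold rule quoted in the simulated rebuttal (a tool counts as necessary when no-tool accuracy falls below 60% of tool-augmented accuracy) can be sketched as below, together with a small sensitivity sweep. Applying the rule per item, the field names, and the accuracy values are all assumptions made for illustration. The review does not say at what granularity the rule is applied.

```python
def label_necessity(no_tool_acc, tool_acc, ratio=0.6):
    """Label a tool as necessary when no-tool accuracy < ratio * tool accuracy."""
    return no_tool_acc < ratio * tool_acc

def mismatch_rate(necessary, called):
    """Fraction of samples where the necessity label and the tool-call action disagree."""
    pairs = list(zip(necessary, called))
    return sum(need != action for need, action in pairs) / len(pairs)

# Hypothetical per-item accuracies (fraction correct over repeated samples)
# and observed tool-call behavior; none of these numbers come from the paper.
items = [
    {"no_tool": 0.9, "tool": 1.0, "called": False},
    {"no_tool": 0.2, "tool": 0.9, "called": False},
    {"no_tool": 0.5, "tool": 1.0, "called": True},
    {"no_tool": 0.7, "tool": 0.8, "called": True},
]
called = [item["called"] for item in items]

# Sensitivity sweep over the free parameter, mirroring the 50-70% range.
rates = {}
for ratio in (0.5, 0.6, 0.7):
    necessary = [label_necessity(item["no_tool"], item["tool"], ratio) for item in items]
    rates[ratio] = mismatch_rate(necessary, called)
```

Even in this four-item toy set, one borderline item flips its label between ratios 0.5 and 0.6 and changes the mismatch rate. This is why the ledger lists the threshold as a free parameter and why the referee asks for a sensitivity analysis.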
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.