Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
Pith reviewed 2026-05-15 05:27 UTC · model grok-4.3
The pith
LLMs recognize when tools are needed but frequently fail to call them, exposing a knowing-doing gap in their decision process.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tool necessity is model-dependent because capability boundaries differ across LLMs. When necessity is measured by each model's own accuracy with and without tools, observed tool calls diverge from necessity in 26.5-54.0% of arithmetic cases and 30.8-41.8% of factual QA cases. Linear probes recover both the necessity signal and the action signal from hidden states, but the two probe directions become nearly orthogonal in late layers. Trajectory analysis shows that the majority of mismatches arise during the cognition-to-action transition.
What carries the argument
The model-adaptive tool-necessity label, which marks a query as requiring a tool only when the model fails without it yet succeeds with it, together with linear probes that isolate the internal cognition representation from the execution decision.
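The label's logic is simple enough to state as code. A minimal sketch of the definition and the mismatch measurement, with illustrative names and toy data rather than the paper's:

```python
def necessity_label(correct_without_tool: bool, correct_with_tool: bool) -> bool:
    """Model-adaptive necessity: the tool is needed iff the model fails
    on its own but succeeds when the tool is available."""
    return (not correct_without_tool) and correct_with_tool

def mismatch_rate(records) -> float:
    """Fraction of queries where the observed tool call diverges from necessity.

    Each record: (correct_without_tool, correct_with_tool, called_tool).
    """
    mismatches = sum(
        necessity_label(wo, wt) != called for wo, wt, called in records
    )
    return mismatches / len(records)

# Toy example: four queries for one model.
records = [
    (False, True, True),   # needed, called   -> match
    (False, True, False),  # needed, skipped  -> knowing-doing gap
    (True, True, True),    # not needed, called anyway -> overuse
    (True, True, False),   # not needed, skipped -> match
]
print(mismatch_rate(records))  # -> 0.5
```

Because the label is defined per model, the same query can be "tool-necessary" for a weak model and not for a strong one.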
If this is right
- Reliable tool use requires separate training signals for the recognition step and the action step rather than a single end-to-end objective.
- Late-layer representations must be regularized so that the direction encoding necessity aligns with the direction controlling next-token tool calls.
- Tool policies and prompts should be tuned per model size and family because necessity boundaries shift with capability.
- Diagnosis via probing can be turned into a monitoring signal to detect when a deployed agent is about to ignore its own internal assessment.
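The probing setup behind these points can be sketched on synthetic data: fit one linear probe per signal on the same hidden states, then compare the probe directions by cosine similarity. Everything below is a toy stand-in (random "hidden states", least-squares probes), not the paper's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# Assume two orthogonal ground-truth directions in hidden space.
u = np.zeros(d); u[0] = 1.0   # carries the "necessity" signal
v = np.zeros(d); v[1] = 1.0   # carries the "tool-call" signal

H = rng.normal(size=(500, d))          # synthetic hidden states
necessity = (H @ u > 0).astype(float)  # binary labels for each signal
action = (H @ v > 0).astype(float)

def probe_direction(X, y):
    """Least-squares linear probe; returns the unit weight vector."""
    w, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
    return w / np.linalg.norm(w)

w_nec = probe_direction(H, necessity)
w_act = probe_direction(H, action)
cos = float(w_nec @ w_act)
print(abs(cos))  # near 0: both signals decodable, yet via orthogonal directions
```

A near-zero cosine here mirrors the paper's finding: each signal is linearly decodable on its own while the two probe directions share almost no component.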
Where Pith is reading between the lines
- The gap may persist or widen as models become more capable if the training objective continues to optimize only for final answer accuracy.
- Similar internal-to-output misalignments could appear in other agentic skills such as multi-step planning or self-correction.
- Adding an auxiliary loss that minimizes the angle between the necessity probe direction and the tool-call logit direction offers a direct way to close the observed gap.
- The same two-stage probe method could be applied to open-weight models to test whether the orthogonality pattern is architecture-dependent.
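The auxiliary-loss idea above can be made concrete. The sketch below, with purely hypothetical names, gradient-descends a trainable "tool-call logit direction" toward a frozen "necessity probe direction" under the loss 1 - cos(w_probe, w_logit); it is an illustration of the objective, not a training recipe:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
w_probe = rng.normal(size=d)
w_probe /= np.linalg.norm(w_probe)  # frozen necessity-probe direction
w_logit = rng.normal(size=d)        # trainable tool-call logit direction

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

lr = 0.5
for _ in range(300):
    # loss = 1 - cos(w_probe, w_logit); analytic gradient w.r.t. w_logit
    n = np.linalg.norm(w_logit)
    grad = -(w_probe / n) + (w_probe @ w_logit) * w_logit / n**3
    w_logit -= lr * grad

print(cosine(w_probe, w_logit))  # alignment is now close to 1.0
```

In a real model this term would be added to the task loss, trading off answer accuracy against keeping the action readout aligned with the recognition direction.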
Load-bearing premise
That empirical success without a tool versus with it correctly identifies what the model 'should' do, and that linear probes on hidden states capture the intended internal state without major distortion from the probing technique.
What would settle it
Train or prompt a model to emit a tool call precisely when the necessity probe fires positive, then measure whether its accuracy rises to match the level achieved by the adaptive definition.
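As a toy version of this experiment, compare answer accuracy when tool calls follow the model's observed behavior versus when they are gated on the necessity label (used here as an oracle probe). The records are made up for illustration:

```python
def accuracy(records, use_probe_gate: bool) -> float:
    """records: (correct_without_tool, correct_with_tool, called_tool)."""
    correct = 0
    for correct_without, correct_with, called in records:
        necessary = (not correct_without) and correct_with
        call = necessary if use_probe_gate else called  # gate on the label?
        correct += correct_with if call else correct_without
    return correct / len(records)

records = [
    (False, True, False),  # tool needed but never called
    (False, True, True),
    (True, True, False),
    (True, False, True),   # harmful tool call: tool-assisted answer is wrong
]
print(accuracy(records, use_probe_gate=False))  # observed policy -> 0.5
print(accuracy(records, use_probe_gate=True))   # probe-gated     -> 1.0
```

If a real probe-gated policy recovered most of this gap, it would confirm that recognition, not capability, is the binding constraint.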
Original abstract
Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judge, and mostly cover cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool-necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA dataset, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples in the two-stage process, we further discover that the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool-use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a model-adaptive definition of tool necessity grounded in each LLM's empirical performance on arithmetic and factual QA tasks, rather than human or LLM-judge annotations. It reports mismatches between this necessity label and observed tool-calling behavior of 26.5-54.0% (arithmetic) and 30.8-41.8% (factual QA) across four models. Linear probes on hidden states show both necessity and tool-call signals are often linearly decodable, yet their directions are nearly orthogonal in the late-layer last-token regime; trajectory analysis attributes most mismatches to the cognition-to-action transition. The central claim is a 'knowing-doing gap' requiring better translation of recognition into action.
Significance. If the probing results and mismatch measurements hold under scrutiny, the work offers a useful diagnostic for LLM agent reliability by separating internal recognition from execution. The model-adaptive necessity definition avoids assuming uniform capabilities across models, which is a clear methodological advance over prior model-agnostic annotations. The linear-probe decomposition provides a concrete way to localize failures, though its implications for training interventions remain to be demonstrated.
Major comments (3)
- §4.2 (probe analysis): the claim that necessity and tool-call signals are 'nearly orthogonal' in late-layer last-token representations is load-bearing for the knowing-doing gap conclusion, yet the manuscript reports neither cosine similarities nor statistical tests against null distributions; without these, it is impossible to distinguish true dissociation from probe noise or label variance.
- §3 (necessity labeling): the model-adaptive necessity label is derived from post-hoc empirical performance on the same tasks used for behavior measurement; while not circular by construction, the absence of held-out splits, cross-validation details, or sensitivity analysis to performance thresholds makes it unclear whether the reported 26.5-54.0% and 30.8-41.8% mismatch ranges are robust or partly artifacts of label construction.
- §5 (trajectory analysis): the conclusion that the majority of mismatches occur in the cognition-to-action transition relies on linear probes faithfully reflecting internal stages; however, no causal interventions (e.g., activation patching along the necessity direction) are performed to test whether the probed signal actually participates in next-token computation for tool calls, leaving the gap interpretation correlational rather than mechanistic.
Minor comments (2)
- Table 1: mismatch percentages lack error bars or per-model breakdowns, which would help assess whether the wide ranges (26.5-54.0%) reflect genuine model differences or sampling variability.
- Figure 3: the hidden-state trajectory plots would benefit from explicit axis labels indicating the two probe directions and from reporting the number of samples per trajectory bin.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while indicating where revisions will be made to improve clarity and robustness.
Point-by-point responses
- Referee: §4.2 (probe analysis): the claim that necessity and tool-call signals are 'nearly orthogonal' in late-layer last-token representations is load-bearing for the knowing-doing gap conclusion, yet the manuscript reports neither cosine similarities nor statistical tests against null distributions; without these, it is impossible to distinguish true dissociation from probe noise or label variance.
Authors: We agree that explicit quantitative support for the orthogonality claim would strengthen the presentation. In the revised manuscript we will add cosine similarity values between the necessity and tool-call probe directions (observed range 0.08–0.22 in late-layer last-token positions) together with permutation-test results against label-shuffled null distributions (p < 0.01). These statistics will be reported in §4.2 with an accompanying figure. Revision: yes.
- Referee: §3 (necessity labeling): the model-adaptive necessity label is derived from post-hoc empirical performance on the same tasks used for behavior measurement; while not circular by construction, the absence of held-out splits, cross-validation details, or sensitivity analysis to performance thresholds makes it unclear whether the reported 26.5-54.0% and 30.8-41.8% mismatch ranges are robust or partly artifacts of label construction.
Authors: The necessity label is an empirical definition based on each model's observed accuracy without tools; it is therefore not circular with the tool-calling decision. To demonstrate robustness we will include, in the revision, a sensitivity analysis that varies the accuracy threshold (70%, 80%, 90%) and reports the resulting mismatch ranges. We will also clarify the threshold-selection procedure. Held-out splits are not applicable because the label is not produced by a learned classifier, but the added sensitivity results will address concerns about label stability. Revision: yes.
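A threshold sensitivity check of the kind the authors promise could look like the following sketch, where per-query accuracies over repeated runs on similar examples stand in for the paper's measurements (data and thresholds are illustrative):

```python
def mismatch_rate(samples, threshold: float) -> float:
    """samples: (acc_without_tool, acc_with_tool, called_tool) per query,
    with accuracies estimated over repeated runs on similar examples."""
    mismatches = 0
    for acc_wo, acc_w, called in samples:
        # Necessary iff the model is below threshold alone but at/above it with the tool.
        necessary = acc_wo < threshold <= acc_w
        mismatches += necessary != called
    return mismatches / len(samples)

samples = [
    (0.20, 0.90, False),
    (0.10, 0.95, True),
    (0.75, 0.90, False),
    (0.85, 0.80, True),
]
for t in (0.7, 0.8, 0.9):
    print(t, mismatch_rate(samples, t))
```

A mismatch rate that stays within a narrow band across thresholds would support the claim that the reported ranges are not artifacts of label construction.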
- Referee: §5 (trajectory analysis): the conclusion that the majority of mismatches occur in the cognition-to-action transition relies on linear probes faithfully reflecting internal stages; however, no causal interventions (e.g., activation patching along the necessity direction) are performed to test whether the probed signal actually participates in next-token computation for tool calls, leaving the gap interpretation correlational rather than mechanistic.
Authors: We acknowledge that causal interventions such as activation patching would provide stronger mechanistic confirmation. Performing these experiments requires substantial additional compute and falls outside the scope of the present diagnostic study. The linear-probe and trajectory results remain correlational but offer interpretable evidence for the two-stage decomposition. In the revision we will explicitly state this limitation in §5 and the discussion, while preserving the current analysis as a useful diagnostic. Revision: partial.
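For concreteness, the intervention the referee asks for can be illustrated in a toy linear setting: shift a hidden state along an assumed necessity direction and check whether a tool-call readout flips. This is a cartoon of activation patching with hypothetical directions, not the paper's models:

```python
import numpy as np

# Assumed directions (hypothetical): the necessity probe and the readout
# that drives the next-token tool call are orthogonal here, mirroring the
# paper's late-layer finding.
w_necessity = np.array([1.0, 0.0, 0.0])   # necessity probe direction
w_tool_logit = np.array([0.0, 1.0, 0.0])  # tool-call readout direction

h = np.array([-0.5, -0.2, 0.3])           # hidden state: no tool call fires
patched = h + 3.0 * w_necessity           # patch: push along necessity direction

before = float(w_tool_logit @ h) > 0
after = float(w_tool_logit @ patched) > 0
print(before, after)  # -> False False
```

Because the readout is orthogonal to the necessity direction, the patch leaves the action unchanged; observing the same pattern in a real model would turn the correlational orthogonality result into causal evidence for the knowing-doing gap.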
Circularity Check
No significant circularity; derivation rests on independent empirical measurements
Full rationale
The model-adaptive necessity label is constructed directly from each model's observed performance on the arithmetic and factual QA tasks, after which the paper measures the mismatch against actual tool-call behavior as a separate observation (26.5-54.0% and 30.8-41.8%). The two-stage decomposition and linear probing of hidden states constitute an additional diagnostic measurement of decodability and orthogonality; no equation or reduction equates the reported knowing-doing gap to a fitted parameter, self-citation chain, or renamed input by construction. The central claim therefore remains self-contained against the external performance benchmarks and does not collapse into its own definitions.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Tool necessity for a given model can be defined by whether that model solves the task correctly without tools on similar examples.
- Domain assumption: Linear probes on hidden states can recover the internal cognition signal for tool necessity.
Reference graph
Works this paper leans on
- [1] Anthropic. Introducing the model context protocol, November 2024. URL https://www.anthropic.com/news/model-context-protocol
- [2] Lida Chen, Zujie Liang, Xintao Wang, Jiaqing Liang, Yanghua Xiao, Feng Wei, Jinglei Chen, Zhenghong Hao, Bing Han, and Wei Wang. Teaching large language models to express knowledge boundary from their own signals, 2024. URL https://arxiv.org/abs/2406.10881
- [3] Yize Cheng, Arshia Soltani Moakhar, Chenrui Fan, Parsa Hosseini, Kazem Faghih, Zahra Sodagar, Wenxiao Wang, and Soheil Feizi. Your llm agents are temporally blind: The misalignment between tool use decisions and human time perception, 2026. URL https://arxiv.org/abs/2510.23853
- [4] Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov, and Rajagopal Venkatesaramani. Therefore I am. I think, 2026. URL https://arxiv.org/abs/2604.01202
- [5] Kazem Faghih, Wenxiao Wang, Yize Cheng, Siddhant Bharti, Gaurang Sriramanan, Sriram Balasubramanian, Parsa Hosseini, and Soheil Feizi. Gaming tool preferences in agentic llms. arXiv preprint arXiv:2505.18135, 2025
- [6] Google. Agent2agent (A2A) protocol. https://google.github.io/A2A/, 2025
- [7] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
- [8] Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, and Lichao Sun. MetaTool benchmark for large language models: Deciding whether to use tools and which to use, 2024. URL https://arxiv.org/abs/2310.03128
- [9] Li Ji-An, Hua-Dong Xiong, Robert C. Wilson, Marcelo G. Mattar, and Marcus K. Benna. Language models are capable of metacognitive monitoring and control of their internal activations,
- [10]
- [11] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, ... Language models (mostly) know what they know, 2022
- [12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980
- [13] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244, 2023
- [14] Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, and Yong Liu. Adaptive tool use in large language models with meta-cognition trigger. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13346–13370, 2025
- [15] Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhiwei Li, Bao-Long Bi, Ling-Rui Mei, Junfeng Fang, Xiao Liang, Zhijiang Guo, Le Song, and Cheng-Lin Liu. From system 1 to system 2: A survey of reasoning large language models. URL https://arxiv.org/abs/2502.17419
- [16]
- [17] Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words, 2022. URL https://arxiv.org/abs/2205.14334
- [18] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958
- [19] Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo... 2025
- [20] B.W. Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451, 1975. ISSN 0005-2795. doi: https://doi.org/10.1016/0005-2795(75)90109-9. URL https://www.sciencedirect.com/science/article/pii/0005279575901099
- [21] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023
- [22] Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022
- [23] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025
- [24] Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. SMART: Self-aware agent for tool overuse mitigation, 2025. URL https://arxiv.org/abs/2502.11435
- [25] Hayley Ross, Ameya Sunil Mahabaleshwarkar, and Yoshi Suhara. When2call: When (not) to call tools. arXiv preprint arXiv:2504.18851, 2025
- [26] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023
- [27] Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, and Lichao Sun. Prompt injection attack to tool selection in llm agents. arXiv preprint arXiv:2504.19793, 2025
- [28] Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, Mingbo Song, Hailiang Huang, Cheng Li, Ke Wang, Rong Yao, et al. Restgpt: Connecting large language models with real-world restful apis. arXiv preprint arXiv:2306.06624, 2023
- [29] Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, Amos Storkey, and Kam-Fai Wong. Position: Agent should invoke external tools only when epistemically necessary, 2026. URL https://arxiv.org/abs/2506.00886
- [30] Youjin Wang, Run Zhou, Rong Fu, Shuaishuai Cao, Hongwei Zeng, Jiaxuan Lu, Sicheng Fan, Jiaqiao Zhao, and Liangming Pan. ASA: Training-free representation engineering for tool-calling agents, 2026. URL https://arxiv.org/abs/2602.04935
- [31] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... 2025
- [32] Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. Do large language models know what they don’t know? In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 8653–8665, Toronto, Canada, July 2023. Association for Computational Linguistics. do...
[33]
Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang
Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. R-tuning: Instructing large language models to say ‘i don’t know’,
- [34]
- [35] Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei, Ke Xu, Minlie Huang, and Han Qiu. Stop before you fail: Operational capability boundaries for mitigating unproductive reasoning in large reasoning models, 2026. URL https://arxiv.org/abs/2509.24711
- [36] Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, et al. Toolbehonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models. arXiv preprint arXiv:2406.20015, 2024
- [37] Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs encode harmfulness and refusal separately, 2025. URL https://arxiv.org/abs/2507.11878
- [38] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to AI transparency, 2025