{"paper":{"title":"Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use","license":"http://creativecommons.org/licenses/by/4.0/","headline":"LLMs recognize when tools are needed but frequently fail to call them, exposing a knowing-doing gap in their decision process.","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Chenrui Fan, Keivan Rezaei, Mahdi JafariRaviz, Soheil Feiz, Yize Cheng","submitted_at":"2026-05-13T18:59:28Z","abstract_excerpt":"Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judge, and mostly cover cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"These results reveal a knowing-doing gap in LLM tool-use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the model-adaptive definition of tool necessity (grounded in each model's empirical performance) correctly captures what 'should' trigger a tool call, and that linear probes on hidden states faithfully reflect the internal cognition stage without significant distortion from the probing method itself.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LLMs recognize when tools are needed but frequently fail to call them, exposing a knowing-doing gap in their decision process.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7389c4f7c42160833c39135d60b86377241d8777e7978b8f51c152691ae72ea5"},"source":{"id":"2605.14038","kind":"arxiv","version":1},"verdict":{"id":"fbda25f3-0837-420d-a649-515148eb7cce","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T05:27:37.428940Z","strongest_claim":"These results reveal a knowing-doing gap in LLM tool-use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.","one_line_summary":"LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the model-adaptive definition of tool necessity (grounded in each model's empirical performance) correctly captures what 'should' trigger a tool call, and that linear probes on hidden states faithfully reflect the internal cognition stage without significant distortion from the probing method itself.","pith_extraction_headline":"LLMs recognize when tools are needed but frequently fail to call them, exposing a knowing-doing gap in their decision process."},"references":{"count":38,"sample":[{"doi":"","year":2024,"title":"Introducing the model context protocol, November 2024","work_id":"3484b828-86c4-4d45-8606-a07aafc042a2","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Teaching large language models to express knowledge boundary from their own signals, 2024","work_id":"84ae925c-0412-45b6-8448-5bb424a7c867","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception","work_id":"d297d2f5-6dc6-4a70-a2db-6c3a225dd5ad","ref_index":3,"cited_arxiv_id":"2510.23853","is_internal_anchor":true},{"doi":"","year":2026,"title":"Therefore I am. I Think","work_id":"439dc264-9c28-4052-a3b5-d4e344c667f2","ref_index":4,"cited_arxiv_id":"2604.01202","is_internal_anchor":true},{"doi":"","year":2025,"title":"Balasub- ramanian, Parsa Hosseini, and S","work_id":"01112f77-0708-480d-aeab-24828abbd8bb","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":38,"snapshot_sha256":"4ccc54235ef8e9ee9a72b71c0b1e3aaccb4d24c9fe204a75095dd48405144d8d","internal_anchors":14},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}