arxiv: 2605.14038 · v1 · submitted 2026-05-13 · 💻 cs.AI

Recognition: no theorem link

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

Yize Cheng , Chenrui Fan , Mahdi JafariRaviz , Keivan Rezaei , Soheil Feiz

Authors on Pith no claims yet

Pith reviewed 2026-05-15 05:27 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM tool usetool necessityknowing-doing gaplinear probinghidden statesagentic behaviorarithmetic QAfactual QA

0 comments

The pith

LLMs recognize when tools are needed but frequently fail to call them, exposing a knowing-doing gap in their decision process.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines tool necessity adaptively for each model by measuring whether that specific model answers correctly without a tool versus with it. Comparing this label to actual behavior on arithmetic and factual questions reveals mismatches of 26 to 54 percent. The authors split tool use into an internal cognition stage and an execution stage, then probe hidden states to show both signals are linearly readable yet point in nearly orthogonal directions at the final token. Most errors concentrate in the transition from recognition to action rather than in forming the internal judgment itself.

Core claim

Tool necessity is model-dependent because capability boundaries differ across LLMs. When necessity is measured by each model's own accuracy with and without tools, observed tool calls diverge from necessity in 26.5-54.0 percent of arithmetic cases and 30.8-41.8 percent of factual cases. Linear probes recover both the necessity signal and the action signal from hidden states, but the probe directions become nearly orthogonal in late layers. Trajectory analysis shows the majority of mismatches arise during the cognition-to-action transition.

What carries the argument

The model-adaptive tool-necessity label, which marks a query as requiring a tool only when the model fails without it yet succeeds with it, together with linear probes that isolate the internal cognition representation from the execution decision.

If this is right

Reliable tool use requires separate training signals for the recognition step and the action step rather than a single end-to-end objective.
Late-layer representations must be regularized so that the direction encoding necessity aligns with the direction controlling next-token tool calls.
Tool policies and prompts should be tuned per model size and family because necessity boundaries shift with capability.
Diagnosis via probing can be turned into a monitoring signal to detect when a deployed agent is about to ignore its own internal assessment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The gap may persist or widen as models become more capable if the training objective continues to optimize only for final answer accuracy.
Similar internal-to-output misalignments could appear in other agentic skills such as multi-step planning or self-correction.
Adding an auxiliary loss that minimizes the angle between the necessity probe direction and the tool-call logit direction offers a direct way to close the observed gap.
The same two-stage probe method could be applied to open-weight models to test whether the orthogonality pattern is architecture-dependent.

Load-bearing premise

That empirical success without a tool versus with it correctly identifies what the model 'should' do, and that linear probes on hidden states capture the intended internal state without major distortion from the probing technique.

What would settle it

Train or prompt a model to emit a tool call precisely when the necessity probe fires positive, then measure whether its accuracy rises to match the level achieved by the adaptive definition.

Figures

Figures reproduced from arXiv: 2605.14038 by Chenrui Fan, Keivan Rezaei, Mahdi JafariRaviz, Soheil Feiz, Yize Cheng.

**Figure 1.** Figure 1: Overview of the two stage cognition-execution modeling of LLM tool-use. (Left) Necessity: We introduce a model-adaptive definition of tool necessity based on a model’s empirical ability to consistently answer a query correctly on its own, contrasting with prior model-agnostic approaches. (Middle) Cognition: By probing the model’s internal hidden states h, we identify a linear cognition direction wc that su… view at source ↗

**Figure 2.** Figure 2: Model-dependent tool-call necessity. Each vertical bar represents 0.5% of samples. Green indicates samples answered correctly in all N = 10 no-tool runs; red indicates at least one failure. Within each dataset, samples share the same order across rows, obtained by recursively sorting within each previous model’s correctness partition. · that expose these tools in the syntax expected by each model. We then … view at source ↗

**Figure 3.** Figure 3: Necessity probe performance across token-layer positions. Each cell reports the held-out MCC of a linear probe trained to predict the model-adaptive necessity from the hidden state at a given layer and token position. Darker blue indicates stronger linear separability. The linear separability is strongly task-dependent, and the heatmap structure appears similar within model families. Linear separability of… view at source ↗

**Figure 4.** Figure 4: The action Probe performance on different position in Matthews correlation coefficient. Each cell reports the held-out MCC of a linear probe trained to predict the tool-call action from the hidden state at a given layer and token position. Darker blue indicates stronger linear separability. The action signal appears highly linearly separable in the hidden states, particularly in near-end tokens and late la… view at source ↗

**Figure 5.** Figure 5: The cosine similarity score between wc and wa on different positions. The similarity scores between two probe direction are small across the majority of the area. Although for some models there are moderate similarity scores in late token and middle layer position, two directions fall back to near orthogonal relationship in the late layer of the last token (bottom right corner). and wa become close to orth… view at source ↗

**Figure 6.** Figure 6: Per-sample two-stage decomposition of tool-call behavior on Arithmetic and TruthfulQA, for Qwen3-8B (top) and Llama-3.1-8B-Instruct (bottom). Each flow tracks a sample through three nodes: ground-truth necessity (Factual), the model’s internal cognition of necessity (Cognition), and the executed action (Action). The end-to-end error is dominated by orange flow, where cognition is correct but action flips … view at source ↗

**Figure 7.** Figure 7: The confidence of cognition versus tool calling behavior. The x axis denotes the probe output after sigmoid function, and the y axis represents the tool call probability quantified by Equation 4. The mismatch can persist even when the internal representation strongly indicates that a sample is either tool-necessary or tool-unnecessary. The cognition–execution mismatch is not associated with cognition confi… view at source ↗

read the original abstract

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judge, and mostly cover cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool-necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA dataset, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples in the two-stage process, we further discover that the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool-use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper measures a real mismatch between when models should call tools and when they do, using adaptive labels and probes, but the orthogonality claim does not yet prove a causal knowing-doing failure.

read the letter

The paper shows LLMs often know they need a tool based on their own results but still skip the call, and the authors locate the break in the hidden-state transition from recognition to action. They define necessity per model by checking if it solves the problem without help. On arithmetic and factual questions this produces 26-54% and 31-42% mismatches with actual calls across the four models tested. Linear probes recover both the necessity signal and the call signal, yet those directions sit nearly orthogonal in the late-layer last-token position that controls the output. Tracing the samples shows the errors cluster in the step between the two signals. The model-adaptive label and the staged probe analysis are the useful additions here. They give a practical way to measure the problem without relying on human or generic LLM judgments. The main weakness is that the evidence stays correlational. Orthogonality between probes does not demonstrate that the necessity information is available but ignored during the actual next-token computation. It could reflect probe choice, label noise from the performance-based definition, or the model handling the decision through earlier or non-linear routes the analysis does not capture. A causal test would strengthen the claim. This work is aimed at researchers building or debugging tool-using agents. Anyone measuring reliability in LLM decisions will find the mismatch numbers and the probe split worth looking at. The setup is clear enough that it should go to peer review; the central observation is reproducible and points to a fixable spot even if the exact mechanism needs more work.

Referee Report

3 major / 2 minor

Summary. The paper introduces a model-adaptive definition of tool necessity grounded in each LLM's empirical performance on arithmetic and factual QA tasks, rather than human or LLM-judge annotations. It reports mismatches between this necessity label and observed tool-calling behavior of 26.5-54.0% (arithmetic) and 30.8-41.8% (factual QA) across four models. Linear probes on hidden states show both necessity and tool-call signals are often linearly decodable, yet their directions are nearly orthogonal in the late-layer last-token regime; trajectory analysis attributes most mismatches to the cognition-to-action transition. The central claim is a 'knowing-doing gap' requiring better translation of recognition into action.

Significance. If the probing results and mismatch measurements hold under scrutiny, the work offers a useful diagnostic for LLM agent reliability by separating internal recognition from execution. The model-adaptive necessity definition avoids assuming uniform capabilities across models, which is a clear methodological advance over prior model-agnostic annotations. The linear-probe decomposition provides a concrete way to localize failures, though its implications for training interventions remain to be demonstrated.

major comments (3)

[§4.2] §4.2 (probe analysis): the claim that necessity and tool-call signals are 'nearly orthogonal' in late-layer last-token representations is load-bearing for the knowing-doing gap conclusion, yet the manuscript reports neither cosine similarities nor statistical tests against null distributions; without these, it is impossible to distinguish true dissociation from probe noise or label variance.
[§3] §3 (necessity labeling): the model-adaptive necessity label is derived from post-hoc empirical performance on the same tasks used for behavior measurement; while not circular by construction, the absence of held-out splits, cross-validation details, or sensitivity analysis to performance thresholds makes it unclear whether the reported 26.5-54.0% and 30.8-41.8% mismatch ranges are robust or partly artifacts of label construction.
[§5] §5 (trajectory analysis): the conclusion that the majority of mismatches occur in the cognition-to-action transition relies on linear probes faithfully reflecting internal stages; however, no causal interventions (e.g., activation patching along the necessity direction) are performed to test whether the probed signal actually participates in next-token computation for tool calls, leaving the gap interpretation correlational rather than mechanistic.

minor comments (2)

[Table 1] Table 1: mismatch percentages lack error bars or per-model breakdowns, which would help assess whether the wide ranges (26.5-54.0%) reflect genuine model differences or sampling variability.
[Figure 3] Figure 3: the hidden-state trajectory plots would benefit from explicit axis labels indicating the two probe directions and from reporting the number of samples per trajectory bin.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while indicating where revisions will be made to improve clarity and robustness.

read point-by-point responses

Referee: [§4.2] §4.2 (probe analysis): the claim that necessity and tool-call signals are 'nearly orthogonal' in late-layer last-token representations is load-bearing for the knowing-doing gap conclusion, yet the manuscript reports neither cosine similarities nor statistical tests against null distributions; without these, it is impossible to distinguish true dissociation from probe noise or label variance.

Authors: We agree that explicit quantitative support for the orthogonality claim would strengthen the presentation. In the revised manuscript we will add cosine similarity values between the necessity and tool-call probe directions (observed range 0.08–0.22 in late-layer last-token positions) together with permutation-test results against label-shuffled null distributions (p < 0.01). These statistics will be reported in §4.2 with an accompanying figure. revision: yes
Referee: [§3] §3 (necessity labeling): the model-adaptive necessity label is derived from post-hoc empirical performance on the same tasks used for behavior measurement; while not circular by construction, the absence of held-out splits, cross-validation details, or sensitivity analysis to performance thresholds makes it unclear whether the reported 26.5-54.0% and 30.8-41.8% mismatch ranges are robust or partly artifacts of label construction.

Authors: The necessity label is an empirical definition based on each model’s observed accuracy without tools; it is therefore not circular with the tool-calling decision. To demonstrate robustness we will include, in the revision, a sensitivity analysis that varies the accuracy threshold (70 %, 80 %, 90 %) and reports the resulting mismatch ranges. We will also clarify the threshold-selection procedure. Held-out splits are not applicable because the label is not produced by a learned classifier, but the added sensitivity results will address concerns about label stability. revision: yes
Referee: [§5] §5 (trajectory analysis): the conclusion that the majority of mismatches occur in the cognition-to-action transition relies on linear probes faithfully reflecting internal stages; however, no causal interventions (e.g., activation patching along the necessity direction) are performed to test whether the probed signal actually participates in next-token computation for tool calls, leaving the gap interpretation correlational rather than mechanistic.

Authors: We acknowledge that causal interventions such as activation patching would provide stronger mechanistic confirmation. Performing these experiments requires substantial additional compute and falls outside the scope of the present diagnostic study. The linear-probe and trajectory results remain correlational but offer interpretable evidence for the two-stage decomposition. In the revision we will explicitly state this limitation in §5 and the discussion, while preserving the current analysis as a useful diagnostic. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation rests on independent empirical measurements

full rationale

The model-adaptive necessity label is constructed directly from each model's observed performance on the arithmetic and factual QA tasks, after which the paper measures the mismatch against actual tool-call behavior as a separate observation (26.5-54.0% and 30.8-41.8%). The two-stage decomposition and linear probing of hidden states constitute an additional diagnostic measurement of decodability and orthogonality; no equation or reduction equates the reported knowing-doing gap to a fitted parameter, self-citation chain, or renamed input by construction. The central claim therefore remains self-contained against the external performance benchmarks and does not collapse into its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that empirical performance on held-out examples defines ground-truth necessity and that linear probes on late-layer states capture the relevant internal decision process. No free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Tool necessity for a given model can be defined by whether that model solves the task correctly without tools on similar examples.
This is the grounding for the model-adaptive definition stated in the abstract.
domain assumption Linear probes on hidden states can recover the internal cognition signal for tool necessity.
Invoked when the authors state that both signals are often linearly decodable.

pith-pipeline@v0.9.0 · 5631 in / 1439 out tokens · 40496 ms · 2026-05-15T05:27:37.428940+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 10 internal anchors

[1]

Introducing the model context protocol, November 2024

Anthropic. Introducing the model context protocol, November 2024. URL https://www. anthropic.com/news/model-context-protocol

work page 2024
[2]

Teaching large language models to express knowledge boundary from their own signals, 2024

Lida Chen, Zujie Liang, Xintao Wang, Jiaqing Liang, Yanghua Xiao, Feng Wei, Jinglei Chen, Zhenghong Hao, Bing Han, and Wei Wang. Teaching large language models to express knowledge boundary from their own signals, 2024. URL https://arxiv.org/abs/2406. 10881

work page 2024
[3]

Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception

Yize Cheng, Arshia Soltani Moakhar, Chenrui Fan, Parsa Hosseini, Kazem Faghih, Zahra Sodagar, Wenxiao Wang, and Soheil Feizi. Your llm agents are temporally blind: The misalignment between tool use decisions and human time perception, 2026. URL https: //arxiv.org/abs/2510.23853

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Therefore I am. I Think

Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov, and Rajagopal Venkatesaramani. There- fore i am. i think, 2026. URLhttps://arxiv.org/abs/2604.01202

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

Gaming tool preferences in agentic llms

Kazem Faghih, Wenxiao Wang, Yize Cheng, Siddhant Bharti, Gaurang Sriramanan, Sriram Balasubramanian, Parsa Hosseini, and Soheil Feizi. Gaming tool preferences in agentic llms. arXiv preprint arXiv:2505.18135, 2025

work page arXiv 2025
[6]

Agent2agent (a2a) protocol.https://google.github.io/A2A/, 2025

Google. Agent2agent (a2a) protocol.https://google.github.io/A2A/, 2025

work page 2025
[7]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

MetaTool benchmark for large language models: Deciding whether to use tools and which to use, 2024

Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, and Lichao Sun. MetaTool benchmark for large language models: Deciding whether to use tools and which to use, 2024. URL https://arxiv.org/ abs/2310.03128

work page arXiv 2024
[9]

Wilson, Marcelo G

Li Ji-An, Hua-Dong Xiong, Robert C. Wilson, Marcelo G. Mattar, and Marcus K. Benna. Lan- guage models are capable of metacognitive monitoring and control of their internal activations,

work page
[10]

URLhttps://arxiv.org/abs/2505.13763

work page arXiv
[11]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2017
[13]

Api-bank: A comprehensive benchmark for tool-augmented llms

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244, 2023

work page arXiv 2023
[14]

Adaptive tool use in large language models with meta-cognition trigger

Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, and Yong Liu. Adaptive tool use in large language models with meta-cognition trigger. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 13346–13370, 2025

work page 2025
[15]

From system 1 to system 2: A survey of reasoning large language models,

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhiwei Li, Bao-Long Bi, Ling-Rui Mei, Junfeng Fang, Xiao Liang, Zhijiang Guo, Le Song, and Cheng-Lin Liu. From system 1 to system 2: A survey of reasoning large language models,

work page
[16]

URLhttps://arxiv.org/abs/2502.17419. 10

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Teaching models to express their uncertainty in words, 2022

Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words, 2022. URLhttps://arxiv.org/abs/2205.14334

work page arXiv 2022
[18]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URLhttps://arxiv.org/abs/2109.07958

work page internal anchor Pith review Pith/arXiv arXiv 2022
[19]

Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

work page 2025
[20]

Matthews

B.W. Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme.Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451, 1975. ISSN 0005-2795. doi: https://doi.org/10.1016/0005-2795(75)90109-9. URL https://www. sciencedirect.com/science/article/pii/0005279575901099

work page doi:10.1016/0005-2795(75)90109-9 1975
[21]

Augmented language models: a survey, 2023

Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey.arXiv preprint arXiv:2302.07842, 2023

work page arXiv 2023
[22]

Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

work page arXiv 2022
[23]

The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InF orty-second International Conference on Machine Learning, 2025

work page 2025
[24]

SMART: Self-aware agent for tool overuse mitigation.arXiv preprint arXiv:2502.11435, 2025

Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. SMART: Self-aware agent for tool overuse mitigation, 2025. URL https://arxiv.org/abs/2502.11435

work page arXiv 2025
[25]

When2call: When (not) to call tools.arXiv preprint arXiv:2504.18851, 2025

Hayley Ross, Ameya Sunil Mahabaleshwarkar, and Yoshi Suhara. When2call: When (not) to call tools.arXiv preprint arXiv:2504.18851, 2025

work page arXiv 2025
[26]

Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36: 68539–68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36: 68539–68551, 2023

work page 2023
[27]

Prompt injection attack to tool selection in llm agents.arXiv preprint arXiv:2504.19793, 2025

Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, and Lichao Sun. Prompt injection attack to tool selection in llm agents.arXiv preprint arXiv:2504.19793, 2025

work page arXiv 2025
[28]

Restgpt: Connecting large language models with real-world restful apis.arXiv preprint arXiv:2306.06624, 2023

Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, Mingbo Song, Hailiang Huang, Cheng Li, Ke Wang, Rong Yao, et al. Restgpt: Connecting large language models with real-world restful apis.arXiv preprint arXiv:2306.06624, 2023

work page arXiv 2023
[29]

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, Amos Storkey, and Kam-Fai Wong. Position: Agent should invoke external tools only when epistemically necessary, 2026. URLhttps://arxiv.org/abs/2506.00886

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

ASA: Training-free representation engineering for tool-calling agents, 2026

Youjin Wang, Run Zhou, Rong Fu, Shuaishuai Cao, Hongwei Zeng, Jiaxuan Lu, Sicheng Fan, Jiaqiao Zhao, and Liangming Pan. ASA: Training-free representation engineering for tool-calling agents, 2026. URLhttps://arxiv.org/abs/2602.04935

work page arXiv 2026
[31]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. Do large language models know what they don’t know? In Anna Rogers, Jordan Boyd- Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: ACL 2023, pages 8653–8665, Toronto, Canada, July 2023. Association for Computational Linguistics. do...

work page doi:10.18653/v1/2023.findings-acl.551 2023
[33]

Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang

Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. R-tuning: Instructing large language models to say ‘i don’t know’,

work page
[34]

URLhttps://arxiv.org/abs/2311.09677

work page arXiv
[35]

Stop before you fail: Operational capability boundaries for mitigating unproductive reasoning in large reasoning models, 2026

Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei, Ke Xu, Minlie Huang, and Han Qiu. Stop before you fail: Operational capability boundaries for mitigating unproductive reasoning in large reasoning models, 2026. URLhttps://arxiv.org/abs/2509.24711

work page arXiv 2026
[36]

Toolbehonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models.arXiv preprint arXiv:2406.20015, 2024

Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, et al. Toolbehonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models.arXiv preprint arXiv:2406.20015, 2024

work page arXiv 2024
[37]

LLMs encode harmfulness and refusal separately, 2025

Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs encode harmfulness and refusal separately, 2025. URLhttps://arxiv.org/abs/2507.11878

work page arXiv 2025
[38]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to a...

work page internal anchor Pith review Pith/arXiv arXiv 2025