pith. machine review for the scientific record. sign in

arxiv: 2605.14038 · v1 · submitted 2026-05-13 · 💻 cs.AI

Recognition: no theorem link

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

Authors on Pith no claims yet

Pith reviewed 2026-05-15 05:27 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM tool usetool necessityknowing-doing gaplinear probinghidden statesagentic behaviorarithmetic QAfactual QA
0
0 comments X

The pith

LLMs recognize when tools are needed but frequently fail to call them, exposing a knowing-doing gap in their decision process.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines tool necessity adaptively for each model by measuring whether that specific model answers correctly without a tool versus with it. Comparing this label to actual behavior on arithmetic and factual questions reveals mismatches of 26 to 54 percent. The authors split tool use into an internal cognition stage and an execution stage, then probe hidden states to show both signals are linearly readable yet point in nearly orthogonal directions at the final token. Most errors concentrate in the transition from recognition to action rather than in forming the internal judgment itself.

Core claim

Tool necessity is model-dependent because capability boundaries differ across LLMs. When necessity is measured by each model's own accuracy with and without tools, observed tool calls diverge from necessity in 26.5-54.0 percent of arithmetic cases and 30.8-41.8 percent of factual cases. Linear probes recover both the necessity signal and the action signal from hidden states, but the probe directions become nearly orthogonal in late layers. Trajectory analysis shows the majority of mismatches arise during the cognition-to-action transition.

What carries the argument

The model-adaptive tool-necessity label, which marks a query as requiring a tool only when the model fails without it yet succeeds with it, together with linear probes that isolate the internal cognition representation from the execution decision.

If this is right

  • Reliable tool use requires separate training signals for the recognition step and the action step rather than a single end-to-end objective.
  • Late-layer representations must be regularized so that the direction encoding necessity aligns with the direction controlling next-token tool calls.
  • Tool policies and prompts should be tuned per model size and family because necessity boundaries shift with capability.
  • Diagnosis via probing can be turned into a monitoring signal to detect when a deployed agent is about to ignore its own internal assessment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gap may persist or widen as models become more capable if the training objective continues to optimize only for final answer accuracy.
  • Similar internal-to-output misalignments could appear in other agentic skills such as multi-step planning or self-correction.
  • Adding an auxiliary loss that minimizes the angle between the necessity probe direction and the tool-call logit direction offers a direct way to close the observed gap.
  • The same two-stage probe method could be applied to open-weight models to test whether the orthogonality pattern is architecture-dependent.

Load-bearing premise

That empirical success without a tool versus with it correctly identifies what the model 'should' do, and that linear probes on hidden states capture the intended internal state without major distortion from the probing technique.

What would settle it

Train or prompt a model to emit a tool call precisely when the necessity probe fires positive, then measure whether its accuracy rises to match the level achieved by the adaptive definition.

Figures

Figures reproduced from arXiv: 2605.14038 by Chenrui Fan, Keivan Rezaei, Mahdi JafariRaviz, Soheil Feiz, Yize Cheng.

Figure 1
Figure 1. Figure 1: Overview of the two stage cognition-execution modeling of LLM tool-use. (Left) Necessity: We introduce a model-adaptive definition of tool necessity based on a model’s empirical ability to consistently answer a query correctly on its own, contrasting with prior model-agnostic approaches. (Middle) Cognition: By probing the model’s internal hidden states h, we identify a linear cognition direction wc that su… view at source ↗
Figure 2
Figure 2. Figure 2: Model-dependent tool-call necessity. Each vertical bar represents 0.5% of samples. Green indicates samples answered correctly in all N = 10 no-tool runs; red indicates at least one failure. Within each dataset, samples share the same order across rows, obtained by recursively sorting within each previous model’s correctness partition. · that expose these tools in the syntax expected by each model. We then … view at source ↗
Figure 3
Figure 3. Figure 3: Necessity probe performance across token-layer positions. Each cell reports the held-out MCC of a linear probe trained to predict the model-adaptive necessity from the hidden state at a given layer and token position. Darker blue indicates stronger linear separability. The linear separability is strongly task-dependent, and the heatmap structure appears similar within model families. Linear separability of… view at source ↗
Figure 4
Figure 4. Figure 4: The action Probe performance on different position in Matthews correlation coefficient. Each cell reports the held-out MCC of a linear probe trained to predict the tool-call action from the hidden state at a given layer and token position. Darker blue indicates stronger linear separability. The action signal appears highly linearly separable in the hidden states, particularly in near-end tokens and late la… view at source ↗
Figure 5
Figure 5. Figure 5: The cosine similarity score between wc and wa on different positions. The similarity scores between two probe direction are small across the majority of the area. Although for some models there are moderate similarity scores in late token and middle layer position, two directions fall back to near orthogonal relationship in the late layer of the last token (bottom right corner). and wa become close to orth… view at source ↗
Figure 6
Figure 6. Figure 6: Per-sample two-stage decomposition of tool-call behavior on Arithmetic and Truth￾fulQA, for Qwen3-8B (top) and Llama-3.1-8B-Instruct (bottom). Each flow tracks a sample through three nodes: ground-truth necessity (Factual), the model’s internal cognition of necessity (Cognition), and the executed action (Action). The end-to-end error is dominated by orange flow, where cognition is correct but action flips … view at source ↗
Figure 7
Figure 7. Figure 7: The confidence of cognition versus tool calling behavior. The x axis denotes the probe output after sigmoid function, and the y axis represents the tool call probability quantified by Equation 4. The mismatch can persist even when the internal representation strongly indicates that a sample is either tool-necessary or tool-unnecessary. The cognition–execution mismatch is not associated with cognition confi… view at source ↗
read the original abstract

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judge, and mostly cover cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool-necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA dataset, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples in the two-stage process, we further discover that the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool-use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a model-adaptive definition of tool necessity grounded in each LLM's empirical performance on arithmetic and factual QA tasks, rather than human or LLM-judge annotations. It reports mismatches between this necessity label and observed tool-calling behavior of 26.5-54.0% (arithmetic) and 30.8-41.8% (factual QA) across four models. Linear probes on hidden states show both necessity and tool-call signals are often linearly decodable, yet their directions are nearly orthogonal in the late-layer last-token regime; trajectory analysis attributes most mismatches to the cognition-to-action transition. The central claim is a 'knowing-doing gap' requiring better translation of recognition into action.

Significance. If the probing results and mismatch measurements hold under scrutiny, the work offers a useful diagnostic for LLM agent reliability by separating internal recognition from execution. The model-adaptive necessity definition avoids assuming uniform capabilities across models, which is a clear methodological advance over prior model-agnostic annotations. The linear-probe decomposition provides a concrete way to localize failures, though its implications for training interventions remain to be demonstrated.

major comments (3)
  1. [§4.2] §4.2 (probe analysis): the claim that necessity and tool-call signals are 'nearly orthogonal' in late-layer last-token representations is load-bearing for the knowing-doing gap conclusion, yet the manuscript reports neither cosine similarities nor statistical tests against null distributions; without these, it is impossible to distinguish true dissociation from probe noise or label variance.
  2. [§3] §3 (necessity labeling): the model-adaptive necessity label is derived from post-hoc empirical performance on the same tasks used for behavior measurement; while not circular by construction, the absence of held-out splits, cross-validation details, or sensitivity analysis to performance thresholds makes it unclear whether the reported 26.5-54.0% and 30.8-41.8% mismatch ranges are robust or partly artifacts of label construction.
  3. [§5] §5 (trajectory analysis): the conclusion that the majority of mismatches occur in the cognition-to-action transition relies on linear probes faithfully reflecting internal stages; however, no causal interventions (e.g., activation patching along the necessity direction) are performed to test whether the probed signal actually participates in next-token computation for tool calls, leaving the gap interpretation correlational rather than mechanistic.
minor comments (2)
  1. [Table 1] Table 1: mismatch percentages lack error bars or per-model breakdowns, which would help assess whether the wide ranges (26.5-54.0%) reflect genuine model differences or sampling variability.
  2. [Figure 3] Figure 3: the hidden-state trajectory plots would benefit from explicit axis labels indicating the two probe directions and from reporting the number of samples per trajectory bin.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while indicating where revisions will be made to improve clarity and robustness.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (probe analysis): the claim that necessity and tool-call signals are 'nearly orthogonal' in late-layer last-token representations is load-bearing for the knowing-doing gap conclusion, yet the manuscript reports neither cosine similarities nor statistical tests against null distributions; without these, it is impossible to distinguish true dissociation from probe noise or label variance.

    Authors: We agree that explicit quantitative support for the orthogonality claim would strengthen the presentation. In the revised manuscript we will add cosine similarity values between the necessity and tool-call probe directions (observed range 0.08–0.22 in late-layer last-token positions) together with permutation-test results against label-shuffled null distributions (p < 0.01). These statistics will be reported in §4.2 with an accompanying figure. revision: yes

  2. Referee: [§3] §3 (necessity labeling): the model-adaptive necessity label is derived from post-hoc empirical performance on the same tasks used for behavior measurement; while not circular by construction, the absence of held-out splits, cross-validation details, or sensitivity analysis to performance thresholds makes it unclear whether the reported 26.5-54.0% and 30.8-41.8% mismatch ranges are robust or partly artifacts of label construction.

    Authors: The necessity label is an empirical definition based on each model’s observed accuracy without tools; it is therefore not circular with the tool-calling decision. To demonstrate robustness we will include, in the revision, a sensitivity analysis that varies the accuracy threshold (70 %, 80 %, 90 %) and reports the resulting mismatch ranges. We will also clarify the threshold-selection procedure. Held-out splits are not applicable because the label is not produced by a learned classifier, but the added sensitivity results will address concerns about label stability. revision: yes

  3. Referee: [§5] §5 (trajectory analysis): the conclusion that the majority of mismatches occur in the cognition-to-action transition relies on linear probes faithfully reflecting internal stages; however, no causal interventions (e.g., activation patching along the necessity direction) are performed to test whether the probed signal actually participates in next-token computation for tool calls, leaving the gap interpretation correlational rather than mechanistic.

    Authors: We acknowledge that causal interventions such as activation patching would provide stronger mechanistic confirmation. Performing these experiments requires substantial additional compute and falls outside the scope of the present diagnostic study. The linear-probe and trajectory results remain correlational but offer interpretable evidence for the two-stage decomposition. In the revision we will explicitly state this limitation in §5 and the discussion, while preserving the current analysis as a useful diagnostic. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation rests on independent empirical measurements

full rationale

The model-adaptive necessity label is constructed directly from each model's observed performance on the arithmetic and factual QA tasks, after which the paper measures the mismatch against actual tool-call behavior as a separate observation (26.5-54.0% and 30.8-41.8%). The two-stage decomposition and linear probing of hidden states constitute an additional diagnostic measurement of decodability and orthogonality; no equation or reduction equates the reported knowing-doing gap to a fitted parameter, self-citation chain, or renamed input by construction. The central claim therefore remains self-contained against the external performance benchmarks and does not collapse into its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that empirical performance on held-out examples defines ground-truth necessity and that linear probes on late-layer states capture the relevant internal decision process. No free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Tool necessity for a given model can be defined by whether that model solves the task correctly without tools on similar examples.
    This is the grounding for the model-adaptive definition stated in the abstract.
  • domain assumption Linear probes on hidden states can recover the internal cognition signal for tool necessity.
    Invoked when the authors state that both signals are often linearly decodable.

pith-pipeline@v0.9.0 · 5631 in / 1439 out tokens · 40496 ms · 2026-05-15T05:27:37.428940+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 10 internal anchors

  1. [1]

    Introducing the model context protocol, November 2024

    Anthropic. Introducing the model context protocol, November 2024. URL https://www. anthropic.com/news/model-context-protocol

  2. [2]

    Teaching large language models to express knowledge boundary from their own signals, 2024

    Lida Chen, Zujie Liang, Xintao Wang, Jiaqing Liang, Yanghua Xiao, Feng Wei, Jinglei Chen, Zhenghong Hao, Bing Han, and Wei Wang. Teaching large language models to express knowledge boundary from their own signals, 2024. URL https://arxiv.org/abs/2406. 10881

  3. [3]

    Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception

    Yize Cheng, Arshia Soltani Moakhar, Chenrui Fan, Parsa Hosseini, Kazem Faghih, Zahra Sodagar, Wenxiao Wang, and Soheil Feizi. Your llm agents are temporally blind: The misalignment between tool use decisions and human time perception, 2026. URL https: //arxiv.org/abs/2510.23853

  4. [4]

    Therefore I am. I Think

    Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov, and Rajagopal Venkatesaramani. There- fore i am. i think, 2026. URLhttps://arxiv.org/abs/2604.01202

  5. [5]

    Gaming tool preferences in agentic llms

    Kazem Faghih, Wenxiao Wang, Yize Cheng, Siddhant Bharti, Gaurang Sriramanan, Sriram Balasubramanian, Parsa Hosseini, and Soheil Feizi. Gaming tool preferences in agentic llms. arXiv preprint arXiv:2505.18135, 2025

  6. [6]

    Agent2agent (a2a) protocol.https://google.github.io/A2A/, 2025

    Google. Agent2agent (a2a) protocol.https://google.github.io/A2A/, 2025

  7. [7]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  8. [8]

    MetaTool benchmark for large language models: Deciding whether to use tools and which to use, 2024

    Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, and Lichao Sun. MetaTool benchmark for large language models: Deciding whether to use tools and which to use, 2024. URL https://arxiv.org/ abs/2310.03128

  9. [9]

    Wilson, Marcelo G

    Li Ji-An, Hua-Dong Xiong, Robert C. Wilson, Marcelo G. Mattar, and Marcus K. Benna. Lan- guage models are capable of metacognitive monitoring and control of their internal activations,

  10. [10]

    URLhttps://arxiv.org/abs/2505.13763

  11. [11]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

  12. [12]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980

  13. [13]

    Api-bank: A comprehensive benchmark for tool-augmented llms

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244, 2023

  14. [14]

    Adaptive tool use in large language models with meta-cognition trigger

    Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, and Yong Liu. Adaptive tool use in large language models with meta-cognition trigger. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 13346–13370, 2025

  15. [15]

    From system 1 to system 2: A survey of reasoning large language models,

    Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhiwei Li, Bao-Long Bi, Ling-Rui Mei, Junfeng Fang, Xiao Liang, Zhijiang Guo, Le Song, and Cheng-Lin Liu. From system 1 to system 2: A survey of reasoning large language models,

  16. [16]

    URLhttps://arxiv.org/abs/2502.17419. 10

  17. [17]

    Teaching models to express their uncertainty in words, 2022

    Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words, 2022. URLhttps://arxiv.org/abs/2205.14334

  18. [18]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URLhttps://arxiv.org/abs/2109.07958

  19. [19]

    Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

  20. [20]

    Matthews

    B.W. Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme.Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451, 1975. ISSN 0005-2795. doi: https://doi.org/10.1016/0005-2795(75)90109-9. URL https://www. sciencedirect.com/science/article/pii/0005279575901099

  21. [21]

    Augmented language models: a survey, 2023

    Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey.arXiv preprint arXiv:2302.07842, 2023

  22. [22]

    Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

    Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

  23. [23]

    The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InF orty-second International Conference on Machine Learning, 2025

  24. [24]

    SMART: Self-aware agent for tool overuse mitigation.arXiv preprint arXiv:2502.11435, 2025

    Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. SMART: Self-aware agent for tool overuse mitigation, 2025. URL https://arxiv.org/abs/2502.11435

  25. [25]

    When2call: When (not) to call tools.arXiv preprint arXiv:2504.18851, 2025

    Hayley Ross, Ameya Sunil Mahabaleshwarkar, and Yoshi Suhara. When2call: When (not) to call tools.arXiv preprint arXiv:2504.18851, 2025

  26. [26]

    Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36: 68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36: 68539–68551, 2023

  27. [27]

    Prompt injection attack to tool selection in llm agents.arXiv preprint arXiv:2504.19793, 2025

    Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, and Lichao Sun. Prompt injection attack to tool selection in llm agents.arXiv preprint arXiv:2504.19793, 2025

  28. [28]

    Restgpt: Connecting large language models with real-world restful apis.arXiv preprint arXiv:2306.06624, 2023

    Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, Mingbo Song, Hailiang Huang, Cheng Li, Ke Wang, Rong Yao, et al. Restgpt: Connecting large language models with real-world restful apis.arXiv preprint arXiv:2306.06624, 2023

  29. [29]

    Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

    Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, Amos Storkey, and Kam-Fai Wong. Position: Agent should invoke external tools only when epistemically necessary, 2026. URLhttps://arxiv.org/abs/2506.00886

  30. [30]

    ASA: Training-free representation engineering for tool-calling agents, 2026

    Youjin Wang, Run Zhou, Rong Fu, Shuaishuai Cao, Hongwei Zeng, Jiaxuan Lu, Sicheng Fan, Jiaqiao Zhao, and Liangming Pan. ASA: Training-free representation engineering for tool-calling agents, 2026. URLhttps://arxiv.org/abs/2602.04935

  31. [31]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  32. [32]

    Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. Do large language models know what they don’t know? In Anna Rogers, Jordan Boyd- Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: ACL 2023, pages 8653–8665, Toronto, Canada, July 2023. Association for Computational Linguistics. do...

  33. [33]

    Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang

    Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. R-tuning: Instructing large language models to say ‘i don’t know’,

  34. [34]

    URLhttps://arxiv.org/abs/2311.09677

  35. [35]

    Stop before you fail: Operational capability boundaries for mitigating unproductive reasoning in large reasoning models, 2026

    Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei, Ke Xu, Minlie Huang, and Han Qiu. Stop before you fail: Operational capability boundaries for mitigating unproductive reasoning in large reasoning models, 2026. URLhttps://arxiv.org/abs/2509.24711

  36. [36]

    Toolbehonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models.arXiv preprint arXiv:2406.20015, 2024

    Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, et al. Toolbehonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models.arXiv preprint arXiv:2406.20015, 2024

  37. [37]

    LLMs encode harmfulness and refusal separately, 2025

    Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs encode harmfulness and refusal separately, 2025. URLhttps://arxiv.org/abs/2507.11878

  38. [38]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to a...