arXiv: 2409.00557 · v4 · submitted 2024-08-31 · cs.CL · cs.AI · cs.SE

Learning to Ask: When LLM Agents Meet Unclear Instruction

Pith reviewed 2026-05-23 21:06 UTC · model grok-4.3

keywords LLM agents · tool learning · unclear instructions · clarifying questions · NoisyToolBench · Ask-when-Needed · function calling · hallucination

The pith

LLMs that ask clarifying questions when instructions are unclear outperform models that guess missing details in tool use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how real user instructions for tool-calling often omit key details, causing LLMs to invent arguments and produce errors. It constructs NoisyToolBench from actual user queries to measure this problem and introduces the Ask-when-Needed framework, which tells the model to query the user instead of proceeding with guesses. Experiments show AwN raises success rates over prior tool-learning methods while also tracking efficiency through an automated evaluator. A sympathetic reader would care because tool-using agents are moving into open-ended settings where perfect instructions cannot be assumed. The work therefore shifts the design focus from perfect prompt engineering to explicit handling of uncertainty through interaction.

Core claim

When instructions lack required arguments for tool calls, next-token prediction causes LLMs to fabricate values rather than seek clarification, producing hallucinations. The Ask-when-Needed framework counters this by inserting a decision step that prompts the model to ask the user for missing information whenever an obstacle from unclear instructions is detected. On the NoisyToolBench benchmark built from real queries, this approach yields higher accuracy and better efficiency scores than existing tool-learning frameworks, as measured by the ToolEvaluator.

What carries the argument

Ask-when-Needed (AwN) framework, which adds a prompting rule that instructs the LLM to ask the user for clarification whenever it detects an obstacle caused by unclear instructions.
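
As a rough illustration of the decision step described above, the following Python sketch checks a tool's required arguments before making a call and returns a clarifying question when any are missing. It is not the paper's implementation: AwN is described as a prompting rule under which the LLM itself detects the obstacle, whereas this sketch swaps in a simple rule-based check. The names `ToolSpec`, `Decision`, `ask_when_needed`, and the `book_flight` tool are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical tool description: a name plus its required argument names."""
    name: str
    required: tuple[str, ...]


@dataclass
class Decision:
    """Either a tool call with complete arguments, or a question for the user."""
    action: str                                  # "call" or "ask"
    arguments: dict = field(default_factory=dict)
    question: str = ""


def ask_when_needed(tool: ToolSpec, extracted: dict) -> Decision:
    """Call the tool only if every required argument was actually supplied.

    `extracted` holds the argument values found in the user's instruction.
    Missing or empty values are never filled in with guesses.
    """
    missing = [a for a in tool.required if extracted.get(a) in (None, "")]
    if missing:
        question = f"To run {tool.name}, please provide: {', '.join(missing)}."
        return Decision(action="ask", question=question)
    return Decision(action="call", arguments={a: extracted[a] for a in tool.required})


# Usage: the instruction names an origin and destination but no date.
flight_tool = ToolSpec("book_flight", ("origin", "destination", "date"))
decision = ask_when_needed(flight_tool, {"origin": "HKG", "destination": "NRT"})
print(decision.action, "|", decision.question)
```

The point of the design is that the "ask" branch is an explicit outcome of the decision step, so a missing value can never silently reach the tool.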

If this is right

  • Tool-calling agents can maintain correctness on vague real-world queries without requiring perfect upfront instructions.
  • The same prompting pattern reduces the rate at which models invent argument values and thereby lowers hallucination risk.
  • An automated evaluator can now score both final accuracy and the number of user turns needed, giving a joint efficiency metric.
  • Benchmarks built from logged user queries provide a more realistic test distribution than synthetic clear-instruction sets.
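
The joint accuracy and efficiency scoring mentioned in the list above could look roughly like the sketch below. The actual ToolEvaluator scoring rules are not given on this page, so everything here is an assumption: the `Episode` fields, the `questions_needed` annotation, and the "redundant questions" measure are hypothetical stand-ins for whatever the paper uses.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One evaluated interaction (all fields are hypothetical)."""
    success: bool          # did the final tool call match the expected call?
    questions_asked: int   # clarification turns the agent used
    questions_needed: int  # annotated number of details that were truly missing


def summarize(episodes: list[Episode]) -> dict:
    """Report accuracy together with two simple efficiency measures."""
    n = len(episodes)
    if n == 0:
        raise ValueError("no episodes to score")
    accuracy = sum(e.success for e in episodes) / n
    mean_questions = sum(e.questions_asked for e in episodes) / n
    # Questions beyond what the annotation says was necessary count as redundant.
    mean_redundant = sum(
        max(0, e.questions_asked - e.questions_needed) for e in episodes
    ) / n
    return {
        "accuracy": accuracy,
        "mean_questions": mean_questions,
        "mean_redundant_questions": mean_redundant,
    }


# Usage with made-up episodes.
report = summarize([
    Episode(success=True, questions_asked=1, questions_needed=1),
    Episode(success=False, questions_asked=0, questions_needed=2),
    Episode(success=True, questions_asked=3, questions_needed=1),
    Episode(success=True, questions_asked=0, questions_needed=0),
])
print(report)
```

Reporting the two numbers side by side keeps an agent from looking good merely by asking about everything, or merely by never asking.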

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may transfer to other interactive LLM settings such as code generation or multi-step planning where partial information is common.
  • If clarification turns become expensive, future work could add a cost threshold inside the decision rule to decide when to ask versus when to abort.
  • Real-user studies could measure whether people actually prefer to answer clarifying questions versus receiving a guessed but possibly wrong result.
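
The cost-threshold idea in the list above can be made concrete with a small expected-value comparison. This is a sketch of one possible rule, not something the paper proposes: the function `choose_action`, its parameters, and the numbers in the usage lines are all hypothetical, and it assumes that aborting has a payoff of zero.

```python
def choose_action(
    p_guess_correct: float,
    p_correct_after_ask: float,
    ask_cost: float,
    wrong_cost: float,
    task_value: float = 1.0,
) -> str:
    """Pick "guess", "ask", or "abort" by comparing expected payoffs.

    A correct tool call earns `task_value`, a wrong one loses `wrong_cost`,
    asking the user costs `ask_cost`, and aborting earns nothing.
    """
    ev_guess = p_guess_correct * task_value - (1 - p_guess_correct) * wrong_cost
    ev_ask = (
        p_correct_after_ask * task_value
        - (1 - p_correct_after_ask) * wrong_cost
        - ask_cost
    )
    ev_abort = 0.0
    # max() keeps the first entry on ties, so guessing wins ties over asking.
    options = [("guess", ev_guess), ("ask", ev_ask), ("abort", ev_abort)]
    return max(options, key=lambda pair: pair[1])[0]


# Usage with made-up numbers.
print(choose_action(0.3, 0.95, ask_cost=0.1, wrong_cost=1.0))  # cheap question, risky guess
print(choose_action(0.9, 0.95, ask_cost=0.5, wrong_cost=0.2))  # expensive question, safe guess
print(choose_action(0.2, 0.4, ask_cost=0.5, wrong_cost=2.0))   # neither option is worth it
```

In practice the probabilities would have to come from some estimate of the model's confidence, which this sketch simply takes as given.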

Load-bearing premise

That asking users for clarification is always feasible and low-cost in the actual settings where these tool agents will be deployed.

What would settle it

A deployment trial in which users are given the option to refuse or ignore clarification requests and the measured task success rate for AwN falls below the guessing baseline.
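
One way to reason about such a trial is to ask how often users would have to refuse before AwN loses its advantage. The sketch below assumes a simple mixture model that is not taken from the paper: with refusal rate r, AwN succeeds at rate (1 - r) * s_answered + r * s_refused. The function names and all success rates in the usage lines are hypothetical.

```python
from typing import Optional


def awn_success(refusal_rate: float, s_answered: float, s_refused: float) -> float:
    """Expected AwN success rate when a fraction of users refuse to answer."""
    if not 0.0 <= refusal_rate <= 1.0:
        raise ValueError("refusal_rate must be in [0, 1]")
    return (1 - refusal_rate) * s_answered + refusal_rate * s_refused


def break_even_refusal_rate(
    s_answered: float, s_refused: float, s_guess: float
) -> Optional[float]:
    """Refusal rate at which AwN merely matches the guessing baseline.

    Returns 0.0 if AwN is never ahead, and None if AwN stays at or above
    the baseline even when every user refuses.
    """
    if s_answered <= s_guess:
        return 0.0
    if s_refused >= s_guess:
        return None
    return (s_answered - s_guess) / (s_answered - s_refused)


# Usage with made-up success rates.
r_star = break_even_refusal_rate(s_answered=0.8, s_refused=0.2, s_guess=0.5)
print(r_star)
print(awn_success(0.75, s_answered=0.8, s_refused=0.2))
```

Under these assumptions, a measured refusal rate above the break-even value would be the outcome that counts against AwN in the trial described above.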

Figures

Figures reproduced from arXiv: 2409.00557 by Chaozheng Wang, Cheryl Lee, Jen-tse Huang, Juluan Shi, Michael R. Lyu, Wenxiang Jiao, Wenxuan Wang, Youliang Yuan, Yuk-Kit Chan, Zixuan Ling.

Figure 1: The motivating example of our Ask-when-Needed (AwN) framework.
Figure 2: The comparison of our QwN prompting com…
Figure 3: Illustration of the Auto-Interaction module.
Original abstract

Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs tool-use under imperfect instructions, we meticulously examine the real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that due to the next-token prediction training objective, LLMs tend to arbitrarily generate the missed argument, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that the AwN significantly outperforms existing frameworks for tool learning in the NoisyToolBench. We will release all related code and datasets to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces NoisyToolBench, a benchmark for LLM tool use under unclear instructions built by examining real user queries and error patterns. It proposes the Ask-when-Needed (AwN) framework, which prompts LLMs to ask clarifying questions when facing obstacles from imprecise instructions. An automated ToolEvaluator is presented to measure accuracy and efficiency without manual intervention. The central claim is that AwN significantly outperforms existing tool-learning frameworks on NoisyToolBench, with code and datasets to be released.

Significance. If the benchmark distribution matches deployment conditions and the ToolEvaluator scores align with human judgment, the work addresses a practical gap in reliable LLM tool calling by offering a lightweight prompting solution to reduce arbitrary argument generation and hallucinations. The explicit commitment to releasing code and datasets supports reproducibility.

major comments (2)
  1. Section 3 (NoisyToolBench): The benchmark is constructed by examining real queries and error patterns, yet no quantitative details are supplied on selection criteria, inter-annotator agreement, coverage of failure modes, or how the unclear-instruction distribution was validated against deployment data. This is load-bearing for the outperformance claim, as the headline result cannot be interpreted without evidence that the benchmark reflects the target distribution.
  2. Section 4.3 (ToolEvaluator): The automated evaluator is introduced to reduce manual labor and produce accuracy/efficiency scores, but the manuscript provides no calibration against human ratings, ablation of scoring rules, or inter-rater reliability metrics. This directly affects the validity of the experimental comparison used to support the central claim that AwN outperforms baselines.
minor comments (1)
  1. Abstract: The abstract states the outperformance result but supplies no numerical metrics, baseline names, or statistical details; moving a concise summary of key numbers into the abstract would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The feedback highlights important aspects of transparency in benchmark construction and evaluator validation that will improve the manuscript. We address each major comment below and will incorporate revisions as described.

Point-by-point responses
  1. Referee: Section 3 (NoisyToolBench): The benchmark is constructed by examining real queries and error patterns, yet no quantitative details are supplied on selection criteria, inter-annotator agreement, coverage of failure modes, or how the unclear-instruction distribution was validated against deployment data. This is load-bearing for the outperformance claim, as the headline result cannot be interpreted without evidence that the benchmark reflects the target distribution.

    Authors: We agree that additional quantitative details would strengthen interpretability of the results. The benchmark was derived from analysis of real user queries collected from public sources and internal logs, with error patterns identified through manual review of failure cases in tool-calling attempts. In the revised manuscript, we will add: (1) explicit selection criteria with counts (e.g., total queries examined and fraction retained), (2) coverage statistics for each identified failure mode, (3) inter-annotator agreement if multiple reviewers were involved in pattern categorization (or a note that primary analysis was performed by the authors with spot-checks), and (4) a direct statement on the absence of head-to-head validation against a specific production deployment distribution, along with the rationale for why the observed patterns are representative. The full dataset and annotation guidelines will be released to allow independent verification.

    Revision: yes

  2. Referee: Section 4.3 (ToolEvaluator): The automated evaluator is introduced to reduce manual labor and produce accuracy/efficiency scores, but the manuscript provides no calibration against human ratings, ablation of scoring rules, or inter-rater reliability metrics. This directly affects the validity of the experimental comparison used to support the central claim that AwN outperforms baselines.

    Authors: We acknowledge the importance of validating the automated ToolEvaluator. The current implementation uses rule-based scoring derived from observable tool-call outcomes and efficiency metrics to approximate human judgment. In the revision, we will add: (1) a calibration study on a held-out subset of 100+ interactions where human raters score accuracy and efficiency, reporting correlation with ToolEvaluator scores, (2) inter-rater reliability (e.g., Cohen's kappa or percentage agreement) among human evaluators, and (3) an ablation of key scoring rules to show sensitivity. If the additional human evaluation cannot be completed before resubmission, we will clearly state this limitation and provide the raw interaction logs so readers can perform their own validation.

    Revision: yes
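
For readers unfamiliar with the reliability measures named in the response above, the following Python sketch computes percentage agreement and Cohen's kappa for two raters using only the standard library. It uses the standard kappa definition, (p_o - p_e) / (1 - p_e); the rating data is invented for illustration and has no connection to the paper's results.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which the two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_o = percent_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / n**2
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Usage with invented binary judgments (1 = tool call judged correct).
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_2 = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
print(percent_agreement(rater_1, rater_2))
print(cohens_kappa(rater_1, rater_2))
```

Kappa is usually reported alongside raw agreement because two raters who both label most items the same way can show high agreement purely by chance.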

Circularity Check

0 steps flagged

No circularity: purely empirical claims on new benchmark

The paper constructs a new benchmark (NoisyToolBench) from real user queries, proposes the AwN framework, introduces ToolEvaluator, and reports empirical outperformance. No equations, fitted parameters, predictions, or derivations are present. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central result is an experimental comparison whose validity depends on benchmark quality and evaluator calibration (external concerns), not on any reduction of outputs to inputs by construction. This is self-contained empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on domain assumptions about LLM prompting behavior and the fidelity of the benchmark to real usage; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: LLMs can reliably detect unclear instructions and generate useful clarification questions when prompted.
    Central to the Ask-when-Needed framework functioning as described.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. $How^{2}$: How to learn from procedural How-to questions

    cs.AI 2025-10 unverdicted novelty 7.0

    $How^{2}$ is a memory agent framework enabling agents to ask, store, and reuse answers to how-to questions at varying abstraction levels for better lifelong planning in environments like Plancraft.
