Learning to Ask: When LLM Agents Meet Unclear Instruction
Pith reviewed 2026-05-23 21:06 UTC · model grok-4.3
The pith
LLMs that ask clarifying questions when instructions are unclear outperform models that guess missing details in tool use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When instructions lack required arguments for tool calls, next-token prediction causes LLMs to fabricate values rather than seek clarification, producing hallucinations. The Ask-when-Needed framework counters this by inserting a decision step that prompts the model to ask the user for missing information whenever an obstacle from unclear instructions is detected. On the NoisyToolBench benchmark built from real queries, this approach yields higher accuracy and better efficiency scores than existing tool-learning frameworks, as measured by the ToolEvaluator.
What carries the argument
Ask-when-Needed (AwN) framework, which adds a prompting rule that instructs the LLM to ask the user for clarification whenever it detects an obstacle caused by unclear instructions.
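The decision step can be pictured with a small sketch. The paper implements it as a prompting rule, so the deterministic required-argument check below is only a stand-in for the model's own detection of an obstacle. `ToolSpec`, `decide`, and the `book_flight` tool are hypothetical names introduced for illustration, not part of AwN.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical tool description: a name plus its required argument names."""
    name: str
    required: list[str] = field(default_factory=list)


def decide(tool: ToolSpec, extracted: dict[str, str]) -> dict:
    """Return a tool call if every required argument is present, else a question."""
    # An argument counts as missing if it is absent or empty.
    missing = [arg for arg in tool.required if not extracted.get(arg)]
    if missing:
        # Ask the user instead of fabricating a value (assumed behaviour).
        question = f"Could you provide: {', '.join(missing)}?"
        return {"action": "ask", "missing": missing, "question": question}
    return {"action": "call", "tool": tool.name, "arguments": dict(extracted)}


book = ToolSpec("book_flight", required=["origin", "destination", "date"])
partial = decide(book, {"origin": "SFO", "destination": "JFK"})
complete = decide(book, {"origin": "SFO", "destination": "JFK", "date": "2025-01-10"})
print(partial["action"], partial["missing"])
print(complete["action"])
```

In an actual agent the `missing` list would presumably come from the model's judgment rather than a schema lookup; the point of the sketch is only that the "ask" branch is taken before any tool call is issued.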
If this is right
- Tool-calling agents can maintain correctness on vague real-world queries without requiring perfect upfront instructions.
- The same prompting pattern reduces the rate at which models invent argument values and thereby lowers hallucination risk.
- An automated evaluator can now score both final accuracy and the number of user turns needed, giving a joint efficiency metric.
- Benchmarks built from logged user queries provide a more realistic test distribution than synthetic clear-instruction sets.
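To make the joint accuracy and efficiency reading concrete, here is a minimal sketch that summarizes a batch of interaction logs. The log format and the two summary statistics are assumptions for illustration; the source does not specify how ToolEvaluator computes its scores.

```python
from statistics import mean

# Hypothetical logs: (task_succeeded, number_of_clarifying_questions_asked)
logs = [(True, 1), (True, 0), (False, 3), (True, 2)]


def summarize(records):
    """Report success rate alongside the average number of questions asked."""
    accuracy = mean([1.0 if ok else 0.0 for ok, _ in records])
    avg_questions = mean([questions for _, questions in records])
    return {"accuracy": accuracy, "avg_questions": avg_questions}


summary = summarize(logs)
print(summary)
```

Reporting the two numbers side by side, rather than folding them into one score, keeps the trade-off between correctness and user burden visible.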
Where Pith is reading between the lines
- The approach may transfer to other interactive LLM settings such as code generation or multi-step planning where partial information is common.
- If clarification turns become expensive, future work could add a cost threshold inside the decision rule to decide when to ask versus when to abort.
- Real-user studies could measure whether people actually prefer to answer clarifying questions versus receiving a guessed but possibly wrong result.
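The cost-threshold idea in the list above can be sketched as an expected-loss comparison. Everything here is hypothetical: the function name, the cost parameters, and the question budget are illustrative assumptions, not something the paper proposes.

```python
def choose_action(p_guess_correct: float, ask_cost: float, error_cost: float,
                  questions_asked: int = 0, max_questions: int = 3) -> str:
    """Pick 'ask', 'guess', or 'abort' under an assumed cost model."""
    # Expected loss of guessing: probability of a wrong guess times its cost.
    expected_guess_loss = (1.0 - p_guess_correct) * error_cost
    if expected_guess_loss <= ask_cost:
        return "guess"   # guessing is cheap enough relative to asking
    if questions_asked >= max_questions:
        return "abort"   # asking is warranted but the question budget is spent
    return "ask"


print(choose_action(0.9, ask_cost=1.0, error_cost=5.0))
print(choose_action(0.2, ask_cost=1.0, error_cost=5.0))
print(choose_action(0.2, ask_cost=1.0, error_cost=5.0, questions_asked=3))
```

The hard part in practice would be estimating `p_guess_correct`; the sketch takes it as given.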
Load-bearing premise
That asking users for clarification is always feasible and low-cost in the actual settings where these tool agents will be deployed.
What would settle it
A deployment trial in which users are given the option to refuse or ignore clarification requests and the measured task success rate for AwN falls below the guessing baseline.
Original abstract
Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools to address a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs' tool use under imperfect instructions, we meticulously examine real-world instructions issued by users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that, due to the next-token prediction training objective, LLMs tend to arbitrarily generate missing arguments, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs' performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that AwN significantly outperforms existing frameworks for tool learning on NoisyToolBench. We will release all related code and datasets to support future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces NoisyToolBench, a benchmark for LLM tool use under unclear instructions built by examining real user queries and error patterns. It proposes the Ask-when-Needed (AwN) framework, which prompts LLMs to ask clarifying questions when facing obstacles from imprecise instructions. An automated ToolEvaluator is presented to measure accuracy and efficiency without manual intervention. The central claim is that AwN significantly outperforms existing tool-learning frameworks on NoisyToolBench, with code and datasets to be released.
Significance. If the benchmark distribution matches deployment conditions and the ToolEvaluator scores align with human judgment, the work addresses a practical gap in reliable LLM tool calling by offering a lightweight prompting solution to reduce arbitrary argument generation and hallucinations. The explicit commitment to releasing code and datasets supports reproducibility.
Major comments (2)
- Section 3 (NoisyToolBench): The benchmark is constructed by examining real queries and error patterns, yet no quantitative details are supplied on selection criteria, inter-annotator agreement, coverage of failure modes, or how the unclear-instruction distribution was validated against deployment data. This is load-bearing for the outperformance claim, as the headline result cannot be interpreted without evidence that the benchmark reflects the target distribution.
- Section 4.3 (ToolEvaluator): The automated evaluator is introduced to reduce manual labor and produce accuracy/efficiency scores, but the manuscript provides no calibration against human ratings, ablation of scoring rules, or inter-rater reliability metrics. This directly affects the validity of the experimental comparison used to support the central claim that AwN outperforms baselines.
Minor comments (1)
- Abstract: The abstract states the outperformance result but supplies no numerical metrics, baseline names, or statistical details; moving a concise summary of key numbers into the abstract would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. The feedback highlights important aspects of transparency in benchmark construction and evaluator validation that will improve the manuscript. We address each major comment below and will incorporate revisions as described.
- Referee: Section 3 (NoisyToolBench): The benchmark is constructed by examining real queries and error patterns, yet no quantitative details are supplied on selection criteria, inter-annotator agreement, coverage of failure modes, or how the unclear-instruction distribution was validated against deployment data. This is load-bearing for the outperformance claim, as the headline result cannot be interpreted without evidence that the benchmark reflects the target distribution.
Authors: We agree that additional quantitative details would strengthen interpretability of the results. The benchmark was derived from analysis of real user queries collected from public sources and internal logs, with error patterns identified through manual review of failure cases in tool-calling attempts. In the revised manuscript, we will add: (1) explicit selection criteria with counts (e.g., total queries examined and fraction retained), (2) coverage statistics for each identified failure mode, (3) inter-annotator agreement if multiple reviewers were involved in pattern categorization (or a note that primary analysis was performed by the authors with spot-checks), and (4) a direct statement on the absence of head-to-head validation against a specific production deployment distribution, along with the rationale for why the observed patterns are representative. The full dataset and annotation guidelines will be released to allow independent verification.
Revision: yes
- Referee: Section 4.3 (ToolEvaluator): The automated evaluator is introduced to reduce manual labor and produce accuracy/efficiency scores, but the manuscript provides no calibration against human ratings, ablation of scoring rules, or inter-rater reliability metrics. This directly affects the validity of the experimental comparison used to support the central claim that AwN outperforms baselines.
Authors: We acknowledge the importance of validating the automated ToolEvaluator. The current implementation uses rule-based scoring derived from observable tool-call outcomes and efficiency metrics to approximate human judgment. In the revision, we will add: (1) a calibration study on a held-out subset of 100+ interactions where human raters score accuracy and efficiency, reporting correlation with ToolEvaluator scores, (2) inter-rater reliability (e.g., Cohen's kappa or percentage agreement) among human evaluators, and (3) an ablation of key scoring rules to show sensitivity. If the additional human evaluation cannot be completed before resubmission, we will clearly state this limitation and provide the raw interaction logs so readers can perform their own validation.
Revision: yes
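For readers unfamiliar with the reliability statistic named in the rebuttal, the following is a minimal sketch of Cohen's kappa for two raters assigning categorical labels, using the standard definition kappa = (p_o - p_e) / (1 - p_e). The label lists are made-up illustrative data, not results from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Illustrative binary success labels (1 = task judged successful).
first = [1, 1, 0, 1, 0, 0, 1, 1]
second = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(first, second)
print(round(kappa, 3))
```

The same function could be applied to one human rater versus the automated evaluator's labels, which is one way to read the proposed calibration study; that pairing is a suggestion here, not something the authors commit to.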
Circularity Check
No circularity: purely empirical claims on new benchmark
The paper constructs a new benchmark (NoisyToolBench) from real user queries, proposes the AwN framework, introduces ToolEvaluator, and reports empirical outperformance. No equations, fitted parameters, predictions, or derivations are present. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central result is an experimental comparison whose validity depends on benchmark quality and evaluator calibration (external concerns), not on any reduction of outputs to inputs by construction. This is self-contained empirical work.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: LLMs can reliably detect unclear instructions and generate useful clarification questions when prompted.
Forward citations
Cited by 1 Pith paper
- $How^{2}$: How to learn from procedural How-to questions
  $How^{2}$ is a memory agent framework enabling agents to ask, store, and reuse answers to how-to questions at varying abstraction levels for better lifelong planning in environments like Plancraft.