Learning to Ask: When LLM Agents Meet Unclear Instruction

Chaozheng Wang; Cheryl Lee; Jen-tse Huang; Juluan Shi; Michael R. Lyu; Wenxiang Jiao; Wenxuan Wang; Youliang Yuan; Yuk-Kit Chan; Zixuan Ling

arXiv: 2409.00557 · v4 · submitted 2024-08-31 · cs.CL · cs.AI · cs.SE

Pith reviewed 2026-05-23 21:06 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.SE

keywords: LLM agents · tool learning · unclear instructions · clarifying questions · NoisyToolBench · Ask-when-Needed · function calling · hallucination

The pith

LLMs that ask clarifying questions when instructions are unclear outperform models that guess missing details in tool use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how real user instructions for tool-calling often omit key details, causing LLMs to invent arguments and produce errors. It constructs NoisyToolBench from actual user queries to measure this problem and introduces the Ask-when-Needed framework, which tells the model to query the user instead of proceeding with guesses. Experiments show AwN raises success rates over prior tool-learning methods while also tracking efficiency through an automated evaluator. A sympathetic reader would care because tool-using agents are moving into open-ended settings where perfect instructions cannot be assumed. The work therefore shifts the design focus from perfect prompt engineering to explicit handling of uncertainty through interaction.

Core claim

When instructions lack required arguments for tool calls, next-token prediction causes LLMs to fabricate values rather than seek clarification, producing hallucinations. The Ask-when-Needed framework counters this by inserting a decision step that prompts the model to ask the user for missing information whenever an obstacle from unclear instructions is detected. On the NoisyToolBench benchmark built from real queries, this approach yields higher accuracy and better efficiency scores than existing tool-learning frameworks, as measured by the ToolEvaluator.

What carries the argument

Ask-when-Needed (AwN) framework, which adds a prompting rule that instructs the LLM to ask the user for clarification whenever it detects an obstacle caused by unclear instructions.
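
The decision step can be illustrated with a small sketch. AwN as described here is a prompting rule, so the rule-based check below is only an analogue of the idea and not the authors' method. `ToolSpec`, `Decision`, `ask_when_needed`, and the `book_flight` tool are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical tool description: a name plus its required argument names."""
    name: str
    required: list[str]


@dataclass
class Decision:
    """Either a clarifying question for the user or a ready-to-run tool call."""
    action: str  # "ask" or "call"
    question: str = ""
    arguments: dict = field(default_factory=dict)


def ask_when_needed(tool: ToolSpec, extracted: dict) -> Decision:
    """Ask for missing required arguments instead of inventing values."""
    missing = [arg for arg in tool.required if extracted.get(arg) in (None, "")]
    if missing:
        question = f"To run {tool.name}, could you provide: {', '.join(missing)}?"
        return Decision(action="ask", question=question)
    return Decision(action="call", arguments=dict(extracted))


# Hypothetical tool; not taken from NoisyToolBench.
book_flight = ToolSpec(name="book_flight", required=["origin", "destination", "date"])

# The user only said "book me a flight to Paris": origin and date were never given.
first = ask_when_needed(book_flight, {"destination": "Paris"})

# Once the user answers, every required argument is present and the call proceeds.
second = ask_when_needed(
    book_flight,
    {"origin": "Hong Kong", "destination": "Paris", "date": "2025-01-10"},
)

print(first.action, "|", first.question)
print(second.action, "|", second.arguments)
```

In a prompting-based setup the model itself would judge whether something is missing; the explicit schema check here just makes the ask-versus-call branch visible.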

If this is right

Tool-calling agents can maintain correctness on vague real-world queries without requiring perfect upfront instructions.
The same prompting pattern reduces the rate at which models invent argument values and thereby lowers hallucination risk.
An automated evaluator can now score both final accuracy and the number of user turns needed, giving a joint efficiency metric.
Benchmarks built from logged user queries provide a more realistic test distribution than synthetic clear-instruction sets.
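
One way to picture a joint accuracy and efficiency report is sketched below. The page does not describe how ToolEvaluator actually scores interactions, so `Episode`, `summarize`, and the sample logs are assumptions made for illustration, and the numbers are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One logged interaction: task outcome and number of clarifying questions."""
    success: bool
    user_turns: int


def summarize(episodes: list[Episode]) -> dict:
    """Return success rate and mean clarification turns for a non-empty log."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    return {
        "success_rate": sum(e.success for e in episodes) / n,
        "mean_user_turns": sum(e.user_turns for e in episodes) / n,
    }


# Made-up logs for illustration only.
guessing = [Episode(False, 0), Episode(True, 0), Episode(False, 0), Episode(False, 0)]
asking = [Episode(True, 1), Episode(True, 2), Episode(False, 1), Episode(True, 0)]

guess_report = summarize(guessing)
ask_report = summarize(asking)
print(guess_report)
print(ask_report)
```

Reporting the two numbers side by side, rather than folding them into one score, keeps the trade-off between correctness and user burden explicit.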

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may transfer to other interactive LLM settings such as code generation or multi-step planning where partial information is common.
If clarification turns become expensive, future work could add a cost threshold inside the decision rule to decide when to ask versus when to abort.
Real-user studies could measure whether people actually prefer to answer clarifying questions versus receiving a guessed but possibly wrong result.
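
The cost-threshold idea above can be sketched as a three-way rule. This is an editorial illustration, not something the paper proposes; `next_step`, `ask_cost`, and `budget` are hypothetical names, and a fixed per-question cost is an assumed simplification.

```python
def next_step(missing: list[str], cost_spent: float, ask_cost: float, budget: float) -> str:
    """Decide whether to call the tool, ask the user, or abort.

    missing    -- required arguments still unknown
    cost_spent -- clarification cost already charged to the user
    ask_cost   -- assumed cost of one more clarifying question
    budget     -- maximum total clarification cost that is acceptable
    """
    if not missing:
        return "call"   # nothing to clarify, run the tool
    if cost_spent + ask_cost <= budget:
        return "ask"    # one more question still fits the budget
    return "abort"      # asking is too expensive, and guessing is not allowed


print(next_step([], cost_spent=0.0, ask_cost=1.0, budget=2.0))
print(next_step(["date"], cost_spent=0.0, ask_cost=1.0, budget=2.0))
print(next_step(["date"], cost_spent=2.0, ask_cost=1.0, budget=2.0))
```

The rule never falls back to guessing: once the budget is exhausted it aborts, which matches the ask-versus-abort framing rather than reintroducing fabricated arguments.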

Load-bearing premise

That asking users for clarification is always feasible and low-cost in the actual settings where these tool agents will be deployed.

What would settle it

A deployment trial in which users are given the option to refuse or ignore clarification requests and the measured task success rate for AwN falls below the guessing baseline.

Figures

Figures reproduced from arXiv: 2409.00557 by Chaozheng Wang, Cheryl Lee, Jen-tse Huang, Juluan Shi, Michael R. Lyu, Wenxiang Jiao, Wenxuan Wang, Youliang Yuan, Yuk-Kit Chan, Zixuan Ling.

**Figure 2.** The comparison of our QwN prompting com…

**Figure 3.** Illustration of the Auto-Interaction module.

Original abstract

Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs tool-use under imperfect instructions, we meticulously examine the real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that due to the next-token prediction training objective, LLMs tend to arbitrarily generate the missed argument, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that the AwN significantly outperforms existing frameworks for tool learning in the NoisyToolBench. We will release all related code and datasets to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper creates NoisyToolBench from real queries and an Ask-when-Needed prompting method to handle unclear instructions in tool-calling LLMs, but the abstract supplies no experimental details or validation steps.

The main new pieces are the NoisyToolBench benchmark, built by looking at actual user queries and error patterns, and the Ask-when-Needed framework that tells the model to ask clarifying questions instead of guessing missing tool arguments. They also add an automated ToolEvaluator to score accuracy and efficiency without constant human checks, and they plan to release the code and data. That last part is useful for the subfield. The identification of the hallucination risk from next-token training is straightforward and on point.

The soft spots sit in the missing details. The abstract states that AwN outperforms existing frameworks on the new benchmark, yet gives no baselines, metrics, error analysis, or description of how queries were selected or how inter-annotator agreement was measured. ToolEvaluator is presented as a labor saver, but nothing is said about calibration against human ratings. If those steps are thin in the full paper, the outperformance claim stays hard to interpret. The benchmark construction and evaluator validity are the load-bearing parts here, and they are not yet shown.

This work is for researchers working on practical tool-augmented agents who run into unclear instructions in deployment. A reader in that area would get value from the benchmark and the prompting idea once the construction and results are visible. It deserves peer review because the problem is real and the artifacts are new, even if the current evidence is limited to the abstract-level claim.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces NoisyToolBench, a benchmark for LLM tool use under unclear instructions built by examining real user queries and error patterns. It proposes the Ask-when-Needed (AwN) framework, which prompts LLMs to ask clarifying questions when facing obstacles from imprecise instructions. An automated ToolEvaluator is presented to measure accuracy and efficiency without manual intervention. The central claim is that AwN significantly outperforms existing tool-learning frameworks on NoisyToolBench, with code and datasets to be released.

Significance. If the benchmark distribution matches deployment conditions and the ToolEvaluator scores align with human judgment, the work addresses a practical gap in reliable LLM tool calling by offering a lightweight prompting solution to reduce arbitrary argument generation and hallucinations. The explicit commitment to releasing code and datasets supports reproducibility.

major comments (2)

Section 3 (NoisyToolBench): The benchmark is constructed by examining real queries and error patterns, yet no quantitative details are supplied on selection criteria, inter-annotator agreement, coverage of failure modes, or how the unclear-instruction distribution was validated against deployment data. This is load-bearing for the outperformance claim, as the headline result cannot be interpreted without evidence that the benchmark reflects the target distribution.

Section 4.3 (ToolEvaluator): The automated evaluator is introduced to reduce manual labor and produce accuracy/efficiency scores, but the manuscript provides no calibration against human ratings, ablation of scoring rules, or inter-rater reliability metrics. This directly affects the validity of the experimental comparison used to support the central claim that AwN outperforms baselines.

minor comments (1)

Abstract: The abstract states the outperformance result but supplies no numerical metrics, baseline names, or statistical details; moving a concise summary of key numbers into the abstract would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The feedback highlights important aspects of transparency in benchmark construction and evaluator validation that will improve the manuscript. We address each major comment below and will incorporate revisions as described.

Referee: Section 3 (NoisyToolBench): The benchmark is constructed by examining real queries and error patterns, yet no quantitative details are supplied on selection criteria, inter-annotator agreement, coverage of failure modes, or how the unclear-instruction distribution was validated against deployment data. This is load-bearing for the outperformance claim, as the headline result cannot be interpreted without evidence that the benchmark reflects the target distribution.

Authors: We agree that additional quantitative details would strengthen interpretability of the results. The benchmark was derived from analysis of real user queries collected from public sources and internal logs, with error patterns identified through manual review of failure cases in tool-calling attempts. In the revised manuscript, we will add: (1) explicit selection criteria with counts (e.g., total queries examined and fraction retained), (2) coverage statistics for each identified failure mode, (3) inter-annotator agreement if multiple reviewers were involved in pattern categorization (or a note that primary analysis was performed by the authors with spot-checks), and (4) a direct statement on the absence of head-to-head validation against a specific production deployment distribution, along with the rationale for why the observed patterns are representative. The full dataset and annotation guidelines will be released to allow independent verification.

Revision: yes

Referee: Section 4.3 (ToolEvaluator): The automated evaluator is introduced to reduce manual labor and produce accuracy/efficiency scores, but the manuscript provides no calibration against human ratings, ablation of scoring rules, or inter-rater reliability metrics. This directly affects the validity of the experimental comparison used to support the central claim that AwN outperforms baselines.

Authors: We acknowledge the importance of validating the automated ToolEvaluator. The current implementation uses rule-based scoring derived from observable tool-call outcomes and efficiency metrics to approximate human judgment. In the revision, we will add: (1) a calibration study on a held-out subset of 100+ interactions where human raters score accuracy and efficiency, reporting correlation with ToolEvaluator scores, (2) inter-rater reliability (e.g., Cohen's kappa or percentage agreement) among human evaluators, and (3) an ablation of key scoring rules to show sensitivity. If the additional human evaluation cannot be completed before resubmission, we will clearly state this limitation and provide the raw interaction logs so readers can perform their own validation.

Revision: yes
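
The two statistics named in this response can be computed as in the following sketch. It uses the standard definitions of Pearson correlation and Cohen's kappa. The function names and the tiny rating lists are made up for illustration and are not data from the paper. Both functions assume equal-length, non-empty inputs with non-degenerate values (non-zero variance for `pearson`, chance agreement below 1 for `cohen_kappa`).

```python
from collections import Counter
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Illustrative values only.
human_scores = [1.0, 2.0, 3.0, 4.0]
evaluator_scores = [2.0, 4.0, 6.0, 8.0]
labels_a = ["pass", "pass", "fail", "fail"]
labels_b = ["pass", "fail", "fail", "fail"]

r = pearson(human_scores, evaluator_scores)
kappa = cohen_kappa(labels_a, labels_b)
print(f"pearson r = {r:.3f}, cohen kappa = {kappa:.3f}")
```

Correlation addresses whether the automated scores track human scores, while kappa addresses whether the human raters agree with each other in the first place; a calibration study needs both to be interpretable.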

Circularity Check

0 steps flagged

No circularity: purely empirical claims on new benchmark

The paper constructs a new benchmark (NoisyToolBench) from real user queries, proposes the AwN framework, introduces ToolEvaluator, and reports empirical outperformance. No equations, fitted parameters, predictions, or derivations are present. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central result is an experimental comparison whose validity depends on benchmark quality and evaluator calibration (external concerns), not on any reduction of outputs to inputs by construction. This is self-contained empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on domain assumptions about LLM prompting behavior and the fidelity of the benchmark to real usage; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: LLMs can reliably detect unclear instructions and generate useful clarification questions when prompted. This is central to the Ask-when-Needed framework functioning as described.

pith-pipeline@v0.9.0 · 5793 in / 1054 out tokens · 24907 ms · 2026-05-23T21:06:38.393420+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

$How^{2}$: How to learn from procedural How-to questions
cs.AI 2025-10 unverdicted novelty 7.0

$How^{2}$ is a memory agent framework enabling agents to ask, store, and reuse answers to how-to questions at varying abstraction levels for better lifelong planning in environments like Plancraft.
