Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.
”Evaluating the logical reasoning ability of chatgpt and gpt-4.” arXiv preprint arXiv:2304.03439 (2023)
6 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.
TaNOS decouples table semantics from numerical structure via anonymization, sketches, and program-first self-supervision, yielding 80.13% FinQA accuracy with 10% data and near-zero cross-domain gap versus over 10pp for standard fine-tuning.
A goal-oriented chatbot uses family photos to generate W-questions and open prompts for elderly reminiscence, analyzes topics to suggest follow-up photos, and supplies caregivers with conversation insights via a web portal.
citing papers explorer
-
Learning to Ask: When LLM Agents Meet Unclear Instruction
Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.