SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
CLAMBER: A benchmark of identifying and clarifying ambiguous information needs in large language models
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3representative citing papers
Frontier LLMs struggle to discriminate data uncertainty from model uncertainty even when accurate, but a new benchmark and lightweight RL strategy improve attribution without sacrificing answer accuracy.
LLMs exhibit 20-40% lower recall on ambiguous human names for PII detection, worsening under prompt injections, as shown via the new AmBench benchmark.
citing papers explorer
-
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
-
Beyond "I Don't Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty
Frontier LLMs struggle to discriminate data uncertainty from model uncertainty even when accurate, but a new benchmark and lightweight RL strategy improve attribution without sacrificing answer accuracy.
-
Can Large Language Models Really Recognize Your Name?
LLMs exhibit 20-40% lower recall on ambiguous human names for PII detection, worsening under prompt injections, as shown via the new AmBench benchmark.