NOVA: NOise-aware Verbal Confidence CAlibration for Robust Large Language Models in RAG Systems
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance especially when noisy contexts are retrieved. Specifically, contradictory or irrelevant evidence tends to exacerbate the model's overconfidence issue. To address this, we propose NOVA Rules (NOise-Aware Verbal Confidence CAlibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NOVA, a noise-aware calibration framework that synthesizes supervision from ~2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NOVA equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NOVA yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NOVA paves the way for both accurate and epistemically reliable LLMs.
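The abstract reports gains in ECE (expected calibration error) but does not define it. As a rough illustration only, the sketch below computes the commonly used binned ECE from verbalized confidences in [0, 1] and per-answer correctness labels. The function name, the equal-width binning, and the default of 10 bins are assumptions made for this example; the exact ECE variant used by NOVA is not stated in the text above.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    Assumption for this sketch: equal-width bins over [0, 1], with a
    confidence of exactly 1.0 placed in the last bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    n = len(confidences)
    conf_sum = [0.0] * n_bins  # summed confidence per bin
    acc_sum = [0.0] * n_bins   # number of correct answers per bin

    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        acc_sum[idx] += 1.0 if ok else 0.0

    # (n_b / N) * |acc_b - conf_b| simplifies to |acc_sum_b - conf_sum_b| / N.
    return sum(abs(a - c) for a, c in zip(acc_sum, conf_sum)) / n


# Hypothetical data: four answers, each stated at 90% confidence, three correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

In this toy case the model states 90% confidence but is right 75% of the time, so the gap of about 0.15 is the overconfidence that calibration training aims to shrink.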
fields: cs.CL (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
Multi-Turn Agentic Scientific Literature Search via Workflow Induction
PaperPilot induces executable DAG workflows for multi-turn literature search and trains via imitation plus preference optimization, raising Hit@5 from 58.0 to 77.0 over a baseline agent.