A distribution-free abstention rule grounded in multiple hypothesis testing uses execution consistency to let code LLMs avoid hallucination-prone tasks with theoretical guarantees.
4 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: 4 unverdicted
Citing papers explorer
- Task Abstention for Large Language Models in Code Generation
  A distribution-free abstention rule grounded in multiple hypothesis testing uses execution consistency to let code LLMs avoid hallucination-prone tasks with theoretical guarantees.
- KARL: Mitigating Hallucinations in LLMs via Knowledge-Boundary-Aware Reinforcement Learning
  KARL uses a knowledge-boundary-aware reward from within-group response statistics and two-stage RL training to align LLM abstention with actual knowledge, yielding a better accuracy-hallucination trade-off on benchmarks.
- LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations
  LoVeC uses RL to train LLMs to output verbalized numerical confidence scores for statements in long-form text, achieving better calibration than self-consistency baselines on QA datasets while being 20x faster.
- Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality
  Exploration-Commitment Decoupling instantiated as Calibration-Aware Generation improves long-form factuality by up to 13% and reduces decoding time by up to 37% on five benchmarks.