LoVeC uses RL to train LLMs to output verbalized numerical confidence scores for statements in long-form text, achieving better calibration than self-consistency baselines on QA datasets while being 20x faster.
Correcting length bias in neural machine translation
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3representative citing papers
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.
citing papers explorer
-
LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations
LoVeC uses RL to train LLMs to output verbalized numerical confidence scores for statements in long-form text, achieving better calibration than self-consistency baselines on QA datasets while being 20x faster.
-
Beyond Distribution Sharpening: The Importance of Task Rewards
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
-
Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning
Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.