ConfTuner: Training Large Language Models to Express Their Confidence Verbally
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 (4)
roles: background (1)
polarities: background (1)
representative citing papers
A composite loss with Brier calibration, anchor regularization, contrastive alignment from 2x2 perturbations, and KL stabilization reduces calibration error by over 60% in medical VQA while preserving accuracy.
CoMet decomposes MLLM uncertainty into context-specific and multiplicity-specific terms estimated by a trained post-hoc module, improving performance on open-ended multimodal benchmarks and hallucination detection.
Fine-tuning Gemma 3 4B on unfiltered self-consistency targets produces a binary verbal correctness discriminator with AUROC 0.774 on TriviaQA, outperforming logit entropy after a modal-filtered pre-registration failed.
citing papers explorer
- Task-Aware Calibration: Provably Optimal Decoding in LLMs
  Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
- Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA
  A composite loss with Brier calibration, anchor regularization, contrastive alignment from 2x2 perturbations, and KL stabilization reduces calibration error by over 60% in medical VQA while preserving accuracy.
- CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation
  CoMet decomposes MLLM uncertainty into context-specific and multiplicity-specific terms estimated by a trained post-hoc module, improving performance on open-ended multimodal benchmarks and hallucination detection.
- Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B
  Fine-tuning Gemma 3 4B on unfiltered self-consistency targets produces a binary verbal correctness discriminator with AUROC 0.774 on TriviaQA, outperforming logit entropy after a modal-filtered pre-registration failed.