Calibrating long-form generations from large language models

Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, Bhuwan Dhingra · 2024 · DOI 10.18653/v1/2024.findings-emnlp.785

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

RLMF uses the quality of model self-judgments to refine RL rankings and select training data, achieving state-of-the-art faithful calibration while preserving accuracy and outperforming standard RL by up to 63%.

PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

PDDL-Mind improves LLM accuracy on theory-of-mind benchmarks by over 5% by translating stories into verifiable PDDL states that decouple environment tracking from belief inference.

Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation

cs.CV · 2026-04-02 · conditional · novelty 6.0

Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.

LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations

cs.CL · 2025-05-29 · unverdicted · novelty 6.0

LoVeC uses RL to train LLMs to output verbalized numerical confidence scores for statements in long-form text, achieving better calibration than self-consistency baselines on QA datasets while being 20x faster.

citing papers explorer

LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations

cs.CL · 2025-05-29 · unverdicted · none · ref 59