Calibrating long-form generations from large language models

Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, Bhuwan Dhingra · 2024 · DOI 10.18653/v1/2024.findings-emnlp.785

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

PDDL-Mind improves LLM accuracy on theory-of-mind benchmarks by over 5% by translating stories into verifiable PDDL states that decouple environment tracking from belief inference.

Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation

cs.CV · 2026-04-02 · conditional · novelty 6.0

Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.

LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations

cs.CL · 2025-05-29 · unverdicted · novelty 6.0

LoVeC uses RL to train LLMs to output verbalized numerical confidence scores for statements in long-form text, achieving better calibration than self-consistency baselines on QA datasets while being 20x faster.

citing papers explorer

Showing 3 of 3 citing papers.

PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking
cs.CL · 2026-04-20 · unverdicted · none · ref 76
Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation
cs.CV · 2026-04-02 · conditional · none · ref 13
LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations
cs.CL · 2025-05-29 · unverdicted · none · ref 59
