Confidence Should Be Calibrated More Than One Turn Deep
Pith reviewed 2026-05-10 19:32 UTC · model grok-4.3
The pith
LLM confidence must be calibrated dynamically at each conversation turn rather than once.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality,
What carries the argument
MTCal, a training procedure that minimises Expected Calibration Error at turn T by optimising against a surrogate calibration target conditioned on conversation history.
If this is right
- User feedback such as persuasion increases calibration error across successive turns.
- MTCal keeps ECE@T low in multi-turn settings where single-turn methods degrade.
- ConfChat decoding that uses the calibrated scores raises both factuality and consistency of generated answers.
- The combination supports reliable LLM use in extended interactions in domains such as healthcare and education.
Where Pith is reading between the lines
- Calibration routines may need to be rerun or adapted whenever the style of user input changes substantially.
- The same dynamic framing could apply to other measures of model uncertainty in long dialogues.
- Deployment pipelines could add a lightweight calibration check after each user turn before accepting the next response.
- One testable extension is whether the surrogate target still works when the underlying model is updated or fine-tuned on new data.
Load-bearing premise
The surrogate calibration target chosen for MTCal will continue to match actual user feedback patterns in new settings without creating fresh biases or needing repeated retuning.
What would settle it
Apply MTCal to a held-out set of conversations that include feedback patterns absent from the original training data, such as repeated factual challenges, then measure whether ECE@T rises above single-turn baselines.
Figures
read the original abstract
Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality and consistency of the model response in multi-turn interactions. Extensive experiments demonstrate that MT-Cal achieves outstanding and consistent performance in multi-turn calibration, and ConfChat preserves and even enhances model performance in multi-turn interactions. Our results mark multi-turn calibration as one missing link for scaling LLM calibration toward safe, reliable, and real-world use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the task of multi-turn calibration for LLMs, reframing calibration as a dynamic process conditioned on conversation history. It defines ECE@T to track calibration error over turns and demonstrates that user feedback (e.g., persuasion) can degrade performance. To address this, MTCal minimizes ECE@T via a surrogate calibration target, while ConfChat uses the resulting calibrated confidence scores as a decoding strategy to improve factuality and consistency. Extensive experiments are claimed to show that MTCal achieves outstanding and consistent multi-turn calibration and that ConfChat preserves or enhances model performance.
Significance. If the central claims hold, this work addresses a genuine gap by extending single-turn calibration techniques to conversational settings critical for high-stakes applications. The new ECE@T metric and the two proposed methods (MTCal and ConfChat) provide concrete tools for handling history-dependent degradation. The reported extensive experiments and consistent gains across settings are a positive feature that, if reproducible, would strengthen the case for treating multi-turn calibration as a distinct research direction.
major comments (2)
- [Methods (MTCal)] Methods section on MTCal: the surrogate calibration target used to minimize ECE@T is constructed from simulated user feedback (primarily persuasion scenarios). No theoretical argument or cross-validation is provided showing that this surrogate preserves calibration properties under other history-conditioned degradations such as factual corrections, topic drift, or clarification requests; this directly undermines the claim that MTCal delivers general multi-turn calibration improvements.
- [Experiments] Experiments and results: the abstract asserts 'outstanding and consistent performance' for MTCal on ECE@T and downstream gains for ConfChat, yet the support rests on experiments whose data splits, baseline implementations, statistical tests, and effect sizes are not detailed enough in the manuscript to verify robustness. If the surrogate is narrowly tuned, the reported ECE@T reductions may not transfer.
minor comments (2)
- [Abstract] Abstract: inconsistent naming ('MTCal' vs 'MT-Cal') should be unified throughout the paper.
- [Introduction] Notation: ECE@T is introduced without an explicit equation in the early sections; adding a formal definition (e.g., as an expectation over turns) would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of MTCal and the experimental details.
read point-by-point responses
-
Referee: [Methods (MTCal)] Methods section on MTCal: the surrogate calibration target used to minimize ECE@T is constructed from simulated user feedback (primarily persuasion scenarios). No theoretical argument or cross-validation is provided showing that this surrogate preserves calibration properties under other history-conditioned degradations such as factual corrections, topic drift, or clarification requests; this directly undermines the claim that MTCal delivers general multi-turn calibration improvements.
Authors: We acknowledge that the surrogate target is derived from persuasion-based simulations, which we selected as a representative and high-impact form of history-dependent degradation. The MTCal objective itself is formulated generally as minimization of ECE@T, but we agree that the current manuscript lacks explicit cross-validation or theoretical justification for transfer to other feedback types. In the revision we will add a new subsection discussing the design rationale for the surrogate, include additional experiments on factual corrections and topic drift, and report calibration performance under those conditions to better substantiate generality. revision: yes
-
Referee: [Experiments] Experiments and results: the abstract asserts 'outstanding and consistent performance' for MTCal on ECE@T and downstream gains for ConfChat, yet the support rests on experiments whose data splits, baseline implementations, statistical tests, and effect sizes are not detailed enough in the manuscript to verify robustness. If the surrogate is narrowly tuned, the reported ECE@T reductions may not transfer.
Authors: We agree that the current experimental description is insufficient for full reproducibility and robustness assessment. In the revised manuscript we will expand the Experiments section with: explicit data-split construction and sizes, complete baseline implementation details (including hyper-parameters), statistical significance tests with p-values, and effect-size reporting for all ECE@T and downstream metrics. We will also add a sensitivity analysis varying the surrogate construction to address transfer concerns, and will moderate the abstract language to align with the added evidence. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper defines a new task (multi-turn calibration) and metric (ECE@T) then proposes MTCal to minimize that metric via a surrogate target, with downstream use in ConfChat. No quoted derivation step reduces the claimed result to its own inputs by construction, no fitted parameter is relabeled as a prediction, and no load-bearing self-citation or imported uniqueness theorem is required. The central claims rest on the introduction of the metric plus empirical minimization, which are independent of the outputs they produce. This is the normal non-circular case for a methods paper that introduces and evaluates a new objective.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM confidence scores are comparable to empirical accuracy in a way that allows ECE-style metrics to be computed turn-by-turn.
Reference graph
Works this paper leans on
-
[1]
Teaching models to balance resisting and ac- cepting persuasion. InProceedings of the 2025 Con- ference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8108–8122. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupati...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Qwen3 technical report.arXiv preprint arXiv:2505.09388. Dongkeun Yoon, Seungone Kim, Sohee Yang, Sunky- oung Kim, Soyeon Kim, Yongil Kim, Eunbi Choi, Yireun Kim, and Minjoon Seo. 2025. Reason- ing models better express their confidence.arXiv preprint arXiv:2505.14489. Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Au- rojit Panda, Jinyang Li, and He He. 202...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
InFindings of the Association for Computational Linguistics ACL 2024, pages 8702–8718
Fact-and-reflection (far) improves confidence calibration of large language models. InFindings of the Association for Computational Linguistics ACL 2024, pages 8702–8718. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, and 1 others
work page 2024
-
[4]
Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information pro- cessing systems, 36:46595–46623. Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine A Heller, and Subhrajit Roy. Batch calibration: Rethinking calibration for in- context learning and prompt engineering. InThe Twelfth International Conference on...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.