Confidence Should Be Calibrated More Than One Turn Deep

Chao Shen; Chengzhengxu Li; Ioannis Patras; Xiaoming Liu; Zhaohan Zhang; Ziquan Liu

arxiv: 2604.05397 · v1 · submitted 2026-04-07 · 💻 cs.CL

Zhaohan Zhang, Chengzhengxu Li, Xiaoming Liu, Chao Shen, Ziquan Liu, Ioannis Patras

Pith reviewed 2026-05-10 19:32 UTC · model grok-4.3

classification 💻 cs.CL

keywords multi-turn calibration · LLM confidence estimation · Expected Calibration Error · ECE@T · MTCal · ConfChat · factuality · multi-turn conversations

The pith

LLM confidence must be calibrated dynamically at each conversation turn rather than once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that calibration of model confidence, which is usually handled as a fixed property for single questions, becomes unreliable in back-and-forth conversations because user replies can push the model into worse calibration over time. It defines a new metric, Expected Calibration Error at turn T, to track this drift and shows that common inputs like persuasion make the error grow. MTCal counters the drift by training against a surrogate target that approximates the needed adjustment at every step, while ConfChat feeds the resulting scores into decoding to keep answers more accurate and coherent. A reader would care because most practical LLM use happens in extended chats rather than isolated queries. If the claim holds, it gives a direct way to keep calibration useful when models stay in dialogue.
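
To make the metric concrete, here is a minimal sketch of how ECE@T could be computed. It assumes ECE@T is the standard equal-width binned ECE evaluated separately on the responses produced at each turn; the paper's exact definition is not reproduced on this page, so the binning scheme, the `TurnRecord` type, and the function names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class TurnRecord(NamedTuple):
    """One model answer inside a conversation (hypothetical record layout)."""
    turn: int          # 1-based turn index within the conversation
    confidence: float  # model confidence in [0, 1] for its answer at this turn
    correct: bool      # whether the answer at this turn was judged correct


def ece(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    """Equal-width binned Expected Calibration Error, as a fraction in [0, 1]."""
    if not confidences:
        raise ValueError("need at least one prediction")
    bins: dict[int, list[tuple[float, bool]]] = defaultdict(list)
    for c, y in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    total = len(confidences)
    error = 0.0
    for members in bins.values():
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        # Weight each bin's |confidence - accuracy| gap by its share of samples.
        error += (len(members) / total) * abs(avg_conf - accuracy)
    return error


def ece_at_turn(records: Iterable[TurnRecord], n_bins: int = 10) -> dict[int, float]:
    """Group records by turn index and compute ECE separately for each turn."""
    by_turn: dict[int, list[TurnRecord]] = defaultdict(list)
    for r in records:
        by_turn[r.turn].append(r)
    return {
        t: ece([r.confidence for r in rs], [r.correct for r in rs], n_bins)
        for t, rs in sorted(by_turn.items())
    }
```

A rising value across turns in the returned dictionary would correspond to the drift described above. The function returns fractions; multiply by 100 if percentages are preferred.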

Core claim

We introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality and consistency of the model response in multi-turn interactions.

What carries the argument

MTCal, a training procedure that minimises Expected Calibration Error at turn T by optimising against a surrogate calibration target conditioned on conversation history.

If this is right

User feedback such as persuasion increases calibration error across successive turns.
MTCal keeps ECE@T low in multi-turn settings where single-turn methods degrade.
ConfChat decoding that uses the calibrated scores raises both factuality and consistency of generated answers.
The combination supports reliable LLM use in extended interactions in domains such as healthcare and education.
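
The Figure 3 caption describes ConfChat as selecting the highest-scoring token in the first turn and, in later turns, aggregating candidate sets built from the first-turn and current-turn inputs. The sketch below illustrates that idea for a single decoding step. The rescaling rule (multiplying by a calibrated confidence) and the aggregation rule (summing per token) are assumptions chosen for illustration, not the paper's formulas, and `rescale` and `select_token` are hypothetical names.

```python
from typing import Optional


def rescale(scores: dict[str, float], confidence: float) -> dict[str, float]:
    """Weight each candidate's generation score by a calibrated confidence.

    Assumed rule: plain multiplication. The paper's rescaling may differ.
    """
    return {token: score * confidence for token, score in scores.items()}


def select_token(
    first_scores: dict[str, float],
    first_confidence: float,
    current_scores: Optional[dict[str, float]] = None,
    current_confidence: float = 1.0,
) -> str:
    """Pick one token for the current decoding step.

    First turn: only first_scores is given, so the top rescaled token wins.
    Later turns: candidates from the first-turn context and the current-turn
    context are merged by summing their rescaled scores (assumed aggregation).
    """
    combined = rescale(first_scores, first_confidence)
    if current_scores is not None:
        for token, score in rescale(current_scores, current_confidence).items():
            combined[token] = combined.get(token, 0.0) + score
    return max(combined, key=lambda token: combined[token])


# After a persuasive follow-up the current-turn context prefers "Lyon", but the
# model was confident in turn one and is not confident now, so "Paris" is kept.
choice = select_token(
    first_scores={"Paris": 0.7, "Lyon": 0.3},
    first_confidence=0.9,
    current_scores={"Paris": 0.4, "Lyon": 0.6},
    current_confidence=0.3,
)
print(choice)  # Paris
```

The design point this illustrates is that a well-calibrated first-turn confidence can act as an anchor: a low-confidence change of mind under pressure is outweighed, while a high-confidence correction would still be able to override the first answer.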

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Calibration routines may need to be rerun or adapted whenever the style of user input changes substantially.
The same dynamic framing could apply to other measures of model uncertainty in long dialogues.
Deployment pipelines could add a lightweight calibration check after each user turn before accepting the next response.
One testable extension is whether the surrogate target still works when the underlying model is updated or fine-tuned on new data.

Load-bearing premise

The surrogate calibration target chosen for MTCal will continue to match actual user feedback patterns in new settings without creating fresh biases or needing repeated retuning.

What would settle it

Apply MTCal to a held-out set of conversations that include feedback patterns absent from the original training data, such as repeated factual challenges, then measure whether ECE@T rises above single-turn baselines.

Figures

Figures reproduced from arXiv: 2604.05397 by Chao Shen, Chengzhengxu Li, Ioannis Patras, Xiaoming Liu, Zhaohan Zhang, Ziquan Liu.

**Figure 1.** LLMs are prone to change their responses with confidence when challenged. The figure in the bottom left is the reliability diagram for confidence at the first turn. The figure in the bottom right is the reliability diagram for confidence at the second turn.

**Figure 2.** (a) Changes in the reliability diagram from the initial response to the subsequent reply after receiving critical follow-up messages for Llama3.1-8B-Instruct. The diagrams above the arrows correspond to the first turn, while those below represent the second turn. (b) The analysis of answer changes with the change of model confidence. Correct → Correct: The answer remains correct. Correct → Incorrect: The c…

**Figure 3.** The framework of the ConfChat process. (a) In the first turn, the token with the highest rescaled generation score is selected at each decoding step. (b) In subsequent turns, candidate token sets are generated based on both the first-turn and current-turn inputs, and the two candidate sets are aggregated to select the token with the highest overall generation score. Ti is the number of conversation turns in th…

**Figure 4.** Domain generalization on Llama-8B-Instruct. OOD denotes the out-of-domain setting, IND denotes the in-domain setting, and PS refers to Platt Scaling.

**Figure 5.** The comparison of change in accuracy of Llama3.1-8B-Instruct in different conversation rounds between ConfChat and other strategies. Our method ConfChat keeps a relatively stable accuracy across turns.

| Model | Method | ECE@1 | ECE@2 | ECE@3 | ECE@4 | ECE@5 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama3.1-8B-Instruct | SL | 7.56 | 8.13 | 8.25 | 7.00 | 9.42 |
| Llama3.1-8B-Instruct | Apricot | 6.03 | 5.18 | 7.12 | 7.33 | 6.26 |
| Llama3.1-8B-Instruct | MTCal | 5.04 | 2.31 | 5.91 | 5.01 | 6.08 |
| Qwen2.5-7B-Instruct | SL | 23.58 | 24.65 | 23.20 | 20.38 | 22.89 |
| Qwen2.5-7B-Instruct | Apri… | … | … | … | … | … |

**Figure 7.** Domain generalization on Gemma2-9B-it. OOD denotes the out-of-domain setting, IND denotes the in-domain setting, and PS refers to Platt Scaling.

Verbal (Tian et al., 2023): prompts the model to give a verbalized confidence in its response. P(True) (Kadavath et al., 2022): asks the model whether or not its response is true and uses the probability of predicting true as the confidence measure.

**Figure 8.** The change of ECE@T during conversation for different confidence estimations. Our method MTCal consistently outperforms the comparison methods across turns. [Plot text: panels TriviaQA, SciQ, NQ; x-axis Turn (1 to 5); y-axis Accuracy (%); labels ACC, RP, CARG, ConfChat.]

**Figure 9.** The comparison of change in response accuracy of Qwen2.5-7B-Instruct in different conversation rounds between ConfChat and other strategies. Our method ConfChat keeps a relatively stable accuracy across turns. [Plot text: panels TriviaQA, SciQ, NQ; x-axis Turn (1 to 5); y-axis Accuracy (%); labels ACC, RP, CARG, ConfChat.]

**Figure 10.** The comparison of change in response accuracy of Gemma2-9B-it in different conversation rounds between ConfChat and other strategies. Our method ConfChat keeps a relatively stable accuracy across turns.

Original abstract

Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality and consistency of the model response in multi-turn interactions. Extensive experiments demonstrate that MT-Cal achieves outstanding and consistent performance in multi-turn calibration, and ConfChat preserves and even enhances model performance in multi-turn interactions. Our results mark multi-turn calibration as one missing link for scaling LLM calibration toward safe, reliable, and real-world use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper frames calibration as a dynamic multi-turn problem with ECE@T, MTCal, and ConfChat, but the surrogate target looks like the main point that needs more checking.

The paper's main contribution is shifting the focus of LLM calibration from single responses to entire conversations. The authors point out that existing methods don't account for how confidence can drift as users provide feedback over multiple turns. To measure this, they define ECE@T, which tracks calibration error at each turn in the dialogue. Their experiments indicate that inputs such as persuasive user messages can worsen calibration as the conversation goes on. They address this with MTCal, a training approach that minimizes ECE@T through a surrogate calibration target instead of direct optimization, which is a sensible way to make the objective trainable. They also introduce ConfChat, a decoding method that uses the calibrated confidence scores to guide generation, leading to better factuality and consistency according to their tests. This is useful because it directly tackles a limitation of the current calibration literature for real applications where interactions are ongoing. The methods are practical and, based on the reported results, seem to deliver gains without major overhead.

That said, the soft spots center on how well the surrogate target captures actual conversation dynamics. The approach assumes this proxy will work across different user behaviors, but if the experiments primarily use simulated persuasion, it might not extend to other common patterns such as users correcting inaccuracies or shifting topics. This could mean the performance improvements are narrower than claimed, and additional tuning might be needed for new domains. I'd also like to see more detail on the statistical significance of the gains and on how the baselines were selected, to make sure the comparison isn't overly favorable.

Overall, this is relevant for anyone working on reliable multi-turn LLM systems in fields like healthcare or education. It has enough novelty and experimental support to warrant peer review, though the authors should strengthen the case for the surrogate's robustness in revisions.

Referee Report

2 major / 2 minor

Summary. The paper introduces the task of multi-turn calibration for LLMs, reframing calibration as a dynamic process conditioned on conversation history. It defines ECE@T to track calibration error over turns and demonstrates that user feedback (e.g., persuasion) can degrade performance. To address this, MTCal minimizes ECE@T via a surrogate calibration target, while ConfChat uses the resulting calibrated confidence scores as a decoding strategy to improve factuality and consistency. Extensive experiments are claimed to show that MTCal achieves outstanding and consistent multi-turn calibration and that ConfChat preserves or enhances model performance.

Significance. If the central claims hold, this work addresses a genuine gap by extending single-turn calibration techniques to conversational settings critical for high-stakes applications. The new ECE@T metric and the two proposed methods (MTCal and ConfChat) provide concrete tools for handling history-dependent degradation. The reported extensive experiments and consistent gains across settings are a positive feature that, if reproducible, would strengthen the case for treating multi-turn calibration as a distinct research direction.

major comments (2)

[Methods (MTCal)] Methods section on MTCal: the surrogate calibration target used to minimize ECE@T is constructed from simulated user feedback (primarily persuasion scenarios). No theoretical argument or cross-validation is provided showing that this surrogate preserves calibration properties under other history-conditioned degradations such as factual corrections, topic drift, or clarification requests; this directly undermines the claim that MTCal delivers general multi-turn calibration improvements.
[Experiments] Experiments and results: the abstract asserts 'outstanding and consistent performance' for MTCal on ECE@T and downstream gains for ConfChat, yet the support rests on experiments whose data splits, baseline implementations, statistical tests, and effect sizes are not detailed enough in the manuscript to verify robustness. If the surrogate is narrowly tuned, the reported ECE@T reductions may not transfer.

minor comments (2)

[Abstract] Abstract: inconsistent naming ('MTCal' vs 'MT-Cal') should be unified throughout the paper.
[Introduction] Notation: ECE@T is introduced without an explicit equation in the early sections; adding a formal definition (e.g., as an expectation over turns) would improve clarity.
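
As an illustration of the kind of formal definition the second minor comment asks for, one plausible form is the standard binned ECE restricted to the responses given at turn $T$. This is an assumption about the intended definition, not the paper's equation, and the symbols $B_m^{(T)}$, $N_T$, $c_i^{(T)}$, $y_i^{(T)}$, and $h_i^{(<T)}$ are introduced here only for the sketch:

```latex
\mathrm{ECE@}T \;=\; \sum_{m=1}^{M} \frac{\bigl|B_m^{(T)}\bigr|}{N_T}\,
\Bigl|\, \mathrm{acc}\bigl(B_m^{(T)}\bigr) - \mathrm{conf}\bigl(B_m^{(T)}\bigr) \Bigr|,
\qquad
\mathrm{acc}\bigl(B_m^{(T)}\bigr) = \frac{1}{\bigl|B_m^{(T)}\bigr|} \sum_{i \in B_m^{(T)}} y_i^{(T)},
\quad
\mathrm{conf}\bigl(B_m^{(T)}\bigr) = \frac{1}{\bigl|B_m^{(T)}\bigr|} \sum_{i \in B_m^{(T)}} c_i^{(T)}
```

Here $N_T$ is the number of conversations that reach turn $T$, $c_i^{(T)} \in [0,1]$ is the confidence assigned to the turn-$T$ answer of conversation $i$ given its history $h_i^{(<T)}$, $y_i^{(T)} \in \{0,1\}$ indicates whether that answer is correct, and $B_m^{(T)}$ is the set of conversations whose $c_i^{(T)}$ falls in the $m$-th of $M$ confidence bins. Setting $T = 1$ recovers the usual single-turn ECE.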

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the presentation of MTCal and the experimental details.

Referee: [Methods (MTCal)] Methods section on MTCal: the surrogate calibration target used to minimize ECE@T is constructed from simulated user feedback (primarily persuasion scenarios). No theoretical argument or cross-validation is provided showing that this surrogate preserves calibration properties under other history-conditioned degradations such as factual corrections, topic drift, or clarification requests; this directly undermines the claim that MTCal delivers general multi-turn calibration improvements.

Authors: We acknowledge that the surrogate target is derived from persuasion-based simulations, which we selected as a representative and high-impact form of history-dependent degradation. The MTCal objective itself is formulated generally as minimization of ECE@T, but we agree that the current manuscript lacks explicit cross-validation or theoretical justification for transfer to other feedback types. In the revision we will add a new subsection discussing the design rationale for the surrogate, include additional experiments on factual corrections and topic drift, and report calibration performance under those conditions to better substantiate generality.
revision: yes

Referee: [Experiments] Experiments and results: the abstract asserts 'outstanding and consistent performance' for MTCal on ECE@T and downstream gains for ConfChat, yet the support rests on experiments whose data splits, baseline implementations, statistical tests, and effect sizes are not detailed enough in the manuscript to verify robustness. If the surrogate is narrowly tuned, the reported ECE@T reductions may not transfer.

Authors: We agree that the current experimental description is insufficient for full reproducibility and robustness assessment. In the revised manuscript we will expand the Experiments section with: explicit data-split construction and sizes, complete baseline implementation details (including hyper-parameters), statistical significance tests with p-values, and effect-size reporting for all ECE@T and downstream metrics. We will also add a sensitivity analysis varying the surrogate construction to address transfer concerns, and will moderate the abstract language to align with the added evidence.
revision: yes
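
The rebuttal does not say which significance test will be used. One common option for comparing two confidence estimators on the same set of answers is a paired bootstrap over conversations, sketched below for a single turn. It assumes both methods score identical model answers (so correctness labels are shared) and uses equal-width binned ECE; the function names and defaults are illustrative, not taken from the paper.

```python
import random
from typing import Sequence


def ece(conf: Sequence[float], correct: Sequence[bool], n_bins: int = 10) -> float:
    """Equal-width binned Expected Calibration Error (fraction in [0, 1])."""
    stats = [[0.0, 0, 0] for _ in range(n_bins)]  # per bin: conf sum, hits, count
    for c, y in zip(conf, correct):
        b = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        stats[b][0] += c
        stats[b][1] += int(y)
        stats[b][2] += 1
    n = len(conf)
    return sum(abs(s / k - h / k) * k / n for s, h, k in stats if k)


def paired_bootstrap_ece_gap(
    conf_a: Sequence[float],
    conf_b: Sequence[float],
    correct: Sequence[bool],
    n_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (observed gap, 2.5th percentile, 97.5th percentile) of ECE(A) - ECE(B).

    Items are resampled with replacement, and the same resampled indices are
    used for both methods so that the comparison stays paired.
    """
    n = len(correct)
    if n == 0 or not (len(conf_a) == len(conf_b) == n):
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    observed = ece(conf_a, correct) - ece(conf_b, correct)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [correct[i] for i in idx]
        gaps.append(ece([conf_a[i] for i in idx], y) - ece([conf_b[i] for i in idx], y))
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[min(n_resamples - 1, int(0.975 * n_resamples))]
    return observed, lower, upper
```

If the interval returned for a given turn excludes zero, the ECE gap between the two methods at that turn is unlikely to be a resampling artifact; the observed gap itself serves as a simple effect size. Running this once per turn gives a per-turn picture that matches the ECE@T framing.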

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines a new task (multi-turn calibration) and metric (ECE@T) then proposes MTCal to minimize that metric via a surrogate target, with downstream use in ConfChat. No quoted derivation step reduces the claimed result to its own inputs by construction, no fitted parameter is relabeled as a prediction, and no load-bearing self-citation or imported uniqueness theorem is required. The central claims rest on the introduction of the metric plus empirical minimization, which are independent of the outputs they produce. This is the normal non-circular case for a methods paper that introduces and evaluates a new objective.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard LLM calibration assumptions (e.g., that confidence scores can be meaningfully compared to accuracy) plus the novel framing that history-conditioned calibration is necessary; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)

Domain assumption: LLM confidence scores are comparable to empirical accuracy in a way that allows ECE-style metrics to be computed turn by turn.
Invoked when defining ECE@T and the surrogate target for MTCal.

pith-pipeline@v0.9.0 · 5551 in / 1344 out tokens · 50729 ms · 2026-05-10T19:32:25.270429+00:00 · methodology
