Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning
Pith reviewed 2026-05-08 04:33 UTC · model grok-4.3
The pith
In lifelong fine-tuning of large language models, conformal coverage degrades substantially faster than accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the classification-style continual learning settings studied, the drop in conformal coverage exceeds the drop in accuracy by a factor of roughly 3.4× on average across seeds. Calibration replay, which stores a modest held-out buffer per task and recomputes a task-specific conformal threshold after each model update, restores coverage to within two points of nominal at buffer size m=200 while adding no gradient cost during training and using less than one percent of the memory of ordinary experience replay.
What carries the argument
Calibration replay, a post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update.
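The procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `CalibrationReplay`, its method names, and the `predict_proba` callable are hypothetical, and the nonconformity score used here (one minus the probability assigned to the true label) is a common choice that may differ from the score used in the paper.

```python
import math


def conformal_threshold(scores, alpha):
    """Return the ceil((m + 1) * (1 - alpha))-th smallest calibration score."""
    ordered = sorted(scores)
    m = len(ordered)
    k = math.ceil((m + 1) * (1 - alpha))
    # Buffer too small for the requested level: accept every label.
    return math.inf if k > m else ordered[k - 1]


class CalibrationReplay:
    """Hypothetical helper: one held-out buffer and one threshold per task."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.buffers = {}     # task_id -> list of (input, label), never trained on
        self.thresholds = {}  # task_id -> current conformal threshold

    def add_task(self, task_id, examples):
        self.buffers[task_id] = list(examples)

    def refit(self, predict_proba):
        """Recompute every task threshold under the current model.

        predict_proba(x) is assumed to return a list of class probabilities.
        """
        for task_id, examples in self.buffers.items():
            # Assumed score: one minus the probability of the true label.
            scores = [1.0 - predict_proba(x)[y] for x, y in examples]
            self.thresholds[task_id] = conformal_threshold(scores, self.alpha)

    def predict_set(self, task_id, probs):
        """Labels whose score does not exceed the task's threshold."""
        q = self.thresholds[task_id]
        return [label for label, p in enumerate(probs) if 1.0 - p <= q]
```

In this sketch `refit` would be called once after each fine-tuning step. It only runs forward passes over the stored buffers, which is consistent with the claim that the method adds no gradient cost during training.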
If this is right
- Accuracy-focused continual learning methods leave coverage unprotected.
- Pooled thresholds across tasks do not maintain validity because of distribution drift between tasks.
- Calibration replay restores coverage without any training-time gradient cost.
- The finite-sample theorem gives exact validity when exchangeability holds within each task buffer.
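The last point can be made concrete with the standard split-conformal bound. The statement below is the textbook form, not a transcription of the paper's theorem, and the symbols \(s_i\), \(\hat q\), and \(\alpha\) are notation assumed here. Let \(s_1, \dots, s_m\) be the nonconformity scores of one task's buffer under the current model, let \(s_{m+1}\) be the score of a test point from the same task, and let \(\hat q\) be the \(\lceil (m+1)(1-\alpha) \rceil\)-th smallest of \(s_1, \dots, s_m\), assuming \(\lceil (m+1)(1-\alpha) \rceil \le m\). If the \(m+1\) scores are exchangeable, then

\[
\Pr\left(s_{m+1} \le \hat q\right) \ge 1 - \alpha ,
\]

and if, in addition, ties among the scores occur with probability zero,

\[
\Pr\left(s_{m+1} \le \hat q\right) \le 1 - \alpha + \frac{1}{m+1} .
\]

At \(m = 200\) the slack term is \(1/201 \approx 0.005\). Both bounds are marginal over the draw of the buffer, so coverage conditional on one particular buffer can still deviate from the nominal level.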
Where Pith is reading between the lines
- Lifelong systems may need separate maintenance for calibration that is independent of accuracy preservation.
- Similar coverage collapse could appear in open-ended generation tasks, though the paper treats those extensions as exploratory.
- Alternative nonconformity scores might change the observed 3.4× degradation factor and should be compared directly.
Load-bearing premise
The finite-sample recovery theorem and the observed degradation factor both assume that task-specific buffers remain exchangeable with test points after later model updates and that the chosen tasks and score functions are representative.
What would settle it
Repeating the fine-tuning sequences on additional benchmarks or models and finding that average coverage loss is no larger than accuracy loss, or that calibration replay at m=200 fails to restore coverage, would falsify the central empirical claim.
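The quantity at stake is a simple ratio, sketched below. The function names are hypothetical, and averaging per-run ratios is an assumption, since the page does not say how the paper aggregates across seeds. The coverage figures in the worked example come from the abstract's most pronounced case; the accuracy pair is invented to represent a full three-point drop.

```python
def loss_ratio(cov_before, cov_after, acc_before, acc_after):
    """Coverage drop divided by accuracy drop for one run."""
    cov_loss = cov_before - cov_after
    acc_loss = acc_before - acc_after
    if acc_loss <= 0:
        raise ValueError("accuracy did not drop, so the ratio is undefined")
    return cov_loss / acc_loss


def mean_loss_ratio(runs):
    """Average the per-run ratios; runs is a list of 4-tuples as above."""
    ratios = [loss_ratio(*run) for run in runs]
    return sum(ratios) / len(ratios)


# Coverage 0.92 -> 0.61 is from the abstract; accuracy 0.80 -> 0.77 is a
# hypothetical pair standing in for "within three points of baseline".
worst_case = loss_ratio(0.92, 0.61, 0.80, 0.77)

# The falsification test described above: does the ratio stay above one?
claim_survives = worst_case > 1.0
```

With a coverage drop of 0.31 and an accuracy drop of at most 0.03, the ratio in that case is at least about 10, well above the reported average of roughly 3.4×.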
Original abstract
Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more sharply than top-1 performance. We study this empirically by measuring conformal coverage and calibration error on sequentially fine-tuned models across three model families and eight task sequences drawn primarily from classification and multiple-choice benchmarks. Across the classification-style settings we study, coverage loss exceeds accuracy loss by a factor of roughly \(3.4\times \pm 0.5\times\) on average across seeds; in the most pronounced case, coverage drops from \(0.92\) to \(0.61\), while accuracy remains within three points of baseline. Standard continual-learning methods that preserve accuracy do not automatically preserve coverage, and naive calibration baselines recover only part of the gap. We propose calibration replay, a lightweight post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update. It adds no training-time gradient cost, uses less than one percent of the memory of ordinary experience replay, and typically restores coverage to within two points of nominal at buffer size \(m = 200\). We accompany the empirical study with a drift decomposition, a finite-sample recovery theorem showing exact conformal validity under exchangeability, and a mixture-validity proposition explaining why pooled thresholds do not suffice. Our guarantees are stated for classification-style tasks with task-specific buffers; extensions to open-ended generation are exploratory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that conformal coverage in LLMs degrades substantially earlier and more sharply than top-1 accuracy during sequential fine-tuning. Across three model families and eight task sequences (primarily classification and multiple-choice), it reports that coverage loss exceeds accuracy loss by a factor of roughly 3.4× ± 0.5× on average across seeds, with the most pronounced case showing coverage falling from 0.92 to 0.61 while accuracy stays within three points of baseline. Standard continual-learning methods fail to preserve coverage, and the authors propose calibration replay: a post-hoc procedure that maintains task-specific held-out buffers and refits task-specific conformal thresholds under the current model after each update, typically restoring coverage to within two points of nominal at buffer size m=200. This is supported by a drift decomposition, a finite-sample recovery theorem establishing exact validity under exchangeability with task-specific buffers, and a mixture-validity proposition explaining the failure of pooled thresholds. The method adds no training-time gradient cost and uses less than 1% of the memory of ordinary experience replay.
Significance. If the patterns hold, the work is significant for showing that accuracy-centric continual learning evaluations are incomplete for LLMs, as uncertainty reliability can fail first. The calibration replay method is practical due to its negligible overhead. Credit is due for the finite-sample theorem providing exact guarantees (rather than asymptotic) and the mixture-validity proposition, which are load-bearing strengths. This could prompt reevaluation of continual fine-tuning benchmarks to include calibration metrics.
major comments (1)
- [Abstract and empirical evaluation] The headline 3.4× ± 0.5× coverage-to-accuracy loss ratio is obtained with particular nonconformity scores and the eight chosen task sequences. The manuscript does not report ablations that vary the score function (e.g., softmax margin vs. negative log-likelihood or entropy) or broaden the task distribution, so it is unclear whether the multiplier is a stable property of continual LLM fine-tuning or an artifact of the experimental choices. This directly affects the generality of the central empirical claim.
minor comments (2)
- [Theorem and proposition statements] The finite-sample recovery theorem and mixture-validity proposition are stated for classification-style tasks with task-specific buffers; the manuscript should explicitly note the scope limitation and any exploratory status for open-ended generation in the theorem statement section.
- [Calibration replay procedure] Clarify in the method description how the task-specific buffer is populated and maintained without data leakage across the sequential updates.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's significance and for the constructive feedback on the generality of our empirical claims. We address the single major comment below.
Point-by-point responses
- Referee: [Abstract and empirical evaluation] The headline 3.4× ± 0.5× coverage-to-accuracy loss ratio is obtained with particular nonconformity scores and the eight chosen task sequences. The manuscript does not report ablations that vary the score function (e.g., softmax margin vs. negative log-likelihood or entropy) or broaden the task distribution, so it is unclear whether the multiplier is a stable property of continual LLM fine-tuning or an artifact of the experimental choices. This directly affects the generality of the central empirical claim.
Authors: We agree that the reported multiplier is specific to the nonconformity scores and task sequences used. The manuscript already scopes the claim to 'classification-style settings we study' and 'primarily from classification and multiple-choice benchmarks,' but we acknowledge that additional evidence would strengthen generality. In the revised manuscript we will add an ablation section comparing the softmax-margin nonconformity score to negative log-likelihood and entropy-based alternatives on the same sequences. We will also expand the experimental discussion to characterize the diversity of the eight sequences and report results on two additional task sequences drawn from the same benchmark families to probe stability of the factor. These changes will be presented as supplementary evidence rather than altering the core observations or theorems.
revision: yes
Circularity Check
No significant circularity; empirical measurements and standard conformal guarantees are self-contained
The paper's central results consist of direct empirical measurements of coverage and accuracy degradation across fine-tuned models on fixed task sequences, which are independent of the proposed calibration replay procedure. The finite-sample recovery theorem and mixture-validity proposition are stated as holding exactly under the standard exchangeability assumption for task-specific buffers in classification settings, without reducing to any fitted parameters or self-referential definitions from the current work. The calibration replay refits thresholds post-hoc on held-out buffers, but the reported restoration to nominal coverage follows directly from conformal calibration mechanics rather than being derived as a prediction from prior inputs. No load-bearing self-citations, ansatzes, or uniqueness theorems are invoked, and the 3.4× degradation factor is an observed statistic from the experiments.
Axiom & Free-Parameter Ledger
free parameters (1)
- buffer size m
axioms (2)
- domain assumption: Finite-sample exact conformal validity holds under exchangeability when using task-specific buffers
- domain assumption: Pooled thresholds across tasks do not preserve validity under distribution drift
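The second axiom can be illustrated with a small simulation. This is a synthetic sketch, not an experiment from the paper: the two Gaussian score distributions, the shift of 1.5, the level `alpha = 0.1`, and all names are assumptions chosen only to show the mechanism. Task B's nonconformity scores are shifted upward, standing in for a task on which the current model has drifted.

```python
import math
import random


def conformal_threshold(scores, alpha):
    """Return the ceil((m + 1) * (1 - alpha))-th smallest calibration score."""
    ordered = sorted(scores)
    k = math.ceil((len(ordered) + 1) * (1 - alpha))
    return math.inf if k > len(ordered) else ordered[k - 1]


def coverage(test_scores, threshold):
    """Fraction of test scores at or below the threshold."""
    return sum(s <= threshold for s in test_scores) / len(test_scores)


rng = random.Random(0)
alpha, m, n_test = 0.1, 200, 20000
shifts = {"A": 0.0, "B": 1.5}  # synthetic per-task score shift


def draw(task, n):
    """Draw n synthetic nonconformity scores for a task."""
    return [rng.gauss(shifts[task], 1.0) for _ in range(n)]


calib = {t: draw(t, m) for t in shifts}
test = {t: draw(t, n_test) for t in shifts}

# One threshold fitted on the union of both buffers.
pooled_q = conformal_threshold(calib["A"] + calib["B"], alpha)

# results[task] = (coverage with pooled threshold, coverage with own threshold)
results = {
    t: (coverage(test[t], pooled_q),
        coverage(test[t], conformal_threshold(calib[t], alpha)))
    for t in shifts
}
for t, (pooled_cov, task_cov) in results.items():
    print(t, round(pooled_cov, 3), round(task_cov, 3))
```

In this setup the pooled threshold targets coverage on the mixture of the two tasks, so it should over-cover task A and under-cover task B, while each task-specific threshold should land near the nominal level on its own task.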