
arxiv: 2604.23987 · v1 · submitted 2026-04-27 · 💻 cs.LG

Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning

Pith reviewed 2026-05-08 04:33 UTC · model grok-4.3

classification 💻 cs.LG
keywords: continual learning · conformal prediction · LLM fine-tuning · calibration · coverage · uncertainty estimation · lifelong learning

The pith

In lifelong fine-tuning of large language models, conformal coverage degrades substantially faster than accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that continual learning for LLMs is usually judged only by whether accuracy holds up after sequential updates, and that this misses an earlier failure in uncertainty reliability. Experiments across three model families and eight task sequences, drawn primarily from classification and multiple-choice benchmarks, show that conformal coverage loss is roughly 3.4 times larger than accuracy loss on average; in the most pronounced case, coverage falls from 0.92 to 0.61 while accuracy stays within three points of baseline. Standard accuracy-preserving methods do not automatically protect coverage, and naive calibration baselines recover only part of the gap. The authors introduce calibration replay, a lightweight post-hoc step that keeps a small task-specific held-out buffer and refits that task's conformal threshold under the current model after each update. Supporting results include a drift decomposition, a finite-sample recovery theorem under exchangeability, and a proposition explaining why pooled (mixture) thresholds do not suffice.

Core claim

In the classification-style continual learning settings studied, the drop in conformal coverage exceeds the drop in accuracy by a factor of roughly 3.4× on average across seeds. Calibration replay, which stores a modest held-out buffer per task and recomputes a task-specific conformal threshold after each model update, typically restores coverage to within two points of nominal at buffer size m=200 while adding no training-time gradient cost and using less than one percent of the memory of ordinary experience replay.
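To make the ratio concrete, here is a minimal sketch of the arithmetic. It assumes the ratio is a coverage drop divided by an accuracy drop; the source does not say how the paper aggregates across tasks and seeds, so `loss_ratio` is a hypothetical helper. The coverage values 0.92 and 0.61 come from the abstract; the accuracy values are made up to sit exactly at the three-point bound.

```python
def loss_ratio(cov_before, cov_after, acc_before, acc_after):
    """Coverage drop divided by accuracy drop (both as fractions)."""
    cov_loss = cov_before - cov_after
    acc_loss = acc_before - acc_after
    if acc_loss <= 0:
        raise ValueError("accuracy did not drop; the ratio is undefined")
    return cov_loss / acc_loss


# Most pronounced case reported in the abstract: coverage 0.92 -> 0.61.
# The accuracy numbers are hypothetical, chosen to lose exactly three points.
ratio = loss_ratio(0.92, 0.61, acc_before=0.85, acc_after=0.82)
print(f"coverage loss is about {ratio:.1f}x the accuracy loss")
```

Under these assumed accuracy numbers the single-case ratio is a little above 10, so the 3.4× figure is best read as an average over many milder cases rather than a description of the worst one.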

What carries the argument

Calibration replay, a post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update.
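The procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class name `CalibrationReplay`, the `score_fn` callable, and the default miscoverage level `alpha=0.1` are all hypothetical, and the threshold rule is the standard split-conformal quantile.

```python
import math
import numpy as np


def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((m+1)(1-alpha))-th smallest score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    m = scores.size
    k = math.ceil((m + 1) * (1 - alpha))
    if k > m:
        return math.inf  # buffer too small for this alpha: every label is kept
    return float(scores[k - 1])


class CalibrationReplay:
    """Keeps one small held-out buffer per task and refits thresholds on demand."""

    def __init__(self, alpha=0.1):  # alpha is an assumed nominal miscoverage
        self.alpha = alpha
        self.buffers = {}      # task_id -> (inputs, labels)
        self.thresholds = {}   # task_id -> float

    def add_task(self, task_id, inputs, labels):
        # Buffer examples are assumed to be held out from fine-tuning.
        self.buffers[task_id] = (inputs, labels)

    def refit(self, score_fn):
        """Recompute every task's threshold under the current model.

        score_fn(inputs, labels) is a hypothetical callable that returns the
        nonconformity score of each true label under the current model. It
        needs forward passes only, so no gradients are involved.
        """
        for task_id, (inputs, labels) in self.buffers.items():
            scores = score_fn(inputs, labels)
            self.thresholds[task_id] = conformal_threshold(scores, self.alpha)
        return self.thresholds

    def prediction_set(self, task_id, label_scores):
        """Indices of labels whose score is at or below the task's threshold."""
        q = self.thresholds[task_id]
        return [k for k, s in enumerate(label_scores) if s <= q]
```

In use, `refit` would be called once after each fine-tuning step, with `score_fn` wrapping whatever model is current. The only state carried between updates is the per-task buffer, which is consistent with the small memory footprint the summary describes.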

If this is right

  • Accuracy-focused continual learning methods leave coverage unprotected.
  • Pooled thresholds across tasks do not maintain validity because of distribution drift between tasks.
  • Calibration replay restores coverage without any training-time gradient cost.
  • The finite-sample theorem gives exact conformal validity when each task's buffer is exchangeable with that task's test points.
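For orientation, the standard split-conformal guarantee has the following form. It is given here as background; the source does not say whether the paper's recovery theorem is stated in exactly these terms. The notation is introduced for illustration: $S_1, \dots, S_m$ are the buffer scores for task $t$ under the current model, $S_{m+1}$ is the score of a test point from the same task, and $\hat{q}_t$ is the $\lceil (m+1)(1-\alpha) \rceil$-th smallest buffer score.

```latex
% Lower bound: holds whenever S_1, ..., S_{m+1} are exchangeable.
% Upper bound: additionally requires the scores to be almost surely distinct.
1 - \alpha \;\le\; \Pr\bigl( S_{m+1} \le \hat{q}_t \bigr) \;\le\; 1 - \alpha + \frac{1}{m+1}
```

With $m = 200$ the slack term is $1/201 \approx 0.005$, so the bound allows at most about half a point of over-coverage. The probability is taken over the draw of the buffer as well as the test point, so coverage measured for one fixed buffer can still deviate from this band.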

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Lifelong systems may need separate maintenance for calibration that is independent of accuracy preservation.
  • Similar coverage collapse could appear in open-ended generation tasks, though the paper treats those extensions as exploratory.
  • Alternative nonconformity scores might change the observed 3.4× degradation factor and should be compared directly.

Load-bearing premise

The finite-sample recovery theorem and the observed degradation factor both assume that task-specific buffers remain exchangeable with test points after later model updates and that the chosen tasks and score functions are representative.

What would settle it

Repeating the fine-tuning sequences on additional benchmarks or models and finding that average coverage loss does not exceed accuracy loss, or that calibration replay at m=200 fails to restore coverage to within two points of nominal, would falsify the central empirical claim.

Figures

Figures reproduced from arXiv: 2604.23987 by Anuj Sharma, Ibne Farabi Shihab, Sanjeda Akter.

Figure 1: Representative continual-learning trajectory on Pythia-1.4B through a six-task GLUE …
Figure 2: Score CDF drift predicts coverage loss; accuracy drift does not. Each point is one (task, …
Figure 3: Stale vs. refreshed coverage for Pythia-1.4B GLUE. Refreshed coverage recovers to a band …
Figure 4: ECE trajectories for Pythia-1.4B GLUE. Tasks with more label classes (MNLI) show larger …
Original abstract

Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more sharply than top-1 performance. We study this empirically by measuring conformal coverage and calibration error on sequentially fine-tuned models across three model families and eight task sequences drawn primarily from classification and multiple-choice benchmarks. Across the classification-style settings we study, coverage loss exceeds accuracy loss by a factor of roughly \(3.4\times \pm 0.5\times\) on average across seeds; in the most pronounced case, coverage drops from \(0.92\) to \(0.61\), while accuracy remains within three points of baseline. Standard continual-learning methods that preserve accuracy do not automatically preserve coverage, and naive calibration baselines recover only part of the gap. We propose calibration replay, a lightweight post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update. It adds no training-time gradient cost, uses less than one percent of the memory of ordinary experience replay, and typically restores coverage to within two points of nominal at buffer size \(m = 200\). We accompany the empirical study with a drift decomposition, a finite-sample recovery theorem showing exact conformal validity under exchangeability, and a mixture-validity proposition explaining why pooled thresholds do not suffice. Our guarantees are stated for classification-style tasks with task-specific buffers; extensions to open-ended generation are exploratory.
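The two quantities the abstract tracks, conformal coverage and calibration error, can be measured along these lines. This is a hedged sketch: the function names are hypothetical, and the equal-width, ten-bin form of expected calibration error (ECE) is a common convention, not something the source confirms the paper uses.

```python
import numpy as np


def empirical_coverage(true_label_scores, threshold):
    """Fraction of test points whose true-label score is within the threshold."""
    return float(np.mean(np.asarray(true_label_scores) <= threshold))


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b holds confidences in (edges[b], edges[b + 1]]; 0.0 falls in bin 0.
    bins = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the share of points in the bin
    return float(ece)
```

Coverage is evaluated against a threshold fitted elsewhere, so the same function can score both a stale threshold and a refitted one on the same test set.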

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that conformal coverage in LLMs degrades earlier and more sharply than top-1 accuracy during sequential fine-tuning. Across three model families and eight task sequences (primarily classification and multiple-choice), it reports that coverage loss exceeds accuracy loss by a factor of roughly 3.4× ± 0.5× on average; in the most pronounced case, coverage falls from 0.92 to 0.61 while accuracy stays within three points of baseline. Standard continual-learning methods do not automatically preserve coverage, and the authors propose calibration replay: a post-hoc procedure that maintains task-specific held-out buffers and refits task-specific conformal thresholds under the current model, typically restoring coverage to within two points of nominal at buffer size m=200. This is supported by a drift decomposition, a finite-sample recovery theorem establishing exact validity under exchangeability with task-specific buffers, and a mixture-validity proposition explaining why pooled thresholds do not suffice. The method adds no training-time gradient cost and uses less than 1% of the memory of experience replay.

Significance. If the patterns hold, the work is significant for showing that accuracy-centric continual learning evaluations are incomplete for LLMs, as uncertainty reliability can fail first. The calibration replay method is practical due to its negligible overhead. Credit is due for the finite-sample theorem providing exact guarantees (rather than asymptotic) and the mixture-validity proposition, which are load-bearing strengths. This could prompt reevaluation of continual fine-tuning benchmarks to include calibration metrics.

major comments (1)
  1. [Abstract and empirical evaluation] The headline 3.4× ± 0.5× coverage-to-accuracy loss ratio is obtained with particular nonconformity scores and the eight chosen task sequences. The manuscript does not report ablations varying the score function (e.g., softmax margin vs. negative log-likelihood or entropy) or broadening the task distribution, so it is unclear whether the multiplier is a stable property of continual LLM fine-tuning or an artifact of the experimental choices; this directly affects the generality of the central empirical claim.
minor comments (2)
  1. [Theorem and proposition statements] The finite-sample recovery theorem and mixture-validity proposition are stated for classification-style tasks with task-specific buffers; the manuscript should explicitly note the scope limitation and any exploratory status for open-ended generation in the theorem statement section.
  2. [Calibration replay procedure] Clarify in the method description how the task-specific buffer is populated and maintained without data leakage across the sequential updates.
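The score-function ablation requested in the major comment would compare definitions like the two below. These are common textbook forms written as a sketch; the source does not give the paper's exact definition of its softmax-margin score, so both functions are assumptions. An entropy-based score is left out because the source does not say how it would be made label-dependent.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def nll_score(probs, labels):
    """Negative log-probability of the candidate label (higher = stranger)."""
    idx = np.arange(len(labels))
    return -np.log(probs[idx, labels])


def margin_score(probs, labels):
    """Best competing probability minus the candidate label's probability."""
    idx = np.arange(len(labels))
    p_y = probs[idx, labels]
    others = probs.copy()
    others[idx, labels] = -np.inf  # mask the candidate so max() skips it
    return others.max(axis=-1) - p_y
```

Both scores are low when the model favours the candidate label, so either can be passed to the same threshold-fitting routine. That is what makes a like-for-like comparison of the degradation factor possible.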

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive feedback on the generality of our empirical claims. We address the single major comment below.

  1. Referee: [Abstract and empirical evaluation] The headline 3.4× ± 0.5× coverage-to-accuracy loss ratio is obtained with particular nonconformity scores and the eight chosen task sequences. The manuscript does not report ablations varying the score function (e.g., softmax margin vs. negative log-likelihood or entropy) or broadening the task distribution, so it is unclear whether the multiplier is a stable property of continual LLM fine-tuning or an artifact of the experimental choices; this directly affects the generality of the central empirical claim.

    Authors: We agree that the reported multiplier is specific to the nonconformity scores and task sequences used. The manuscript already scopes the claim to 'classification-style settings we study' and 'primarily from classification and multiple-choice benchmarks,' but we acknowledge that additional evidence would strengthen generality. In the revised manuscript we will add an ablation section comparing the softmax-margin nonconformity score to negative log-likelihood and entropy-based alternatives on the same sequences. We will also expand the experimental discussion to characterize the diversity of the eight sequences and report results on two additional task sequences drawn from the same benchmark families to probe stability of the factor. These changes will be presented as supplementary evidence rather than altering the core observations or theorems.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical measurements and standard conformal guarantees are self-contained


The paper's central results consist of direct empirical measurements of coverage and accuracy degradation across fine-tuned models on fixed task sequences, which are independent of the proposed calibration replay procedure. The finite-sample recovery theorem and mixture-validity proposition are stated as holding exactly under the standard exchangeability assumption for task-specific buffers in classification settings, without reducing to any fitted parameters or self-referential definitions from the current work. The calibration replay refits thresholds post-hoc on held-out buffers, but the reported restoration to nominal coverage follows directly from conformal calibration mechanics rather than being derived as a prediction from prior inputs. No load-bearing self-citations, ansatzes, or uniqueness theorems are invoked, and the 3.4× degradation factor is an observed statistic from the experiments.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard conformal-prediction exchangeability plus the new empirical observation and lightweight procedure; no new physical entities are introduced.

free parameters (1)
  • buffer size m
    Chosen at 200 to restore coverage to within two points of nominal; acts as a practical hyperparameter rather than a fitted constant.
axioms (2)
  • domain assumption: Finite-sample exact conformal validity holds under exchangeability when using task-specific buffers
    Invoked to justify the recovery theorem for classification-style tasks.
  • domain assumption: Pooled thresholds across tasks do not preserve validity under distribution drift
    Basis for the mixture-validity proposition explaining why naive calibration fails.
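One way to see why a pooled threshold can fail per task is a small simulation. Everything here is an illustrative assumption: the two uniform score distributions stand in for tasks whose true-label scores have drifted apart, and `alpha`, `m`, and the task names are hypothetical. It is a toy model, not the paper's proposition.

```python
import math
import numpy as np


def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((m+1)(1-alpha))-th smallest score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    k = math.ceil((scores.size + 1) * (1 - alpha))
    return math.inf if k > scores.size else float(scores[k - 1])


rng = np.random.default_rng(0)
alpha, m, n_test = 0.1, 200, 20_000

# Hypothetical true-label score distributions for two tasks after drift.
draw = {
    "A": lambda n: rng.uniform(0.0, 1.0, n),
    "B": lambda n: rng.uniform(0.0, 3.0, n),
}
buffers = {t: draw[t](m) for t in draw}       # held-out calibration buffers
tests = {t: draw[t](n_test) for t in draw}    # fresh test points per task

# One threshold fitted on the pooled buffers vs. one threshold per task.
pooled_q = conformal_threshold(np.concatenate(list(buffers.values())), alpha)
task_q = {t: conformal_threshold(buffers[t], alpha) for t in draw}

pooled_cov = {t: float(np.mean(tests[t] <= pooled_q)) for t in draw}
task_cov = {t: float(np.mean(tests[t] <= task_q[t])) for t in draw}

for t in draw:
    print(f"task {t}: pooled={pooled_cov[t]:.3f}  task-specific={task_cov[t]:.3f}")
```

In this toy setup the mixture's 0.9 quantile sits at 2.4 in the population limit, which covers task A entirely and task B only about 80% of the time. The pooled threshold is therefore valid on average over the mixture while under-covering the harder task, whereas the per-task thresholds each land near the nominal level.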

pith-pipeline@v0.9.0 · 5576 in / 1248 out tokens · 39480 ms · 2026-05-08T04:33:12.431267+00:00 · methodology

