Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning

Anuj Sharma; Ibne Farabi Shihab; Sanjeda Akter

arxiv: 2604.23987 · v1 · submitted 2026-04-27 · 💻 cs.LG


Pith reviewed 2026-05-08 04:33 UTC · model grok-4.3


keywords: continual learning, conformal prediction, LLM fine-tuning, calibration, coverage, uncertainty estimation, lifelong learning


The pith

In lifelong fine-tuning of large language models, conformal coverage degrades substantially faster than accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that continual learning for LLMs is usually judged only by whether accuracy holds up after sequential updates, and that this misses an earlier failure in uncertainty reliability. Experiments across three model families and eight task sequences, drawn primarily from classification and multiple-choice benchmarks, show that conformal coverage loss is roughly 3.4 times larger than accuracy loss on average; in the most pronounced case, coverage falls from 0.92 to 0.61 while accuracy stays within three points of baseline. Standard accuracy-preserving methods do not automatically protect coverage, naive calibration baselines recover only part of the gap, and thresholds pooled across tasks do not suffice. The authors introduce calibration replay, a lightweight post-hoc step that keeps a small task-specific held-out buffer and refits the conformal threshold under the current model after each update. Supporting results include a drift decomposition, a finite-sample recovery theorem under exchangeability, and a proposition explaining why pooled (mixture) thresholds break down.

Core claim

In the classification-style continual learning settings studied, the drop in conformal coverage exceeds the drop in accuracy by a factor of roughly 3.4× on average across seeds. Calibration replay, which stores a modest held-out buffer per task and recomputes a task-specific conformal threshold after each model update, typically restores coverage to within two points of nominal at buffer size m=200, adds no training-time gradient cost, and uses less than one percent of the memory of ordinary experience replay.

What carries the argument

Calibration replay, a post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update.
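The procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `CalibrationReplay`, the `score_fn(x, y)` callback (a nonconformity score computed with the current model), and the default `alpha=0.1` are hypothetical choices made here, and the threshold rule is the standard split-conformal quantile.

```python
import math


def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((m + 1) * (1 - alpha))-th smallest score."""
    m = len(scores)
    k = math.ceil((m + 1) * (1 - alpha))
    if k > m:  # buffer too small for this alpha: only the trivial (full) set is valid
        return math.inf
    return sorted(scores)[k - 1]


class CalibrationReplay:
    """Hypothetical helper: one held-out buffer and one threshold per task."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.buffers = {}     # task id -> list of (x, y) held-out examples
        self.thresholds = {}  # task id -> threshold under the current model

    def add_task(self, task_id, held_out):
        # The buffer must be held out from fine-tuning for the guarantee to apply.
        self.buffers[task_id] = list(held_out)

    def refit(self, score_fn):
        """Call after every model update; score_fn(x, y) must use the updated model."""
        for task_id, buf in self.buffers.items():
            scores = [score_fn(x, y) for x, y in buf]
            self.thresholds[task_id] = conformal_threshold(scores, self.alpha)

    def prediction_set(self, task_id, x, labels, score_fn):
        """All candidate labels whose score is at or below the task's threshold."""
        q = self.thresholds[task_id]
        return [y for y in labels if score_fn(x, y) <= q]
```

After each fine-tuning step one would call `refit` with a scorer bound to the updated model. Training itself is untouched, which is consistent with the claim that the step adds no gradient cost.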

If this is right

Accuracy-focused continual learning methods leave coverage unprotected.
Pooled thresholds across tasks do not maintain validity because of distribution drift between tasks.
Calibration replay restores coverage without any training-time gradient cost.
The finite-sample theorem gives exact validity when exchangeability holds within each task buffer.
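
For orientation, the textbook split-conformal guarantee that results of this kind build on can be written as below. This is the standard statement, not the paper's theorem, whose exact form is not reproduced on this page; the symbols s_t, q̂_t, and C_t are notation chosen here.

```latex
% Assumed notation: s_t(x, y) is the nonconformity score under the current model,
% (X_i, Y_i), i = 1..m, is the task-t buffer, and (X, Y) is a fresh task-t test point.
\hat{q}_t = \text{the } \lceil (m+1)(1-\alpha) \rceil\text{-th smallest of }
            s_t(X_1, Y_1), \dots, s_t(X_m, Y_m),
\qquad
C_t(x) = \{\, y : s_t(x, y) \le \hat{q}_t \,\}.

% If the m buffer points and the test point are exchangeable
% (and, for the upper bound, the scores are almost surely distinct):
1 - \alpha \;\le\; \Pr\bigl[\, Y \in C_t(X) \,\bigr] \;\le\; 1 - \alpha + \frac{1}{m+1}.
```

With m = 200 the slack term 1/(m+1) is about 0.005. The guarantee is marginal over the draw of the buffer, and it says nothing once the buffer and test points stop being exchangeable.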

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Lifelong systems may need separate maintenance for calibration that is independent of accuracy preservation.
Similar coverage collapse could appear in open-ended generation tasks, though the paper treats those extensions as exploratory.
Alternative nonconformity scores might change the observed 3.4× degradation factor and should be compared directly.

Load-bearing premise

The finite-sample recovery theorem and the observed degradation factor both assume that task-specific buffers remain exchangeable with test points after later model updates and that the chosen tasks and score functions are representative.

What would settle it

Repeating the fine-tuning sequences on additional benchmarks or models and finding that average coverage loss is no larger than accuracy loss, or that calibration replay at m=200 fails to restore coverage, would falsify the central empirical claim.

Figures

Figures reproduced from arXiv: 2604.23987 by Anuj Sharma, Ibne Farabi Shihab, Sanjeda Akter.

**Figure 1.** Representative continual-learning trajectory on Pythia-1.4B through a six-task GLUE …

**Figure 2.** Score CDF drift predicts coverage loss; accuracy drift does not. Each point is one (task, …

**Figure 3.** Stale vs. refreshed coverage for Pythia-1.4B GLUE. Refreshed coverage recovers to a band …

**Figure 4.** ECE trajectories for Pythia-1.4B GLUE. Tasks with more label classes (MNLI) show larger …

Original abstract

Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more sharply than top-1 performance. We study this empirically by measuring conformal coverage and calibration error on sequentially fine-tuned models across three model families and eight task sequences drawn primarily from classification and multiple-choice benchmarks. Across the classification-style settings we study, coverage loss exceeds accuracy loss by a factor of roughly \(3.4\times \pm 0.5\times\) on average across seeds; in the most pronounced case, coverage drops from \(0.92\) to \(0.61\), while accuracy remains within three points of baseline. Standard continual-learning methods that preserve accuracy do not automatically preserve coverage, and naive calibration baselines recover only part of the gap. We propose calibration replay, a lightweight post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update. It adds no training-time gradient cost, uses less than one percent of the memory of ordinary experience replay, and typically restores coverage to within two points of nominal at buffer size \(m = 200\). We accompany the empirical study with a drift decomposition, a finite-sample recovery theorem showing exact conformal validity under exchangeability, and a mixture-validity proposition explaining why pooled thresholds do not suffice. Our guarantees are stated for classification-style tasks with task-specific buffers; extensions to open-ended generation are exploratory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Coverage in continual LLM fine-tuning can drop sharply before accuracy does, and a small task-specific buffer restores it with almost no extra cost.


Coverage reliability drops faster than accuracy during continual fine-tuning of LLMs. The paper shows this pattern holds across several models and task sequences, and offers a lightweight replay method to restore conformal coverage without much overhead. The new part is the direct comparison of coverage loss to accuracy loss, with the reported 3.4 times greater degradation on average. They document this on three model families and eight sequences from classification and multiple-choice benchmarks.

The calibration replay keeps a small held-out buffer per task and refits the conformal threshold after each fine-tuning step. It adds negligible training cost and under one percent of the memory of standard experience replay, yet typically gets coverage back within two points of the target at buffer size 200. They support it with a drift decomposition and a finite-sample theorem that guarantees exact validity when the buffer stays exchangeable. This is useful because standard continual learning techniques that hold accuracy steady do not automatically hold coverage steady. The method is post-hoc and cheap, which makes it easy to add on top of existing pipelines.

The main soft spot is that the degradation ratio depends on the chosen nonconformity scores and the specific task sequences. Without ablations on other scores like entropy or different benchmarks, it is hard to know if the 3.4 factor is general or tied to their setup. The theorem and propositions assume classification-style tasks with task-specific buffers, and they flag that open-ended generation remains exploratory. In real deployments the exchangeability assumption inside each buffer may not hold perfectly after fine-tuning shifts the model.

This work is for researchers focused on continual learning for LLMs who also need reliable uncertainty estimates, such as in safety-critical applications. Readers interested in conformal prediction applied to sequential adaptation will get concrete numbers and a practical procedure. It deserves peer review because the empirical finding is distinct from prior work and the proposed fix is simple enough to test quickly. I would recommend sending it to referees.

Referee Report

1 major / 2 minor

Summary. The paper claims that conformal coverage in LLMs degrades substantially earlier and more sharply than top-1 accuracy during sequential fine-tuning. Across three model families and eight task sequences (primarily classification and multiple-choice), it reports that coverage loss exceeds accuracy loss by a factor of roughly 3.4× ± 0.5× on average, with extreme cases showing coverage falling from 0.92 to 0.61 while accuracy stays within 3 points. Standard continual-learning methods fail to preserve coverage, and the authors propose calibration replay: a post-hoc procedure maintaining task-specific buffers of size m=200 to refit conformal thresholds under the current model. This is supported by a drift decomposition, a finite-sample recovery theorem establishing exact validity under exchangeability with task-specific buffers, and a mixture-validity proposition explaining failure of pooled thresholds. The method adds no training-time cost and uses <1% memory of experience replay.

Significance. If the patterns hold, the work is significant for showing that accuracy-centric continual learning evaluations are incomplete for LLMs, as uncertainty reliability can fail first. The calibration replay method is practical due to its negligible overhead. Credit is due for the finite-sample theorem providing exact guarantees (rather than asymptotic) and the mixture-validity proposition, which are load-bearing strengths. This could prompt reevaluation of continual fine-tuning benchmarks to include calibration metrics.

major comments (1)

[Abstract and empirical evaluation] The headline 3.4× ± 0.5× coverage-to-accuracy loss ratio is obtained with particular nonconformity scores and the eight chosen task sequences. The manuscript does not report ablations varying the score function (e.g., softmax margin vs. negative log-likelihood or entropy) or broadening the task distribution, so it is unclear whether the multiplier is a stable property of continual LLM fine-tuning or an artifact of the experimental choices; this directly affects the generality of the central empirical claim.
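
To make the requested ablation concrete, the sketch below shows how the named score families could be written for a classifier's output probabilities. The exact definitions used in the paper are not given on this page, so the function names and formulas are assumptions. In particular, predictive entropy does not depend on the candidate label, so the entropy variant here combines it with 1 - p[y], which is one illustrative choice among several.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_score(probs, y):
    """Negative log-likelihood of candidate label y (higher = less conforming)."""
    return -math.log(probs[y])


def margin_score(probs, y):
    """One common margin form: best competing probability minus p[y]."""
    rival = max(p for j, p in enumerate(probs) if j != y)
    return rival - probs[y]


def entropy(probs):
    """Shannon entropy of the predictive distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_scaled_score(probs, y):
    """Illustrative assumption: (1 - p[y]) inflated on high-entropy inputs."""
    return (1.0 - probs[y]) * (1.0 + entropy(probs))


probs = softmax([2.0, 1.0, 0.0])
for name, fn in [("nll", nll_score), ("margin", margin_score),
                 ("entropy-scaled", entropy_scaled_score)]:
    print(name, [round(fn(probs, y), 3) for y in range(len(probs))])
```

For a single input all three scores rank the candidate labels identically. They differ in how scores compare across inputs, which is what determines the fitted threshold and therefore the measured coverage.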

minor comments (2)

[Theorem and proposition statements] The finite-sample recovery theorem and mixture-validity proposition are stated for classification-style tasks with task-specific buffers; the manuscript should explicitly note the scope limitation and any exploratory status for open-ended generation in the theorem statement section.
[Calibration replay procedure] Clarify in the method description how the task-specific buffer is populated and maintained without data leakage across the sequential updates.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive feedback on the generality of our empirical claims. We address the single major comment below.


Referee: [Abstract and empirical evaluation] The headline 3.4× ± 0.5× coverage-to-accuracy loss ratio is obtained with particular nonconformity scores and the eight chosen task sequences. The manuscript does not report ablations varying the score function (e.g., softmax margin vs. negative log-likelihood or entropy) or broadening the task distribution, so it is unclear whether the multiplier is a stable property of continual LLM fine-tuning or an artifact of the experimental choices; this directly affects the generality of the central empirical claim.

Authors: We agree that the reported multiplier is specific to the nonconformity scores and task sequences used. The manuscript already scopes the claim to 'classification-style settings we study' and 'primarily from classification and multiple-choice benchmarks,' but we acknowledge that additional evidence would strengthen generality. In the revised manuscript we will add an ablation section comparing the softmax-margin nonconformity score to negative log-likelihood and entropy-based alternatives on the same sequences. We will also expand the experimental discussion to characterize the diversity of the eight sequences and report results on two additional task sequences drawn from the same benchmark families to probe stability of the factor. These changes will be presented as supplementary evidence rather than altering the core observations or theorems.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical measurements and standard conformal guarantees are self-contained


The paper's central results consist of direct empirical measurements of coverage and accuracy degradation across fine-tuned models on fixed task sequences, which are independent of the proposed calibration replay procedure. The finite-sample recovery theorem and mixture-validity proposition are stated as holding exactly under the standard exchangeability assumption for task-specific buffers in classification settings, without reducing to any fitted parameters or self-referential definitions from the current work. The calibration replay refits thresholds post-hoc on held-out buffers, but the reported restoration to nominal coverage follows directly from conformal calibration mechanics rather than being derived as a prediction from prior inputs. No load-bearing self-citations, ansatzes, or uniqueness theorems are invoked, and the 3.4× degradation factor is an observed statistic from the experiments.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard conformal-prediction exchangeability plus the new empirical observation and lightweight procedure; no new physical entities are introduced.

free parameters (1)

buffer size m
Chosen at 200 to restore coverage to within two points of nominal; acts as a practical hyperparameter rather than a fitted constant.
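
A rough sense of why a buffer of a few hundred examples lands within about two points can be had from a standard fact about split conformal prediction: with i.i.d., tie-free scores, the coverage realized by a threshold fitted on m points follows a Beta(k, m + 1 - k) distribution, where k = ⌈(m + 1)(1 - α)⌉. The sketch below evaluates its mean and standard deviation. The nominal level α = 0.1 is an assumption made here (the page does not state the level used), and `coverage_spread` is a hypothetical helper name.

```python
import math


def coverage_spread(m, alpha):
    """Mean and std of realized coverage for a buffer of m i.i.d., tie-free scores."""
    k = math.ceil((m + 1) * (1 - alpha))
    a, b = k, m + 1 - k  # realized coverage ~ Beta(a, b); requires k <= m
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


# alpha = 0.1 is assumed for illustration only.
for m in (50, 200, 1000):
    mean, std = coverage_spread(m, alpha=0.1)
    print(f"m={m:5d}  mean coverage={mean:.4f}  std={std:.4f}")
```

Under these assumptions the standard deviation at m = 200 is about 0.021, the same order as the two-point band quoted above. Larger buffers tighten it roughly as 1/√m.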

axioms (2)

domain assumption: Finite-sample exact conformal validity holds under exchangeability when using task-specific buffers
Invoked to justify the recovery theorem for classification-style tasks.
domain assumption: Pooled thresholds across tasks do not preserve validity under distribution drift
Basis for the mixture-validity proposition explaining why naive calibration fails.
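
The pooled-threshold failure is easy to reproduce in a toy simulation. The sketch below is an illustration under invented assumptions, not the paper's experiment: true-label nonconformity scores are drawn as Uniform(0, 1) on a hypothetical task A and Uniform(0, 2) on a hypothetical task B, and α = 0.1 is assumed.

```python
import math
import random


def conformal_threshold(scores, alpha):
    """ceil((m + 1) * (1 - alpha))-th smallest score, clipped to the largest score."""
    m = len(scores)
    k = math.ceil((m + 1) * (1 - alpha))
    return sorted(scores)[min(k, m) - 1]


def coverage(test_scores, q):
    """Fraction of test points whose true-label score falls under the threshold."""
    return sum(s <= q for s in test_scores) / len(test_scores)


rng = random.Random(0)
alpha = 0.1
# Toy assumption: the two tasks have differently scaled score distributions.
scale = {"A": 1.0, "B": 2.0}
calib = {t: [rng.uniform(0, s) for _ in range(2000)] for t, s in scale.items()}
test = {t: [rng.uniform(0, s) for _ in range(20000)] for t, s in scale.items()}

# One threshold fitted on the pooled buffers versus one threshold per task.
q_pooled = conformal_threshold(calib["A"] + calib["B"], alpha)
pooled = {t: coverage(test[t], q_pooled) for t in scale}
per_task = {t: coverage(test[t], conformal_threshold(calib[t], alpha)) for t in scale}

print("pooled threshold:        ", pooled)
print("task-specific thresholds:", per_task)
```

In this toy the pooled threshold settles near 1.6, so task A is covered essentially always and task B only about 80% of the time. Coverage averaged over the mixture still looks nominal, which is how a pooled threshold can hide per-task under-coverage, while the task-specific thresholds sit near 90% on both tasks.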



