Inverse-Hessian Regularization for Continual Learning in ASR

Hugo Van hamme; Steven Vander Eeckt

arxiv: 2601.14751 · v1 · pith:NVDKCNLInew · submitted 2026-01-21 · 📡 eess.AS


Pith reviewed 2026-05-16 12:37 UTC · model grok-4.3

classification 📡 eess.AS

keywords: continual learning, automatic speech recognition, inverse Hessian regularization, catastrophic forgetting, model merging, Kronecker factorization, loss landscape curvature


The pith

Inverse-Hessian Regularization adjusts post-fine-tuning ASR updates with prior-task curvature to limit forgetting while preserving adaptability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Inverse-Hessian Regularization as a memory-free continual learning method for automatic speech recognition. After fine-tuning a model on new data, it modifies the update by applying a Kronecker-factored inverse-Hessian approximation drawn from the previous task. This steers parameter changes toward directions that least damage earlier performance. The technique improves on simple weight-averaging approaches by explicitly using loss-landscape curvature rather than treating all directions equally. Experiments on two standard benchmarks show reduced forgetting together with better adaptation to new domains.

Core claim

After fine-tuning on a new task, the adaptation is adjusted through a Kronecker-factored inverse Hessian approximation of the previous task, ensuring that the model moves primarily in directions less harmful to past performance, while keeping the method lightweight.
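
A minimal NumPy sketch of the kind of per-layer adjustment this describes, under stated assumptions: the prior-task curvature of one linear layer is approximated as A ⊗ G (A from layer inputs, G from output gradients, the usual K-FAC convention), and the fine-tuning update is preconditioned by the damped inverse factors. The function name `kfac_adjust`, the damping term, and the optional norm rescaling are illustrative choices, not the paper's formulation.

```python
import numpy as np

def kfac_adjust(delta_w, a_factor, g_factor, damping=1e-3, keep_norm=True):
    """Precondition a layer update by damped inverse Kronecker factors.

    delta_w:  (out, in)  fine-tuning update, new weights minus old weights
    a_factor: (in, in)   input second-moment factor from the previous task
    g_factor: (out, out) output-gradient second-moment factor from the previous task
    """
    a_damped = a_factor + damping * np.eye(a_factor.shape[0])
    g_damped = g_factor + damping * np.eye(g_factor.shape[0])
    # (A ⊗ G)^{-1} vec(dW) equals vec(G^{-1} dW A^{-1}) for symmetric A
    # (column-major vec), so the large Kronecker product is never formed.
    left = np.linalg.solve(g_damped, delta_w)        # G^{-1} dW
    adjusted = np.linalg.solve(a_damped, left.T).T   # (G^{-1} dW) A^{-1}
    if keep_norm:
        # Assumed (hypothetical) choice: keep the update length, change only its direction.
        adjusted = adjusted * (np.linalg.norm(delta_w) / np.linalg.norm(adjusted))
    return adjusted

# Stand-in statistics; in practice they would be collected on previous-task data.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))   # layer inputs
g = rng.normal(size=(256, 3))   # gradients with respect to layer outputs
a_factor = x.T @ x / len(x)
g_factor = g.T @ g / len(g)
delta_w = rng.normal(size=(3, 4))
adjusted = kfac_adjust(delta_w, a_factor, g_factor)
```

Working with the two small factors instead of the full (out·in) × (out·in) matrix is what keeps this kind of step lightweight.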

What carries the argument

Kronecker-factored inverse-Hessian approximation applied in the merging step to incorporate curvature information from the prior task.

If this is right

ASR models can be updated sequentially across domains with smaller drops on previously learned conditions.
Memory-free continual learning becomes competitive with methods that store past data or gradients.
Weight-averaging merges can be strengthened by folding in second-order information without added storage cost.
The same adjustment step can be applied after each new fine-tuning round as the number of tasks grows.
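
The contrast between plain weight averaging and a curvature-aware merge can be illustrated on a toy quadratic. This is a sketch under assumptions: the prior-task loss is taken to be locally quadratic around the old parameters, the 2-D Hessian and the averaging coefficient of 0.5 are hypothetical, and the curvature-aware step is simply an inverse-Hessian-preconditioned update rescaled to the same length.

```python
import numpy as np

# Hypothetical 2-D example: the prior task is very sensitive along axis 0
# and nearly flat along axis 1.
hessian_prev = np.diag([100.0, 1.0])
theta_prev = np.zeros(2)
theta_finetuned = np.array([1.0, 1.0])
delta = theta_finetuned - theta_prev

def prior_loss_increase(step):
    """Second-order estimate of the prior-task loss increase for a parameter step."""
    return 0.5 * step @ hessian_prev @ step

# Plain weight averaging: move halfway along the raw fine-tuning update.
step_avg = 0.5 * delta

# Curvature-aware merge: precondition by the inverse Hessian,
# then match the length of the averaging step for a fair comparison.
step_curv = np.linalg.solve(hessian_prev, delta)
step_curv *= np.linalg.norm(step_avg) / np.linalg.norm(step_curv)

print(prior_loss_increase(step_avg), prior_loss_increase(step_curv))
```

The curvature-aware step gives up most of its movement along the sensitive axis, which is also where any cost to new-task adaptation would show up.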

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The curvature-guided merge could be tested on other sequence-to-sequence tasks such as machine translation or text-to-speech.
Replacing the Kronecker factorization with a cheaper diagonal approximation might trade some accuracy for speed on very large models.
Combining the regularization with selective replay of a small number of past utterances could compound the forgetting reduction.

Load-bearing premise

The Kronecker-factored inverse-Hessian approximation sufficiently captures the loss-landscape curvature of earlier ASR tasks to steer updates safely.

What would settle it

Re-running the reported benchmarks with IHR and finding equal or higher forgetting rates than plain weight averaging would disprove the claimed benefit.
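
Such a comparison needs a forgetting measure. The excerpted results report backward transfer (BWT), so a small sketch of one common way to compute it from a matrix of word error rates follows. The sign convention (negative means forgetting) is an assumption chosen to match the "BWT ↑" reporting, and the numbers are placeholders, not results from the paper.

```python
import numpy as np

def backward_transfer(wer):
    """Average change in WER on earlier tasks after the final task is learned.

    wer[j, i] is the WER on task i measured after training on task j.
    Assumed sign convention: negative values mean forgetting (WER went up).
    """
    wer = np.asarray(wer, dtype=float)
    last = wer.shape[0] - 1
    diffs = [wer[i, i] - wer[last, i] for i in range(last)]
    return float(np.mean(diffs))

nan = float("nan")  # entries for tasks not yet seen are unused
# Placeholder numbers for illustration only.
wer_matrix = [
    [10.0, nan, nan],
    [11.0, 12.0, nan],
    [12.0, 13.0, 9.0],
]
bwt = backward_transfer(wer_matrix)
```

Computing this value for IHR and for plain weight averaging on the same task sequence gives the two numbers the test above would compare.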

Original abstract

Catastrophic forgetting remains a major challenge for continual learning (CL) in automatic speech recognition (ASR), where models must adapt to new domains without losing performance on previously learned conditions. Several CL methods have been proposed for ASR, and, recently, weight averaging - where models are averaged in a merging step after fine-tuning - has proven effective as a simple memory-free strategy. However, it is heuristic in nature and ignores the underlying loss landscapes of the tasks, hindering adaptability. In this work, we propose Inverse Hessian Regularization (IHR), a memory-free approach for CL in ASR that incorporates curvature information into the merging step. After fine-tuning on a new task, the adaptation is adjusted through a Kronecker-factored inverse Hessian approximation of the previous task, ensuring that the model moves primarily in directions less harmful to past performance, while keeping the method lightweight. We evaluate IHR on two CL benchmarks and show that it significantly outperforms state-of-the-art baselines, reducing forgetting while improving adaptability. Ablation studies and analyses further confirm its effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This replaces plain weight averaging in ASR continual learning with a K-FAC inverse-Hessian adjustment after fine-tuning, and reports better retention plus adaptability on two benchmarks.


The main advance is taking the recent weight-merging approach for memory-free CL in ASR and making the merge step curvature-aware. After fine-tuning on the new domain they scale the update using a Kronecker-factored inverse Hessian from the prior task so that parameter changes avoid directions that would hurt old performance most. That is a direct, lightweight extension of the heuristic averaging baseline and stays within the same memory-free regime. They evaluate on two standard CL benchmarks for ASR, show reduced forgetting while preserving or improving new-task accuracy, and include ablations that isolate the Hessian term as the source of the gain. Those results are the concrete contribution worth noting. The experimental reporting is still thin: no error bars, no clear statement of baseline re-implementations, and no mention of statistical tests, so the size and reliability of the improvement are hard to judge from the abstract alone. The K-FAC approximation itself drops cross-layer covariances, which are likely relevant in Conformer or attention-based ASR models; without an ablation against a denser curvature estimate or an error bound, the claim that updates stay “primarily in directions less harmful” rests on an unverified assumption. The work is incremental but cleanly executed on its own terms and engages the existing CL and Hessian literature without circularity. It is the sort of targeted practical tweak that groups working on deployable speech systems would want to examine. I would send it to peer review so the full experimental protocol, variance numbers, and any checks on the approximation quality can be verified.

Referee Report

2 major / 2 minor

Summary. The paper proposes Inverse-Hessian Regularization (IHR), a memory-free continual learning method for ASR. After fine-tuning on a new task, the model update is adjusted via a Kronecker-factored inverse-Hessian approximation of the prior task's loss landscape so that the adaptation moves primarily along directions that increase prior-task loss least. The method is evaluated on two CL benchmarks and reported to significantly outperform weight-averaging and other baselines while remaining lightweight.

Significance. If the central claim holds, IHR would supply a principled, curvature-aware alternative to heuristic merging in ASR continual learning, potentially improving the forgetting-adaptability trade-off without storing data or full Hessians.

major comments (2)

[§3.2, Eq. (3)–(5)] The K-FAC factorization (A ⊗ G per layer, inverted independently) discards inter-layer covariances that are pronounced in ASR models containing shared embeddings, attention, and Conformer blocks. No error bound, comparison to a more accurate curvature estimator, or ablation quantifying the resulting mismatch in “safe direction” identification is provided, leaving the guarantee that updates are “primarily in directions less harmful” dependent on an unverified approximation.
[§5.1–5.3] The reported outperformance on the two benchmarks lacks error bars, statistical significance tests, exact baseline re-implementation details, and hyper-parameter sensitivity analysis. Without these, the magnitude of improvement cannot be assessed as robust support for the central claim.
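
The concern in the first major comment can be made concrete with a toy check of how much a layer-wise (block-diagonal) curvature estimate changes the preconditioned direction once cross-layer terms are present. This is a sketch under assumptions: a hypothetical two-"layer" model with two parameters per layer, hand-picked curvature blocks, and a `coupling` knob that scales the cross-layer block. It says nothing about the actual size of such terms in a Conformer.

```python
import numpy as np

def preconditioned_direction(hessian, update):
    """Unit-norm direction obtained by applying the inverse Hessian to an update."""
    direction = np.linalg.solve(hessian, update)
    return direction / np.linalg.norm(direction)

def blockdiag_vs_full_cosine(coupling):
    """Cosine between the full and the layer-wise preconditioned directions.

    `coupling` scales the cross-layer curvature block that a layer-wise
    estimate drops; all matrices here are hypothetical.
    """
    h11 = np.array([[4.0, 1.0], [1.0, 3.0]])   # layer 1 curvature
    h22 = np.array([[2.0, 0.5], [0.5, 5.0]])   # layer 2 curvature
    cross = coupling * np.eye(2)               # cross-layer curvature
    zeros = np.zeros((2, 2))
    full = np.block([[h11, cross], [cross.T, h22]])
    layerwise = np.block([[h11, zeros], [zeros, h22]])
    update = np.array([1.0, -1.0, 0.5, 2.0])
    return float(preconditioned_direction(full, update)
                 @ preconditioned_direction(layerwise, update))

for coupling in (0.0, 0.5, 1.5):
    print(coupling, blockdiag_vs_full_cosine(coupling))
```

With zero coupling the two directions coincide; as the coupling grows, the layer-wise direction drifts away from the one the full curvature would select. An ablation of that drift on real models is what the comment asks for.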

minor comments (2)

[Abstract] The two benchmarks are not named; adding their identities would improve immediate readability.
[§3] Notation: the symbol for the inverse-Hessian approximation is introduced without an explicit definition equation; a single numbered equation would remove ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the planned revisions.


Referee: [§3.2, Eq. (3)–(5)] The K-FAC factorization (A ⊗ G per layer, inverted independently) discards inter-layer covariances that are pronounced in ASR models containing shared embeddings, attention, and Conformer blocks. No error bound, comparison to a more accurate curvature estimator, or ablation quantifying the resulting mismatch in “safe direction” identification is provided, leaving the guarantee that updates are “primarily in directions less harmful” dependent on an unverified approximation.

Authors: We acknowledge that layer-wise K-FAC is an approximation that neglects inter-layer covariances, a known limitation when applied to architectures with shared parameters such as Conformer-based ASR models. This choice is driven by the need for a memory-efficient, scalable curvature estimate; a full Hessian or an exact (unfactored) block-diagonal alternative would be prohibitive. In the revision we will (i) add an explicit discussion of this approximation and its relation to prior K-FAC usage in continual learning, and (ii) include an ablation that replaces the Kronecker factors with a diagonal inverse-Hessian baseline to quantify the practical benefit of the factorization on the two benchmarks.

Revision: partial

Referee: [§5.1–5.3] The reported outperformance on the two benchmarks lacks error bars, statistical significance tests, exact baseline re-implementation details, and hyper-parameter sensitivity analysis. Without these, the magnitude of improvement cannot be assessed as robust support for the central claim.

Authors: We agree that stronger statistical reporting is required. In the revised manuscript we will re-run all experiments with at least five random seeds, report mean and standard deviation, perform paired statistical significance tests against the strongest baselines, supply exact hyper-parameter values and re-implementation notes for every baseline, and add a dedicated sensitivity analysis subsection for the regularization coefficient and merging step size.

Revision: yes
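
One way the promised per-seed reporting could look is sketched below, assuming one WER per seed for each method and an exact paired sign-flip permutation test. The choice of test, the function name, and the numbers are all assumptions for illustration; none of them come from the paper or the rebuttal.

```python
import itertools
import numpy as np

def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired per-seed differences."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    count = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in itertools.product((1.0, -1.0), repeat=len(diffs)):
        total += 1
        if abs((diffs * np.array(signs)).mean()) >= observed - 1e-12:
            count += 1
    return count / total

# Placeholder per-seed WERs for illustration only, not results from the paper.
wer_method = [20.1, 20.4, 19.9, 20.2, 20.0]
wer_baseline = [20.9, 20.8, 21.0, 20.7, 20.6]

mean_method = np.mean(wer_method)
std_method = np.std(wer_method, ddof=1)   # sample standard deviation
p_value = paired_sign_flip_pvalue(wer_method, wer_baseline)
```

With five seeds this exact test cannot return a two-sided p-value below 2/32 = 0.0625, even when the method wins on every seed, so the number of seeds and the choice of test need to be planned together.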

Circularity Check

0 steps flagged

No significant circularity; IHR applies standard K-FAC to merging without reducing to self-inputs


The paper defines IHR as post-fine-tuning adjustment of the adaptation step using a Kronecker-factored inverse-Hessian approximation drawn from the previous task's loss landscape. This is a direct, non-circular extension of established curvature approximations (K-FAC) into the weight-averaging merge; no equation equates a fitted parameter to its own prediction, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled. Empirical results on two benchmarks plus ablations provide independent verification. The derivation chain remains self-contained against external curvature estimators.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that a Kronecker-factored inverse Hessian provides a useful curvature signal for ASR loss landscapes; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: the Kronecker-factored approximation accurately represents task curvature for safe model merging in ASR.
Invoked to justify the lightweight adjustment step after fine-tuning.

pith-pipeline@v0.9.0 · 5478 in / 1179 out tokens · 56853 ms · 2026-05-16T12:37:01.098594+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

the adaptation is adjusted through a Kronecker-factored inverse Hessian approximation of the previous task

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

[1] INTRODUCTION. Automatic speech recognition (ASR) systems are widely deployed in everyday applications, from voice assistants to transcription services. To be accurate and inclusive, they must adapt to new speakers, accents, domains, or recording conditions. Yet such adaptation often causes catastrophic forgetting [1], where performance on previously learne...

[2] CONTINUAL LEARNING FOR ASR. 2.1. ASR Model. We consider an encoder–decoder ASR model. Given an utterance $X \in \mathbb{R}^{l \times d_i}$ of $l$ acoustic frames of dimension $d_i$, the model predicts a sequence $\hat{y}$ of $\tilde{w}$ word pieces. Parameters are denoted by $\theta \in \mathbb{R}^N$. The model is trained (and predicts) in a hybrid fashion [12], combining a CTC loss and a decoder cross-entropy loss with weight...

[3] INVERSE HESSIAN REGULARIZATION. After fine-tuning the model on task $t$ using its training data $D_t$, starting from parameters $\theta_{t-1}$, we obtain updated parameters $\tilde{\theta}_t$. These...
Fig. 1. Illustration of catastrophic forgetting. Fine-tuning moves the model from $\theta_{t-1}$ to $\tilde{\theta}_t$, entering the low-loss region of the new task $t$ (orange...

[4] EXPERIMENTS. Experiments are done in ESPnet2 [17]. More information, including code and detailed results, can be found in our GitHub repository (footnote 1). Data. We consider two CL benchmarks: (Exp. 1) Following [9], we use the Common Voice (CV) [18] English data set, divided into five accents: United States (US), England (ENG), Australia (AUS), India (IND), and Scotla...

[5] RESULTS. 5.1. Experiment 1. As shown by Table 1, our method (IHR) significantly outperforms all baselines, being able to learn with close to zero forgetting (as shown by its -0.1 BWT). In addition, we observe the following: First, compared to FTA, the strongest memory-free baseline, IHR achieves a clear performance gain. The improvement is partly due to sli...

[6] DISCUSSION. In ASR, tasks are often highly similar, and many parameters are simultaneously important for both past and new domains. As shown...
Table 2. Ablation study of our method. "+" indicates modifications are made compared to the line directly above.
Method | Average WER ↓ | BWT ↑
Fine-Tuning | 15.07 | -3.6
+ IHR using $\sum_{i=1}^{t-1} H_i^i$ and $\alpha_p = 0.50$ | 13.54 (a) | -0.6
...

[7] CONCLUSION. We present Inverse Hessian Regularization (IHR), a novel memory-free method for continual learning in ASR. By correcting task-specific updates through a Kronecker-factored, layerwise inverse Hessian, IHR steers adaptation into directions less sensitive for previous tasks, thereby reducing forgetting while maintaining strong adaptability. As...

[8] Michael McCloskey et al., “Catastrophic interference in connectionist networks: The sequential learning problem,” vol. 24 of Psychology of Learning and Motivation, pp. 109–165, 1989.

[9] Steven Vander Eeckt and Hugo Van hamme, “Using adapters to overcome catastrophic forgetting in end-to-end automatic speech recognition,” in ICASSP 2023, 2023.

[10] Martin Sustek, Samik Sadhu, and Hynek Hermansky, “Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition,” in Proc. Interspeech 2022, 2022, pp. 1046–1050.

[11] Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, and Abdelrahman Mohamed, “Continual learning for on-device speech recognition using disentangled conformers,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.

[12] Heng-Jui Chang et al., “Towards Lifelong Learning of End-to-End ASR,” in Proc. Interspeech 2021, 2021, pp. 2551–2555.

[13] Steven Vander Eeckt and Hugo Van hamme, “Continual learning for monolingual end-to-end automatic speech recognition,” in 2022 30th European Signal Processing Conference (EUSIPCO), 2022.

[14] Yuki Takashima et al., “Updating only encoders prevents catastrophic forgetting of end-to-end ASR models,” in Proc. Interspeech 2022, 2022, pp. 2218–2222, ISCA.

[15] Zhihan Wang, Feng Hou, and Ruili Wang, “Clrl-tuning: A novel continual learning approach for automatic speech recognition,” in Interspeech 2023, 2023, pp. 1279–1283.

[16] Steven Vander Eeckt and Hugo Van hamme, “Weight averaging: A simple yet effective method to overcome catastrophic forgetting in automatic speech recognition,” in ICASSP 2023, 2023.

[17] Hippolyt Ritter, Aleksandar Botev, and David Barber, “Online structured laplace approximations for overcoming catastrophic forgetting,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., 2018, vol. 31, Curran Associates, Inc.

[18] James Martens and Roger Grosse, “Optimizing neural networks with kronecker-factored approximate curvature,” in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, 2015, ICML’15, pp. 2408–2417, JMLR.org.

[19] Shigeki Karita et al., “Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration,” in Proc. Interspeech 2019.

[20] Ta-Chu Kao, Kristopher T Jensen, Gido Martijn van de Ven, Alberto Bernacchia, and Guillaume Hennequin, “Natural continual learning: success is a journey, not (just) a destination,” in Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

[21] James Kirkpatrick et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.

[22] Steven Vander Eeckt and Hugo Van hamme, “Continual learning with quasi-newton methods,” IEEE Access, vol. 13, pp. 47485–47499, 2025.

[23] Magdalena Biesialska, Katarzyna Biesialska, and Marta R. Costa-jussà, “Continual lifelong learning in natural language processing: A survey,” in Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), Dec. 2020, pp. 6523–6541, International Committee on Computational Linguistics.

[24] Shinji Watanabe et al., “ESPnet: End-to-end speech processing toolkit,” in Proceedings of Interspeech, 2018, pp. 2207–2211.

[25] R. Ardila et al., “Common voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020.

[26] Steven Vander Eeckt and Hugo Van hamme, “Unsupervised online continual learning for automatic speech recognition,” in Interspeech 2024, 2024, pp. 2845–2849.

[27] Vassil Panayotov et al., “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.

[28] Akhil Mathur et al., “Libri-adapt: a new speech dataset for unsupervised domain adaptation,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7439–7443, 2020.

[29] Anmol Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech 2020, 2020, pp. 5036–5040.

[30] Ashish Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, vol. 30.

[31] Taku Kudo and John Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71.

[32] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun, Eds., 2015.

[33] David Rolnick et al., “Experience replay for continual learning,” in Advances in Neural Information Processing Systems, 2019, vol. 32.

[34] Helmer Strik, Catia Cucchiarini, and Judith M. Kessens, “Comparing the recognition performance of csrs: in search of an adequate metric and statistical significance test,” in INTERSPEECH, 2000.

[35] Matthias Delange et al., “A continual learning survey: Defying forgetting in classification tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.

[36] Rahaf Aljundi et al., “Memory aware synapses: Learning what (not) to forget,” in Computer Vision – ECCV 2018, 2018, pp. 144–161, Springer International Publishing.

[1] [1]

To be accurate and inclusive, they must adapt to new speakers, accents, domains, or recording conditions

INTRODUCTION Automatic speech recognition (ASR) systems are widely deployed in everyday applications, from voice assistants to transcription ser- vices. To be accurate and inclusive, they must adapt to new speakers, accents, domains, or recording conditions. Yet such adaptation often causescatastrophic forgetting[1], where performance on previously learne...

work page

[2] [2]

ASR Model We consider an encoder–decoder ASR model

CONTINUAL LEARNING FOR ASR 2.1. ASR Model We consider an encoder–decoder ASR model. Given an utterance X∈R l×di oflacoustic frames of dimensiond i, the model predicts a sequence ˆyof˜wword pieces. Parameters are denoted byθ∈ RN . The model is trained (and predicts) in a hybrid fashion [12], combining a CTC loss and a decoder cross-entropy loss with weight...

work page

[3] [3]

Inverse-Hessian Regularization for Continual Learning in ASR

INVERSE HESSIAN REGULARIZATION After fine-tuning the model on tasktusing its training dataD t, start- ing from parametersθ t−1, we obtain updated parameters ˜θt. These arXiv:2601.14751v1 [eess.AS] 21 Jan 2026 Fig. 1. Illustration of catastrophic forgetting. Fine-tuning moves the model fromθ t−1 to ˜θt, entering the low-loss region of the new taskt (orange...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

More information, including code and detailed results, can be found in our Github repository 1

EXPERIMENTS Experiments are done in ESPnet2 [17]. More information, including code and detailed results, can be found in our Github repository 1. Data.We consider two CL benchmarks: (Exp. 1) Following [9], we use Common V oice (CV) [18] English data set, divided into five accents: United States (US), England (ENG), Australia (AUS), India (IND), and Scotla...

work page 2048

[5] [5]

Experiment 1 As shown by Table 1, our method (IHR) significantly outperforms all baselines, being able to learn with close to zero forgetting (as shown by its -0.1 BWT)

RESULTS 5.1. Experiment 1 As shown by Table 1, our method (IHR) significantly outperforms all baselines, being able to learn with close to zero forgetting (as shown by its -0.1 BWT). In addition, we observe the following: First, compared to FTA, the strongest memory-free baseline, IHR achieves a clear performance gain. The improvement is partly due to sli...

work page

[6] [6]

As shown Table 2

DISCUSSION In ASR, tasks are often highly similar, and many parameters are si- multaneously important for both past and new domains. As shown Table 2. Ablation study of our method. ”+” indicates modifica- tions are made compared to the line directly above. Method Average WER↓BWT↑ Fine-Tuning 15.07 -3.6 + IHR using Pt−1 i=1 H i i andα p = 0.5013.54 a -0.6 ...

work page

[7] [7]

CONCLUSION We present Inverse Hessian Regularization (IHR), a novel memory- free method for continual learning in ASR. By correcting task- specific updates through a Kronecker-factored, layerwise inverse Hessian, IHR steers adaptation into directions less sensitive for pre- vious tasks, thereby reducing forgetting while maintaining strong adaptability. As...

work page

[8] [8]

Catastrophic interference in con- nectionist networks: The sequential learning problem,

Michael McCloskey et al., “Catastrophic interference in con- nectionist networks: The sequential learning problem,” vol. 24 ofPsychology of Learning and Motivation, pp. 109–165. 1989

work page 1989

[9] [9]

Using adapters to overcome catastrophic forgetting in end-to-end automatic speech recognition,

Steven Vander Eeckt and Hugo Van hamme, “Using adapters to overcome catastrophic forgetting in end-to-end automatic speech recognition,” inICASSP 2023, 2023

work page 2023

[10] [10]

Dealing with Unknowns in Continual Learning for End-to-end Auto- matic Speech Recognition,

Martin Sustek, Samik Sadhu, and Hynek Hermansky, “Dealing with Unknowns in Continual Learning for End-to-end Auto- matic Speech Recognition,” inProc. Interspeech 2022, 2022, pp. 1046–1050

work page 2022

[11] [11]

Continual learning for on-device speech recog- nition using disentangled conformers,

Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, and Abdelrahman Mohamed, “Continual learning for on-device speech recog- nition using disentangled conformers,” inICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Sig- nal Processing (ICASSP), 2023, pp. 1–5

work page 2023

[12] [12]

Towards Lifelong Learning of End-to- End ASR,

Heng-Jui Chang et al., “Towards Lifelong Learning of End-to- End ASR,” inProc. Interspeech 2021, 2021, pp. 2551–2555

work page 2021

[13] [13]

Continual learn- ing for monolingual end-to-end automatic speech recognition,

Steven Vander Eeckt and Hugo Van hamme, “Continual learn- ing for monolingual end-to-end automatic speech recognition,” in2022 30th European Signal Processing Conference (EU- SIPCO), 2022

work page 2022

[14] [14]

Updating only encoders prevents catas- trophic forgetting of end-to-end ASR models,

Yuki Takashima et al., “Updating only encoders prevents catas- trophic forgetting of end-to-end ASR models,” inProc. Inter- speech 2022. 2022, pp. 2218–2222, ISCA

work page 2022

[15] [15]

Clrl-tuning: A novel continual learning approach for automatic speech recog- nition,

Zhihan Wang, Feng Hou, and Ruili Wang, “Clrl-tuning: A novel continual learning approach for automatic speech recog- nition,” inInterspeech 2023, 2023, pp. 1279–1283

work page 2023

[16] [16]

Weight averag- ing: A simple yet effective method to overcome catastrophic forgetting in automatic speech recognition,

Steven Vander Eeckt and Hugo Van hamme, “Weight averag- ing: A simple yet effective method to overcome catastrophic forgetting in automatic speech recognition,” inICASSP 2023, 2023

work page 2023

[17] [17]

Online structured laplace approximations for overcoming catastrophic forgetting,

Hippolyt Ritter, Aleksandar Botev, and David Barber, “Online structured laplace approximations for overcoming catastrophic forgetting,” inAdvances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. 2018, vol. 31, Curran Associates, Inc

work page 2018

[18] [18]

Optimizing neural net- works with kronecker-factored approximate curvature,

James Martens and Roger Grosse, “Optimizing neural net- works with kronecker-factored approximate curvature,” in Proceedings of the 32nd International Conference on Interna- tional Conference on Machine Learning - Volume 37. 2015, ICML’15, p. 2408–2417, JMLR.org

work page 2015

[19] Shigeki Karita et al., “Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration,” in Proc. Interspeech 2019.

[20] Ta-Chu Kao, Kristopher T Jensen, Gido Martijn van de Ven, Alberto Bernacchia, and Guillaume Hennequin, “Natural continual learning: success is a journey, not (just) a destination,” in Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

[21] James Kirkpatrick et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.

[22] Steven Vander Eeckt and Hugo Van hamme, “Continual learning with quasi-Newton methods,” IEEE Access, vol. 13, pp. 47485–47499, 2025.

[23] Magdalena Biesialska, Katarzyna Biesialska, and Marta R. Costa-jussà, “Continual lifelong learning in natural language processing: A survey,” in Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), Dec. 2020, pp. 6523–6541, International Committee on Computational Linguistics.

[24] Shinji Watanabe et al., “ESPnet: End-to-end speech processing toolkit,” in Proceedings of Interspeech, 2018, pp. 2207–2211.

[25] R. Ardila et al., “Common voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020.

[26] Steven Vander Eeckt and Hugo Van hamme, “Unsupervised online continual learning for automatic speech recognition,” in Interspeech 2024, 2024, pp. 2845–2849.

[27] Vassil Panayotov et al., “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.

[28] Akhil Mathur et al., “Libri-adapt: a new speech dataset for unsupervised domain adaptation,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7439–7443, 2020.

[29] Anmol Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech 2020, 2020, pp. 5036–5040.

[30] Ashish Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, vol. 30.

[31] Taku Kudo and John Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71.

[32] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun, Eds., 2015.

[33] David Rolnick et al., “Experience replay for continual learning,” in Advances in Neural Information Processing Systems, 2019, vol. 32.

[34] Helmer Strik, Catia Cucchiarini, and Judith M. Kessens, “Comparing the recognition performance of CSRs: in search of an adequate metric and statistical significance test,” in INTERSPEECH, 2000.

[35] Matthias Delange et al., “A continual learning survey: Defying forgetting in classification tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.

[36] Rahaf Aljundi et al., “Memory aware synapses: Learning what (not) to forget,” in Computer Vision – ECCV 2018, 2018, pp. 144–161, Springer International Publishing.