arxiv: 2601.14751 · v1 · pith:NVDKCNLInew · submitted 2026-01-21 · 📡 eess.AS

Inverse-Hessian Regularization for Continual Learning in ASR

Pith reviewed 2026-05-16 12:37 UTC · model grok-4.3

classification 📡 eess.AS
keywords continual learning · automatic speech recognition · inverse Hessian regularization · catastrophic forgetting · model merging · Kronecker factorization · loss landscape curvature

The pith

Inverse-Hessian Regularization adjusts post-fine-tuning ASR updates with prior-task curvature to limit forgetting while preserving adaptability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Inverse-Hessian Regularization as a memory-free continual learning method for automatic speech recognition. After fine-tuning a model on new data, it modifies the update by applying a Kronecker-factored inverse-Hessian approximation drawn from the previous task. This steers parameter changes toward directions that least damage earlier performance. The technique improves on simple weight-averaging approaches by explicitly using loss-landscape curvature rather than treating all directions equally. Experiments on two standard benchmarks show reduced forgetting together with better adaptation to new domains.

Core claim

After fine-tuning on a new task, the adaptation is adjusted through a Kronecker-factored inverse Hessian approximation of the previous task, ensuring that the model moves primarily in directions less harmful to past performance, while keeping the method lightweight.

What carries the argument

Kronecker-factored inverse-Hessian approximation applied in the merging step to incorporate curvature information from the prior task.
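
The adjustment described above can be sketched for a single linear layer. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `kfac_inverse_adjust`, the `damping` term, and the final rescaling to the original step norm are all hypothetical choices made here so the example is well-posed. It relies only on the standard Kronecker identity (A ⊗ G)^{-1} vec(ΔW) = vec(G^{-1} ΔW A^{-1}) for symmetric factors.

```python
import numpy as np

def kfac_inverse_adjust(delta_w, A, G, damping=1e-3):
    """Precondition a layer update with a damped Kronecker-factored inverse.

    delta_w : (out_dim, in_dim) update, fine-tuned weights minus previous weights
    A       : (in_dim, in_dim) symmetric input-side factor from the previous task
    G       : (out_dim, out_dim) symmetric output-side factor from the previous task
    damping : hypothetical ridge term (an assumption) that keeps both factors invertible
    """
    A_damped = A + damping * np.eye(A.shape[0])
    G_damped = G + damping * np.eye(G.shape[0])
    # G^{-1} @ delta_w, computed with a linear solve instead of an explicit inverse
    left = np.linalg.solve(G_damped, delta_w)
    # (G^{-1} @ delta_w) @ A^{-1}, using the symmetry of A
    adjusted = np.linalg.solve(A_damped, left.T).T
    # Assumption: rescale so the adjusted step keeps the original Frobenius norm
    norm = np.linalg.norm(adjusted)
    if norm > 0:
        adjusted = adjusted * (np.linalg.norm(delta_w) / norm)
    return adjusted
```

Directions in which the previous task's factors have large eigenvalues (high curvature) are shrunk relative to flat directions, which is the sense in which the adjusted step is "less harmful" to earlier performance. How the paper scales or combines the adjusted step is not stated in this summary.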

If this is right

  • ASR models can be updated sequentially across domains with smaller drops on previously learned conditions.
  • Memory-free continual learning becomes competitive with methods that store past data or gradients.
  • Weight-averaging merges can be strengthened by folding in second-order information without added storage cost.
  • The same adjustment step can be applied after each new fine-tuning round as the number of tasks grows.
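
For contrast, the plain weight-averaging merge that the third bullet proposes to strengthen can be sketched in a few lines. This is a minimal illustration, assuming a single interpolation coefficient `alpha` and parameters stored as a dict of NumPy arrays; the function name and the coefficient are hypothetical and not taken from the paper.

```python
import numpy as np

def average_merge(prev_params, finetuned_params, alpha=0.5):
    """Interpolate every parameter tensor between the previous and fine-tuned model.

    alpha = 0 keeps the previous model, alpha = 1 keeps the fine-tuned model.
    """
    return {
        name: (1.0 - alpha) * prev_params[name] + alpha * finetuned_params[name]
        for name in prev_params
    }
```

Every direction in parameter space is scaled by the same factor here, which is the behavior a curvature-aware merge is meant to replace.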

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The curvature-guided merge could be tested on other sequence-to-sequence tasks such as machine translation or text-to-speech.
  • Replacing the Kronecker factorization with a cheaper diagonal approximation might trade some accuracy for speed on very large models.
  • Combining the regularization with selective replay of a small number of past utterances could compound the forgetting reduction.

Load-bearing premise

The Kronecker-factored inverse-Hessian approximation sufficiently captures the loss-landscape curvature of earlier ASR tasks to steer updates safely.

What would settle it

Re-running the reported benchmarks with IHR and finding equal or higher forgetting rates than plain weight averaging would disprove the claimed benefit.
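
To make such a comparison concrete, forgetting can be summarized with a backward-transfer (BWT) score computed from a matrix of word error rates. The sketch below assumes a sign convention in which negative values mean forgetting, matching the way the paper's excerpts describe a BWT of -0.1 as close to zero forgetting; the exact definition used in the paper is not given here, and the function name and matrix layout are hypothetical.

```python
import numpy as np

def backward_transfer(wer):
    """Average change in WER on earlier tasks after the final task is learned.

    wer[i, j] is the WER on task j measured after training on tasks 0..i.
    Only the diagonal and the last row are read.
    Assumed convention: negative result = WER on old tasks got worse (forgetting).
    """
    wer = np.asarray(wer, dtype=float)
    num_tasks = wer.shape[0]
    if num_tasks < 2:
        raise ValueError("backward transfer needs at least two tasks")
    diffs = [wer[j, j] - wer[num_tasks - 1, j] for j in range(num_tasks - 1)]
    return float(np.mean(diffs))
```

Under this convention, the claim would be undermined if the score for IHR were equal to or lower than the score for plain weight averaging on the same task sequence.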

read the original abstract

Catastrophic forgetting remains a major challenge for continual learning (CL) in automatic speech recognition (ASR), where models must adapt to new domains without losing performance on previously learned conditions. Several CL methods have been proposed for ASR, and, recently, weight averaging - where models are averaged in a merging step after fine-tuning - has proven effective as a simple memory-free strategy. However, it is heuristic in nature and ignores the underlying loss landscapes of the tasks, hindering adaptability. In this work, we propose Inverse Hessian Regularization (IHR), a memory-free approach for CL in ASR that incorporates curvature information into the merging step. After fine-tuning on a new task, the adaptation is adjusted through a Kronecker-factored inverse Hessian approximation of the previous task, ensuring that the model moves primarily in directions less harmful to past performance, while keeping the method lightweight. We evaluate IHR on two CL benchmarks and show that it significantly outperforms state-of-the-art baselines, reducing forgetting while improving adaptability. Ablation studies and analyses further confirm its effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Inverse-Hessian Regularization (IHR), a memory-free continual learning method for ASR. After fine-tuning on a new task, the model update is adjusted via a Kronecker-factored inverse-Hessian approximation of the prior task's loss landscape so that the adaptation moves primarily along directions that increase prior-task loss least. The method is evaluated on two CL benchmarks and reported to significantly outperform weight-averaging and other baselines while remaining lightweight.

Significance. If the central claim holds, IHR would supply a principled, curvature-aware alternative to heuristic merging in ASR continual learning, potentially improving the forgetting-adaptability trade-off without storing data or full Hessians.

major comments (2)
  1. [§3.2] §3.2, Eq. (3)–(5): The K-FAC factorization (A ⊗ G per layer, inverted independently) discards inter-layer covariances that are pronounced in ASR models containing shared embeddings, attention, and Conformer blocks. No error bound, comparison to a more accurate curvature estimator, or ablation quantifying the resulting mismatch in “safe direction” identification is provided, leaving the guarantee that updates are “primarily in directions less harmful” dependent on an unverified approximation.
  2. [§5.1] §5.1–5.3: The reported outperformance on the two benchmarks lacks error bars, statistical significance tests, exact baseline re-implementation details, and hyper-parameter sensitivity analysis. Without these, the magnitude of improvement cannot be assessed as robust support for the central claim.
minor comments (2)
  1. [Abstract] Abstract: the two benchmarks are not named; adding their identities would improve immediate readability.
  2. [§3] Notation: the symbol for the inverse-Hessian approximation is introduced without an explicit definition equation; a single numbered equation would remove ambiguity.
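
The approximation questioned in major comment 1 can be written out explicitly. The block below uses the standard layer-wise K-FAC form; the symbols $L$ (number of layers), $A_\ell$, $G_\ell$, and $H_{\ell k}$ are notation assumed here for illustration and are not copied from the paper's Eq. (3)–(5).

```latex
H \;\approx\; \operatorname{blockdiag}\bigl(A_1 \otimes G_1,\; \dots,\; A_L \otimes G_L\bigr),
\qquad
H_{\ell k} := 0 \quad \text{for } \ell \neq k .
```

Setting the cross-layer blocks $H_{\ell k}$ to zero is what allows each layer to be inverted independently, and it is also the step for which the referee asks for an error bound or ablation.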

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the planned revisions.

Point-by-point responses
  1. Referee: [§3.2] §3.2, Eq. (3)–(5): The K-FAC factorization (A ⊗ G per layer, inverted independently) discards inter-layer covariances that are pronounced in ASR models containing shared embeddings, attention, and Conformer blocks. No error bound, comparison to a more accurate curvature estimator, or ablation quantifying the resulting mismatch in “safe direction” identification is provided, leaving the guarantee that updates are “primarily in directions less harmful” dependent on an unverified approximation.

    Authors: We acknowledge that layer-wise K-FAC is an approximation that neglects inter-layer covariances, a known limitation when applied to architectures with shared parameters such as Conformer-based ASR models. This choice is driven by the need for a memory-efficient, scalable curvature estimate; full Hessian or block-diagonal alternatives would be prohibitive. In the revision we will (i) add an explicit discussion of this approximation and its relation to prior K-FAC usage in continual learning, and (ii) include an ablation that replaces the Kronecker factors with a diagonal inverse-Hessian baseline to quantify the practical benefit of the factorization on the two benchmarks. revision: partial

  2. Referee: [§5.1] §5.1–5.3: The reported outperformance on the two benchmarks lacks error bars, statistical significance tests, exact baseline re-implementation details, and hyper-parameter sensitivity analysis. Without these, the magnitude of improvement cannot be assessed as robust support for the central claim.

    Authors: We agree that stronger statistical reporting is required. In the revised manuscript we will re-run all experiments with at least five random seeds, report mean and standard deviation, perform paired statistical significance tests against the strongest baselines, supply exact hyper-parameter values and re-implementation notes for every baseline, and add a dedicated sensitivity analysis subsection for the regularization coefficient and merging step size. revision: yes
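
The reporting promised in this response could look like the sketch below. It is a minimal illustration, assuming per-seed WER values are available as arrays and that a paired t-statistic is an acceptable test; the paper does not say which significance test will be used, and both function names are hypothetical.

```python
import numpy as np

def summarize_seeds(wers):
    """Return (mean, sample standard deviation) of WER across random seeds."""
    wers = np.asarray(wers, dtype=float)
    return float(wers.mean()), float(wers.std(ddof=1))

def paired_t_statistic(method_wers, baseline_wers):
    """Paired t-statistic and degrees of freedom for per-seed WER differences.

    Both inputs must list results for the same seeds in the same order.
    """
    diffs = np.asarray(method_wers, dtype=float) - np.asarray(baseline_wers, dtype=float)
    n = diffs.size
    if n < 2:
        raise ValueError("need at least two paired runs")
    sd = diffs.std(ddof=1)
    if sd == 0:
        raise ValueError("differences have zero variance; t-statistic is undefined")
    return float(diffs.mean() / (sd / np.sqrt(n))), n - 1
```

With five seeds the statistic has four degrees of freedom; a negative value indicates that the method's WER is lower than the baseline's on average.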

Circularity Check

0 steps flagged

No significant circularity; IHR applies standard K-FAC to merging without reducing to self-inputs

Full rationale

The paper defines IHR as post-fine-tuning adjustment of the adaptation step using a Kronecker-factored inverse-Hessian approximation drawn from the previous task's loss landscape. This is a direct, non-circular extension of established curvature approximations (K-FAC) into the weight-averaging merge; no equation equates a fitted parameter to its own prediction, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled. Empirical results on two benchmarks plus ablations provide independent verification. The derivation chain remains self-contained against external curvature estimators.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that a Kronecker-factored inverse Hessian provides a useful curvature signal for ASR loss landscapes; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Kronecker-factored approximation accurately represents task curvature for safe model merging in ASR
    Invoked to justify the lightweight adjustment step after fine-tuning.




Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

  1. [1]

    To be accurate and inclusive, they must adapt to new speakers, accents, domains, or recording conditions

    INTRODUCTION Automatic speech recognition (ASR) systems are widely deployed in everyday applications, from voice assistants to transcription services. To be accurate and inclusive, they must adapt to new speakers, accents, domains, or recording conditions. Yet such adaptation often causes catastrophic forgetting [1], where performance on previously learne...

  2. [2]

    ASR Model We consider an encoder–decoder ASR model

    CONTINUAL LEARNING FOR ASR 2.1. ASR Model We consider an encoder–decoder ASR model. Given an utterance $X \in \mathbb{R}^{l \times d_i}$ of $l$ acoustic frames of dimension $d_i$, the model predicts a sequence $\hat{y}$ of $\tilde{w}$ word pieces. Parameters are denoted by $\theta \in \mathbb{R}^N$. The model is trained (and predicts) in a hybrid fashion [12], combining a CTC loss and a decoder cross-entropy loss with weight...

  3. [3]

    Inverse-Hessian Regularization for Continual Learning in ASR

    INVERSE HESSIAN REGULARIZATION After fine-tuning the model on task $t$ using its training data $D_t$, starting from parameters $\theta_{t-1}$, we obtain updated parameters $\tilde{\theta}_t$. These Fig. 1. Illustration of catastrophic forgetting. Fine-tuning moves the model from $\theta_{t-1}$ to $\tilde{\theta}_t$, entering the low-loss region of the new task $t$ (orange...

  4. [4]

    More information, including code and detailed results, can be found in our Github repository 1

    EXPERIMENTS Experiments are done in ESPnet2 [17]. More information, including code and detailed results, can be found in our Github repository 1. Data. We consider two CL benchmarks: (Exp. 1) Following [9], we use Common Voice (CV) [18] English data set, divided into five accents: United States (US), England (ENG), Australia (AUS), India (IND), and Scotla...

  5. [5]

    Experiment 1 As shown by Table 1, our method (IHR) significantly outperforms all baselines, being able to learn with close to zero forgetting (as shown by its -0.1 BWT)

    RESULTS 5.1. Experiment 1 As shown by Table 1, our method (IHR) significantly outperforms all baselines, being able to learn with close to zero forgetting (as shown by its -0.1 BWT). In addition, we observe the following: First, compared to FTA, the strongest memory-free baseline, IHR achieves a clear performance gain. The improvement is partly due to sli...

  6. [6]

    As shown Table 2

    DISCUSSION In ASR, tasks are often highly similar, and many parameters are simultaneously important for both past and new domains. As shown Table 2. Ablation study of our method. “+” indicates modifications are made compared to the line directly above. Method | Average WER↓ | BWT↑. Fine-Tuning | 15.07 | -3.6. + IHR using $\sum_{i=1}^{t-1} H_i^i$ and $\alpha_p = 0.50$ | 13.54 a | -0.6 ...

  7. [7]

    CONCLUSION We present Inverse Hessian Regularization (IHR), a novel memory-free method for continual learning in ASR. By correcting task-specific updates through a Kronecker-factored, layerwise inverse Hessian, IHR steers adaptation into directions less sensitive for previous tasks, thereby reducing forgetting while maintaining strong adaptability. As...

  8. [8]

    Catastrophic interference in connectionist networks: The sequential learning problem,

    Michael McCloskey et al., “Catastrophic interference in connectionist networks: The sequential learning problem,” vol. 24 of Psychology of Learning and Motivation, pp. 109–165. 1989

  9. [9]

    Using adapters to overcome catastrophic forgetting in end-to-end automatic speech recognition,

    Steven Vander Eeckt and Hugo Van hamme, “Using adapters to overcome catastrophic forgetting in end-to-end automatic speech recognition,” in ICASSP 2023, 2023

  10. [10]

    Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition,

    Martin Sustek, Samik Sadhu, and Hynek Hermansky, “Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition,” in Proc. Interspeech 2022, 2022, pp. 1046–1050

  11. [11]

    Continual learning for on-device speech recognition using disentangled conformers,

    Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, and Abdelrahman Mohamed, “Continual learning for on-device speech recognition using disentangled conformers,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

  12. [12]

    Towards Lifelong Learning of End-to-End ASR,

    Heng-Jui Chang et al., “Towards Lifelong Learning of End-to-End ASR,” in Proc. Interspeech 2021, 2021, pp. 2551–2555

  13. [13]

    Continual learning for monolingual end-to-end automatic speech recognition,

    Steven Vander Eeckt and Hugo Van hamme, “Continual learning for monolingual end-to-end automatic speech recognition,” in 2022 30th European Signal Processing Conference (EUSIPCO), 2022

  14. [14]

    Updating only encoders prevents catastrophic forgetting of end-to-end ASR models,

    Yuki Takashima et al., “Updating only encoders prevents catastrophic forgetting of end-to-end ASR models,” in Proc. Interspeech 2022. 2022, pp. 2218–2222, ISCA

  15. [15]

    Clrl-tuning: A novel continual learning approach for automatic speech recognition,

    Zhihan Wang, Feng Hou, and Ruili Wang, “Clrl-tuning: A novel continual learning approach for automatic speech recognition,” in Interspeech 2023, 2023, pp. 1279–1283

  16. [16]

    Weight averaging: A simple yet effective method to overcome catastrophic forgetting in automatic speech recognition,

    Steven Vander Eeckt and Hugo Van hamme, “Weight averaging: A simple yet effective method to overcome catastrophic forgetting in automatic speech recognition,” in ICASSP 2023, 2023

  17. [17]

    Online structured laplace approximations for overcoming catastrophic forgetting,

    Hippolyt Ritter, Aleksandar Botev, and David Barber, “Online structured laplace approximations for overcoming catastrophic forgetting,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. 2018, vol. 31, Curran Associates, Inc

  18. [18]

    Optimizing neural networks with kronecker-factored approximate curvature,

    James Martens and Roger Grosse, “Optimizing neural networks with kronecker-factored approximate curvature,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37. 2015, ICML’15, p. 2408–2417, JMLR.org

  19. [19]

    Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration,

    Shigeki Karita et al., “Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration,” in Proc. Interspeech 2019

  20. [20]

    Natural continual learning: success is a journey, not (just) a destination,

    Ta-Chu Kao, Kristopher T Jensen, Gido Martijn van de Ven, Alberto Bernacchia, and Guillaume Hennequin, “Natural continual learning: success is a journey, not (just) a destination,” in Thirty-Fifth Conference on Neural Information Processing Systems, 2021

  21. [21]

    Overcoming catastrophic forgetting in neural networks,

    James Kirkpatrick et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017

  22. [22]

    Continual learning with quasi-newton methods,

    Steven Vander Eeckt and Hugo Van Hamme, “Continual learning with quasi-newton methods,” IEEE Access, vol. 13, pp. 47485–47499, 2025

  23. [23]

    Continual lifelong learning in natural language processing: A survey,

    Magdalena Biesialska, Katarzyna Biesialska, and Marta R. Costa-jussà, “Continual lifelong learning in natural language processing: A survey,” in Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), Dec. 2020, pp. 6523–6541, International Committee on Computational Linguistics

  24. [24]

    ESPnet: End-to-end speech processing toolkit,

    Shinji Watanabe et al., “ESPnet: End-to-end speech processing toolkit,” in Proceedings of Interspeech, 2018, pp. 2207–2211

  25. [25]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila et al., “Common voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020

  26. [26]

    Unsupervised online continual learning for automatic speech recognition,

    Steven Vander Eeckt and Hugo Van hamme, “Unsupervised online continual learning for automatic speech recognition,” in Interspeech 2024, 2024, pp. 2845–2849

  27. [27]

    Librispeech: An asr corpus based on public domain audio books,

    Vassil Panayotov et al., “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015

  28. [28]

    Libri-adapt: a new speech dataset for unsupervised domain adaptation,

    Akhil Mathur et al., “Libri-adapt: a new speech dataset for unsupervised domain adaptation,” 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7439–7443, 2020

  29. [29]

    Conformer: Convolution-augmented Transformer for Speech Recognition,

    Anmol Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech 2020, 2020, pp. 5036–5040

  30. [30]

    Attention is all you need,

    Ashish Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, vol. 30

  31. [31]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,

    Taku Kudo and John Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71

  32. [32]

    Adam: A method for stochastic optimization,

    Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun, Eds., 2015

  33. [33]

    Experience replay for continual learning,

    David Rolnick et al., “Experience replay for continual learning,” in Advances in Neural Information Processing Systems, 2019, vol. 32

  34. [34]

    Comparing the recognition performance of csrs: in search of an adequate metric and statistical significance test,

    Helmer Strik, Catia Cucchiarini, and Judith M. Kessens, “Comparing the recognition performance of csrs: in search of an adequate metric and statistical significance test,” in INTERSPEECH, 2000

  35. [35]

    A continual learning survey: Defying forgetting in classification tasks,

    Matthias Delange et al., “A continual learning survey: Defying forgetting in classification tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, p. 1–1, 2021

  36. [36]

    Memory aware synapses: Learning what (not) to forget,

    Rahaf Aljundi et al., “Memory aware synapses: Learning what (not) to forget,” in Computer Vision – ECCV 2018. 2018, pp. 144–161, Springer International Publishing