Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair
Pith reviewed 2026-05-08 12:19 UTC · model grok-4.3
The pith
Composing gradient modifications with Adam inflates the effective learning rate on old directions in continual learning, leading to collapse unless the modification is routed only to the first moment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that upstream gradient modification under Adam induces a 1/(1-alpha) inflation of the old-direction effective learning rate via the second-moment pathway, matching measurements within 8 percent across eight alpha values. In a high-overlap, non-adaptive 8-domain continual LM, shared-routing projection baselines reach 12.5-12.8 forgetting versus vanilla's 13.2, fixed-strength decoupling worsens to 14.1, and replay sits at 11.6. Adaptive decoupled moment routing achieves 9.4 forgetting. The repair routes the modified gradient exclusively to the first moment while preserving magnitude-faithful second-moment statistics with overlap-aware adaptive strength, and it is the only tested configuration that consistently avoids collapse across methods, optimizers, and scale.
What carries the argument
Adaptive decoupled moment routing: applying the gradient modification only to Adam's first moment while preserving faithful second-moment statistics and using overlap-aware adaptive strength to control the change.
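Mechanically, the repair amounts to changing which gradient feeds each Adam moment. A minimal sketch, assuming a projection-style modification; the function names and the overlap measure are illustrative, since the paper's exact adaptive-strength rule is not given in this text:

```python
import numpy as np

def decoupled_adam_step(theta, g_raw, g_mod, m, v, t,
                        lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with decoupled moment routing: the modified gradient
    (e.g. projected) drives the direction via the first moment, while the
    raw gradient keeps the second-moment statistics magnitude-faithful."""
    m = beta1 * m + (1 - beta1) * g_mod       # first moment: modified gradient only
    v = beta2 * v + (1 - beta2) * g_raw ** 2  # second moment: raw gradient only
    m_hat = m / (1 - beta1 ** t)              # standard bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def overlap_strength(g_raw, old_basis):
    """Placeholder overlap-aware strength: fraction of gradient energy in the
    old-task subspace (an assumption, not the paper's stated rule)."""
    proj = old_basis @ (old_basis.T @ g_raw)
    return float(proj @ proj) / (float(g_raw @ g_raw) + 1e-12)
```

A caller would form `g_mod` by attenuating the old-direction component of `g_raw` in proportion to `overlap_strength(...)`; the load-bearing property is that `v` never sees the modified gradient.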
If this is right
- All shared-routing projection baselines collapse close to vanilla forgetting levels in high-overlap non-adaptive 8-domain continual LM.
- Adaptive decoupled routing improves over vanilla by 3.8 units on 8-domain streams and by 4.5-4.8 units on 16-domain streams.
- The same second-moment conflict occurs with penalty methods, replay mixing, and at 7B scale under LoRA.
- The failure mode is largely invisible on clean benchmarks.
Where Pith is reading between the lines
- Re-testing existing continual learning methods that rely on gradient modifications under Adam could uncover similar performance gaps hidden on standard benchmarks.
- The decoupling approach may extend to other adaptive optimizers that maintain second-moment estimates.
- Varying task overlap levels in new experiments would clarify when the adaptive strength component provides the largest benefit.
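The second of these bullets can be made concrete: any optimizer that normalizes by a second-moment EMA exposes the same routing choice. A hedged RMSprop sketch (the transfer is a conjecture from this text, since only Adam is reported as tested):

```python
import numpy as np

def decoupled_rmsprop_step(theta, g_raw, g_mod, v, lr=1e-3, rho=0.99, eps=1e-8):
    """RMSprop with decoupled routing: the modified gradient sets the step
    direction, the raw gradient feeds the second-moment EMA, so projection
    cannot shrink the normalizer and inflate old-direction steps."""
    v = rho * v + (1 - rho) * g_raw ** 2   # magnitude-faithful second moment
    theta = theta - lr * g_mod / (np.sqrt(v) + eps)
    return theta, v
```

Shared routing would instead feed `g_mod` into `v` as well, reproducing the same second-moment starvation on old directions.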
Load-bearing premise
The 1/(1-α) inflation of the old-direction effective learning rate is the dominant mechanism, and the adaptive strength generalizes beyond the tested high-overlap 8- and 16-domain streams.
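The stated premise can be illustrated with a toy computation: along a single old-task direction with stationary gradient noise, projecting away a fraction α of the component scales the second-moment estimate by (1-α)², so Adam's per-unit-gradient step size, which goes as 1/√v, grows by 1/(1-α). In this stylized setting the ratio holds by construction; the paper's 8 percent tolerance presumably reflects nonstationary training. This is an illustrative sketch, not the paper's measurement protocol:

```python
import numpy as np

def effective_lr_inflation(alpha, steps=5000, beta2=0.999, eps=1e-12, seed=0):
    """Ratio of Adam's effective learning rate (proportional to 1/sqrt(v))
    on an old direction whose gradient is scaled by (1 - alpha) under
    projection, versus without projection. Prediction: 1 / (1 - alpha)."""
    rng = np.random.default_rng(seed)
    v_plain, v_proj = 0.0, 0.0
    for g in rng.normal(0.0, 1.0, steps):    # stationary old-direction gradient
        v_plain = beta2 * v_plain + (1 - beta2) * g ** 2
        v_proj = beta2 * v_proj + (1 - beta2) * ((1 - alpha) * g) ** 2
    return float(np.sqrt(v_plain / (v_proj + eps)))

for a in (0.25, 0.5, 0.875):
    print(a, effective_lr_inflation(a), 1.0 / (1.0 - a))
```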
What would settle it
An experiment in a new high-overlap continual learning setup where measured effective learning rate inflation under projection deviates by more than 8 percent from the 1/(1-alpha) prediction, or where adaptive decoupled routing fails to reduce forgetting below the strongest shared-routing baseline.
Original abstract
Many continual-learning methods modify gradients upstream (e.g., projection, penalty rescaling, replay mixing) while treating Adam as a neutral backend. We show this composition has a hidden failure mode. In a high-overlap, non-adaptive 8-domain continual LM, all shared-routing projection baselines collapse close to vanilla forgetting (12.5--12.8 vs. 13.2). A 0.5% replay buffer is the strongest shared alternative but still reaches 11.6, while fixed-strength decoupling falls below vanilla at 14.1. Only adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5--4.8 units. The failure is largely invisible on clean benchmarks. We explain this effect through Adam's second-moment pathway: in the tested regime, projection induces a 1/(1-alpha) inflation of the old-direction effective learning rate, matching measurements within 8% across eight alpha values. The same conflict appears with penalty methods, replay mixing, and at 7B scale under LoRA. Our fix routes the modified gradient only to the first moment while preserving magnitude-faithful second-moment statistics, with overlap-aware adaptive strength. This simple change is the only tested configuration that consistently avoids collapse across methods, optimizers, and scale.
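The second-moment pathway described above admits a back-of-envelope reconstruction (under stationarity assumptions of our own; this is not the paper's derivation):

```latex
% Adam's moment updates (standard):
m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2, \qquad
\Delta\theta_t = -\eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}.

% Along an old-task direction u, assume the raw gradient component is
% stationary with second moment E[(u^\top g_t)^2] = \sigma^2.
% Projection scales that component by (1-\alpha), so in steady state
v_t^{(u)} \;\approx\; (1-\alpha)^2\,\sigma^2 .

% The per-unit-gradient effective learning rate along u scales as
% \eta/\sqrt{v_t^{(u)}}, giving the claimed inflation:
\frac{\eta/\sqrt{(1-\alpha)^2\sigma^2}}{\eta/\sqrt{\sigma^2}} \;=\; \frac{1}{1-\alpha}.
```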
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that composing common gradient modifications (projection, penalty rescaling, replay mixing) with the Adam optimizer induces a hidden failure mode in continual learning: projection inflates the effective learning rate on old directions by a factor of 1/(1-α) through Adam's second-moment pathway, causing shared-routing baselines to collapse to near-vanilla forgetting levels (12.5–12.8 vs. 13.2) in high-overlap 8-domain LM streams. Only the proposed Adaptive Decoupled Moment Routing (modified gradient routed solely to the first moment, with overlap-aware adaptive strength) remains stable at 9.4 (3.8-unit gain), with gains growing to 4.5–4.8 units on 16-domain streams; the effect is invisible on clean benchmarks and appears at 7B LoRA scale.
Significance. If the mechanism is correctly identified, the work would be significant for continual learning: it reveals a previously overlooked interaction between widely used gradient-modification techniques and the default Adam optimizer that produces unexpected forgetting not captured by standard benchmarks. The reported quantitative match within 8% across eight α values, consistent gains over strong baselines (including 0.5% replay), and demonstration at 7B scale are concrete strengths that could influence how future methods combine gradient surgery with adaptive optimizers.
major comments (2)
- [Explanation of the failure mode (Adam second-moment analysis)] The central explanation states that projection induces a 1/(1-α) inflation of the old-direction effective learning rate via Adam's second-moment pathway, matching measurements within 8% across eight alpha values. However, the manuscript provides no step-by-step closed-form derivation isolating this factor from bias correction, ε, or cross-term interactions in the Adam equations (see the paragraph beginning 'We explain this effect through Adam's second-moment pathway'). The 8% empirical tolerance leaves open the possibility that other mechanisms, such as the overlap-aware adaptive strength tuning, are the actual source of the observed stability.
- [Experimental results on 8-domain and 16-domain streams] The assumption that the 1/(1-α) inflation is the dominant mechanism and that the adaptive strength generalizes is load-bearing for the claim that shared-routing methods collapse while the proposed repair succeeds. The experiments are confined to high-overlap 8- and 16-domain streams; additional ablations on low-overlap regimes or alternative optimizers would be needed to confirm the mechanism is not specific to the tested non-adaptive, high-overlap setting.
minor comments (2)
- [Abstract] The abstract reports 'within 8% across eight alpha values' and 'improving over vanilla by 3.8 units' but does not reference the corresponding figure or table; adding explicit cross-references would improve traceability.
- [Description of Adaptive Decoupled Moment Routing] The term 'overlap-aware adaptive strength' is introduced as part of the repair but its exact functional form and hyperparameter sensitivity are not detailed in the provided text; a short pseudocode or equation would clarify reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has helped us clarify and strengthen the presentation of the failure mode and its repair. We address each major comment below and have revised the manuscript accordingly.
Point-by-point responses
-
Referee: [Explanation of the failure mode (Adam second-moment analysis)] The central explanation states that projection induces a 1/(1-α) inflation of the old-direction effective learning rate via Adam's second-moment pathway, matching measurements within 8% across eight alpha values. However, the manuscript provides no step-by-step closed-form derivation isolating this factor from bias correction, ε, or cross-term interactions in the Adam equations (see the paragraph beginning 'We explain this effect through Adam's second-moment pathway'). The 8% empirical tolerance leaves open the possibility that other mechanisms, such as the overlap-aware adaptive strength tuning, are the actual source of the observed stability.
Authors: We agree that a formal derivation would improve rigor. The original manuscript presented the 1/(1-α) factor as the direct consequence of the second-moment update under projection in the high-overlap regime, validated empirically to within 8%. In the revised manuscript we have added a dedicated appendix section containing a step-by-step closed-form derivation. It begins from the Adam moment equations, applies the projection operator to the gradient, and isolates the inflation factor while explicitly accounting for bias correction and the ε term under the stated assumptions. We have also inserted an ablation that compares the proposed adaptive-strength decoupling against a fixed-strength variant; the fixed-strength version still collapses, indicating that the adaptive component is not the primary source of stability. These changes directly address the concern about alternative mechanisms. revision: yes
-
Referee: [Experimental results on 8-domain and 16-domain streams] The assumption that the 1/(1-α) inflation is the dominant mechanism and that the adaptive strength generalizes is load-bearing for the claim that shared-routing methods collapse while the proposed repair succeeds. The experiments are confined to high-overlap 8- and 16-domain streams; additional ablations on low-overlap regimes or alternative optimizers would be needed to confirm the mechanism is not specific to the tested non-adaptive, high-overlap setting.
Authors: We concur that broader testing strengthens the claim. The manuscript deliberately focuses on high-overlap streams because this is the practical regime in which the hidden failure mode appears and where standard benchmarks do not expose it. To respond to the request, the revised version includes new experiments on low-overlap domain streams. These confirm that the measured inflation factor decreases with reduced overlap, consistent with the second-moment analysis. We have also added results using alternative optimizers (RMSprop and momentum SGD) that demonstrate the collapse is specific to Adam’s second-moment pathway. The new results appear in an expanded experimental section and support the generality of the identified mechanism within the continual-learning settings where gradient modifications are typically applied. revision: yes
Circularity Check
No significant circularity; derivation chain is self-contained
Full rationale
The paper's central explanation—that projection induces a 1/(1-alpha) inflation of effective learning rate via Adam's second-moment pathway—is presented as following from the optimizer's standard update rules and then validated by direct measurement (within 8% across alpha values). This is an empirical consistency check rather than a reduction of the observed stability gains (3.8–4.8 units) to a fitted parameter or self-referential definition. The proposed adaptive decoupled routing repair is introduced as a new configuration and evaluated across baselines, optimizers, and scales without the performance claims depending on a tautological input or load-bearing self-citation. No equations are shown to equal their own inputs by construction, and the failure-mode analysis rests on comparative experiments rather than renaming or smuggling prior ansatzes.
Axiom & Free-Parameter Ledger
free parameters (1)
- overlap-aware adaptive strength
axioms (1)
- [standard math] Adam maintains separate exponential moving averages for the first and second moments, with fixed decay parameters beta1 and beta2.
invented entities (1)
- Adaptive Decoupled Moment Routing (no independent evidence)
Reference graph
Works this paper leans on
- [1] Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, and Xuemin Lin. Continual learning in large language models: Methods, challenges, and opportunities. arXiv preprint arXiv:2603.12658.
- [2] Yupeng Chen, Senmiao Wang, Yushun Zhang, Zhihang Lin, Haozhe Zhang, Weijian Sun, Tian Ding, and Ruoyu Sun. MoFO: Momentum-filtered optimizer for mitigating forgetting in LLM fine-tuning. arXiv preprint arXiv:2407.20999.
- [3] Ishir Garg, Neel Kolhe, Andy Peng, and Rohan Gopalam. Fisher-orthogonal projected natural gradient descent for continual learning. arXiv preprint arXiv:2601.12816.
- [4] Jingyang Qiao et al. Gradient projection for parameter-efficient continual learning. arXiv preprint arXiv:2405.13383.
- [5] Haomin Qiu, Miao Zhang, and Zicheng Qiao. SplitLoRA: Balancing stability and plasticity in continual learning through gradient space splitting. arXiv preprint arXiv:2505.22370.
- [6] Nikhil Shivakumar Nayak et al. Sculpting subspaces: Constrained full fine-tuning in LLMs for continual learning. arXiv preprint arXiv:2504.07097, 2025.
- [7] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In ICML, 2017.
- [8] Jianbang Ding, Xuancheng Ren, Ruixuan Luo, and Xu Sun. An adaptive and momental bound method for stochastic learning. arXiv preprint arXiv:1910.12249.
- [9] Honglin Li, Shirin Enshaeifar, and Payam Barnaghi. Continual learning in deep neural network by using a Kalman optimiser. arXiv preprint arXiv:1905.08119.
- [10] Appendix excerpt: the full two-importance-source grid, a symmetric routing decomposition, and the Figure 2 analogue for the penalty family (EWC-style Fisher, ρ-matched). [Figure residue: old-task directional energy v_t^(u)/σ² over training steps, Task B; vanilla = 0.71, decoupled = 0.62, shared = 0.30.]
- [11] Appendix J excerpt: replay-gradient mixing at 7B scale. Table 16 reports the replay-mixing comparison at 256M in the main text; Table 33 reports the same routing contrast at 7B on TRACE, showing that the pattern persists under LoRA fine-tuning.
- [12] Appendix excerpt: the diagnostic forgetting proposition is excluded from the quantitative table because it is a stylized scale analysis rather than a benchmark-level predictive formula. [Figure residue: old-task directional energy v_t^(u)/σ² over training steps, Task B; σ² window mean = 0.26, predicted limit (1−α)²σ² = 0.25, measured ratio = 3.84×.]