On the Stability of Growth in Structural Plasticity

Lute Lillo; Nick Cheney

arxiv: 2605.15435 · v1 · pith:N7VNTZ53new · submitted 2026-05-14 · 💻 cs.LG · cs.NE

Pith reviewed 2026-05-19 15:34 UTC · model grok-4.3

keywords: structural plasticity, network growth, pruning, gradient starvation, continual learning, adaptive architecture, insertion problem

The pith

Newborn network units participate in the forward pass but receive weaker gradients than units present from the start.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that adding units to a neural network during training is not the simple opposite of pruning. Pruning chooses among units that have been trained from the beginning, while growth places new units into an optimization trajectory that has already specialized. This leaves the new units active in computing outputs yet starved of the gradient signals needed to learn quickly. A reader should care because the effect is small in basic multilayer perceptrons but becomes noticeable in convolutional networks on image tasks and in continual-learning settings where plasticity matters. The authors demonstrate that interventions can help new units integrate, yet this does not always translate into stronger final sparse networks.

Core claim

Newborn units are often forward-active but backward-starved: they participate in the forward computation, yet receive much weaker gradient signal than incumbent units. This disadvantage is minor in small MLP benchmarks but becomes clear in harder image-classification settings with a convolutional trunk. Grow can achieve high final accuracy during the structural-editing procedure, while Prune is stronger when performance is averaged over the training trajectory or when the final sparse network is retrained from scratch.

What carries the argument

The insertion problem: new units are added into an already specialized optimization trajectory, producing systematically weaker back-propagated gradients than those received by units that have been present from initialization.

Load-bearing premise

By the time new units are inserted, the optimization trajectory has already specialized, so their gradient signals are weaker than those of the original units.

What would settle it

Direct measurement of per-unit gradient magnitudes right after insertion across multiple runs; if new units consistently receive gradient norms comparable to incumbent units, the central claim would not hold.
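
The check described above can be sketched in a few lines. This is a minimal illustration, not the paper's protocol: the function names, the use of cohort means, and the numbers in `runs` are all assumptions made for the example. It computes a log-parity per run (0 means equal gradient norms) and asks whether the newborn cohort is consistently below parity.

```python
import math
from statistics import mean


def log_parity(newborn_norms, incumbent_norms, eps=1e-12):
    """Log-ratio of mean per-unit gradient norm, newborn vs. incumbent.

    0 means parity; negative values mean newborn units receive weaker
    gradients. Using cohort means is an assumption of this sketch.
    """
    return math.log((mean(newborn_norms) + eps) / (mean(incumbent_norms) + eps))


def parity_across_runs(runs):
    """Return one log-parity value per run; a run is a (newborn, incumbent) pair."""
    return [log_parity(new, old) for new, old in runs]


# Made-up gradient norms for three hypothetical seeds (illustration only).
runs = [
    ([0.02, 0.03], [0.10, 0.12, 0.08]),
    ([0.01, 0.02], [0.09, 0.11, 0.10]),
    ([0.04, 0.02], [0.12, 0.10, 0.11]),
]
parities = parity_across_runs(runs)
all_negative = all(p < 0 for p in parities)
print(parities, all_negative)
```

If measured parities clustered around 0 across runs instead, that would be the outcome the paragraph above says would undermine the central claim.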

Figures

Figures reproduced from arXiv: 2605.15435 by Lute Lillo, Nick Cheney.

**Figure 1.** Cycle vs. Winning-Ticket performance on CIFAR-100. Panels (a)–(d) show mean ± 95% CI with individual seed points. (a) Grow achieves higher final cycle accuracy than Prune, but (b) this advantage vanishes when retraining the final mask from scratch. (c) Viewing the overall trajectory, Prune maintains a stronger or comparable TAA over the cycle, while (d) winning-ticket TAA remains similar across all sparse …

**Figure 2.** Growth inserts units that participate in the forward pass but receive weak backward signal. Event-aligned cohort diagnostics on CIFAR-100, where log-parity 0 denotes equality between the compared cohorts. (a) At birth, newborn Grow units have positive activation parity, showing that they are not inactive or dead on arrival; however, this forward participation weakens across successive grow cycles. (b) The …

**Figure 3.** Newborn units approach forward-activity parity, but not backward-signal parity. Post-birth dynamics on CIFAR-100 measure newly grown units relative to already-active units over the remaining training segment after each growth event; parity is marked by the red dashed line at 1. Across compactness levels, activation ratio (blue) stays near parity and sometimes exceeds it, indicating that newborn units parti…

**Figure 4.** Early newborn integration predicts adaptive-cycle quality, more than final ticket quality. Panels (a)–(b) relate early post-birth parity to Cycle-TAA. Early parity is the average log-ratio between newborn and previously active units over the early post-birth window; values closer to 0 indicate closer parity. Large labeled markers denote method means across compactness; lighter points show individual compa…

**Figure 5.** Repeated-shift benchmarks favor pruning, while integration-friendly growth is the most reliable growth variant. Across six CL benchmarks, Prune is the strongest or near-strongest structural baseline in most rapid-shift settings, consistent with the advantage of preserving mature capacity. Among growth-family methods, Grow + Rand. Smooth-Leaky is the most robust: it consistently improves over Grow, narrows …

**Figure 6.** Gradient parity as the primary mechanistic signal (CIFAR-100). Cycle-TAA vs. event-local gradient parity across compactness. The vertical dashed line marks parity (0). Grow occupies the negative-parity regime, indicating newborn gradient disadvantage, whereas Prune occupies the positive-parity regime, indicating that kept units receive stronger learning signal than pruned ones.

**Figure 7.** Activation parity as a sanity check (CIFAR-100). Cycle-TAA vs. event-local activation parity across compactness. The vertical dashed line marks parity (0). Activation parity shows that newborn units are not trivially inactive, but it does not account for the main performance separation as directly as gradient parity.

**Figure 8.** Parity geometry of structural edits. Each panel corresponds to a compactness target c. Points plot log activation parity (x-axis) against log gradient parity (y-axis), color-coded by cycle. Markers distinguish event type: Grow-birth (new vs. old) and Prune-exit (kept vs. pruned). Positive x-values indicate greater activation in the focal cohort, while negative y-values indicate reduced per-unit learning si…

**Figure 9.** Birth-time parity under activation- and gradient-based Grow. We compare the default activation-based top-k Grow heuristic with a gradient-based top-k variant under neutral allocation bias. Top row reports birth activation parity, log(act_new/act_old), and bottom row reports birth gradient parity, log(grad_new/grad_old). The dotted line denotes parity. Both heuristics produce forward-active newborn units, but…

**Figure 10.** Post-insertion ratio under activation- and gradient-based Grow. We report cycle-level ratios from the vitality logs. Top row shows activation ratio, act_new/act_old, and bottom row shows gradient ratio, grad_new/grad_old. The dotted line denotes parity. Although newborn activation rates remain close to parity, newborn gradient magnitudes remain below parity for both heuristics. The bottleneck is not simply a…

**Figure 11.** Cycle vs. Winning-Ticket performance on CIFAR-10 (SGD, η = 0.1). Panels (a)–(d) show mean ± 95% CI with per-seed scatter across compactness for Cycle and Winning-Ticket ACC (a,b) and TAA (c,d). Panel (e) reports the per-seed gap Δ = ticket − cycle in final accuracy.

**Figure 12.** Grow cycle stress test on CIFAR-100. Left: Cycle-TAA degrades monotonically with K at all compactness levels, indicating worse time-averaged learning when growth events become more frequent. Right: Cycle-ACC is less affected than TAA, with only mild sensitivity at higher compactness.

**Figure 13.** Post-birth dynamics under time scarcity. Gradient ratio as a function of newborn age for K ∈ {5, 10, 20} across compactness levels. In all cases, parity improves gradually with age, showing that newborn integration is slow and continues over many epochs. Shorter cycles truncate this recovery by reducing the time available before the next growth event.

**Figure 14.** Activation-control analysis across eight benchmarks. Each bar reports the signed delta Δ = RSL − ReLU; positive values indicate that replacing ReLU with Rand. Smooth-Leaky improves performance for that method and compactness. The central pattern is not a uniform lift across all methods, but a redistribution of benefit across structural regimes: Rand. Smooth-Leaky most strongly improves Grow in several of th…

**Figure 15.** Rand. Smooth-Leaky increases newborn forward participation at birth but does not eliminate the immediate gradient disadvantage. We compare ReLU and Rand. Smooth-Leaky in the CIFAR-100 Grow setting using event-aligned newborn–old log-parity at the birth snapshot. Positive activation parity indicates that newborn units are forward-active relative to previously active units, while negative gradient parity in…

**Figure 16.** Rand. Smooth-Leaky improves post-birth dynamics. We compare activation and gradient ratios for newborn units after growth events in CIFAR-100 Grow. Parity corresponds to a ratio of 1. Under ReLU, newborn units often remain below activation parity and receive substantially weaker gradient signal than incumbents. Rand. Smooth-Leaky shifts activation ratios above or closer to parity and consistently increase…

**Figure 17.** Early-task plasticity across sequential-accumulation and repeated-shift benchmarks. We report Early Task TAA, defined as the average accuracy over only the initial portion of each task, emphasizing immediate post-shift adaptation rather than late within-task convergence. Dense or Prune-based methods retain the strongest early-window performance when adaptation time is severely limited, whereas growth-base…
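
The parity quantities that recur in these captions can be written compactly. The block below is a sketch consistent with the log-ratios named in Figures 4, 9 and 10. The cohort sets N (newborn) and O (previously active), the per-unit magnitudes a_i and g_i, the window W, and the use of cohort means are notation and assumptions introduced here, not definitions taken from the paper.

```latex
% N: newborn cohort, O: previously active cohort (assumed notation).
% a_i, g_i: per-unit activation and gradient magnitudes at one snapshot.
\[
  \mathrm{act}_{new} = \frac{1}{|N|}\sum_{i \in N} a_i, \qquad
  \mathrm{act}_{old} = \frac{1}{|O|}\sum_{i \in O} a_i, \qquad
  \mathrm{grad}_{new} = \frac{1}{|N|}\sum_{i \in N} g_i, \qquad
  \mathrm{grad}_{old} = \frac{1}{|O|}\sum_{i \in O} g_i
\]
\[
  \pi_{\mathrm{act}} = \log\frac{\mathrm{act}_{new}}{\mathrm{act}_{old}}, \qquad
  \pi_{\mathrm{grad}} = \log\frac{\mathrm{grad}_{new}}{\mathrm{grad}_{old}}
\]
% Early parity: average log-ratio over an early post-birth window W of snapshots t.
\[
  \bar{\pi}_{\mathrm{grad}} = \frac{1}{|W|}\sum_{t \in W} \pi_{\mathrm{grad}}(t)
\]
```

Under this notation, parity sits at a log-ratio of 0 (a plain ratio of 1), and "forward-active but backward-starved" corresponds to an activation log-parity near or above 0 together with a negative gradient log-parity.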

Original abstract

Standard deep-learning pipelines usually choose the network architecture before training and keep it fixed throughout optimization. In contrast, a model can also be adapted by editing its structure during training, for example by pruning existing hidden-neuron units or growing new ones. Although growth is appealing for adaptive and continual systems, we show that it is not simply the inverse of pruning. Pruning selects among units that have participated in training from the start, whereas growth inserts new units into an already specialized optimization trajectory. We isolate this insertion problem and show that newborn units are often forward-active but backward-starved: they participate in the forward computation, yet receive much weaker gradient signal than incumbent units. This disadvantage is minor in small MLP benchmarks, but becomes clear in harder image-classification settings with a convolutional trunk. In these settings, Grow can achieve high final accuracy during the structural-editing procedure, while Prune is stronger when performance is averaged over the training trajectory or when the final sparse network is retrained from scratch. Interventions targeting optimizer state, insertion, selection, and trainability show that improving the integration of newborn units can improve adaptive performance, but does not automatically produce better final subnetworks. In continual-learning benchmarks stressing plasticity loss, Grow becomes competitive mainly when new units have enough time to integrate. Together, these results suggest that Grow should be evaluated not only as an architecture-search operator, but as a time-sensitive optimization process whose success depends on insertion stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Growth isn't just the reverse of pruning because new units get inserted into a specialized trajectory and end up gradient-starved.

The main thing to know is that this paper isolates a real insertion problem in structural growth: newborn units participate in the forward pass but receive weaker gradient signals than units that have been there from the start. The authors back this up with Grow-versus-Prune comparisons on final accuracy, trajectory averages, and retrained performance, plus interventions on optimizer state, insertion, selection, and trainability. In small MLPs the gap is minor, but it shows up more clearly in convolutional image-classification settings and in continual-learning tasks where new units need time to integrate. The result is that Grow can look competitive during the editing phase yet lose out when you average over the run or retrain the final sparse network from scratch.

That framing of growth as a time-sensitive optimization process rather than a pure architecture operator is the freshest angle here. The interventions are useful for showing that targeted fixes can help integration without automatically producing stronger final subnetworks. The empirical distinctions are supported by the reported results, and there is no obvious internal contradiction or hidden fitting step.

The main soft spot is that the effect size depends on task difficulty and insertion timing, so the practical impact may be narrower than the abstract suggests. Details on exact data splits, statistical tests, and error bars would strengthen the claims, but the controlled comparisons described are a reasonable start.

This paper is for people working on dynamic networks, continual learning, and structural plasticity who already care about how optimization trajectories interact with architecture changes. A reader focused on adaptive systems would get concrete value from the Grow/Prune contrast and the stability angle. It is worth sending to a serious referee so the methods and interventions can be checked in detail.

Referee Report

2 major / 3 minor

Summary. The paper examines structural plasticity in deep networks by comparing growth (adding new units mid-training) to pruning (removing units). It argues that growth is not the inverse of pruning because new units are inserted into an already-specialized optimization trajectory, resulting in newborn units that are forward-active but backward-starved (participating in the forward pass yet receiving systematically weaker gradient signals than incumbent units). This is demonstrated through controlled experiments on MLPs and convolutional image-classification tasks, interventions targeting optimizer state/insertion/selection/trainability, distinctions between final accuracy and trajectory-averaged/retrained performance, and continual-learning settings where integration time affects competitiveness.

Significance. If the central empirical observations hold, the work usefully highlights an asymmetry in dynamic architecture methods and frames growth as a time-sensitive optimization process rather than a pure architecture-search operator. The targeted interventions and separation of final vs. averaged performance provide concrete guidance for adaptive and continual-learning systems. Credit is due for the reproducible-style interventions and the focus on plasticity loss in continual benchmarks.

major comments (2)

[§4.2] §4.2 (gradient-signal experiments): the claim that newborn units receive 'much weaker gradient signal' is central to the insertion-problem argument, yet the manuscript does not specify the precise metric (e.g., per-unit L2 norm, cosine similarity to incumbent gradients, or layer-wise averages) nor whether any scaling or normalization is applied before comparison; without this, it is difficult to assess whether the observed difference is an artifact of initialization scale or a genuine optimization-trajectory effect.
[Table 3] Table 3 and associated text (image-classification results): the reported advantage of Grow on final accuracy versus Prune on trajectory-averaged or retrained performance lacks error bars, number of random seeds, or statistical tests; given that the central claim rests on these performance distinctions, the absence of variability measures makes it impossible to judge whether the differences are robust or could be explained by post-hoc run selection.

minor comments (3)

[Abstract] Abstract and §2: the phrase 'backward-starved' is used repeatedly before any quantitative definition or figure reference is given; a short parenthetical or footnote linking to the measurement protocol would improve readability.
[§5] §5 (continual-learning experiments): the statement that 'Grow becomes competitive mainly when new units have enough time to integrate' would benefit from an explicit plot or table showing accuracy as a function of insertion timing or number of subsequent epochs.
[§3] Notation: the distinction between 'final accuracy during the structural-editing procedure' and 'performance averaged over the training trajectory' is important but introduced without a compact symbol or equation; adding a brief definition in §3 would reduce ambiguity.
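
The distinction raised in the last minor comment can be made concrete with a short sketch. It assumes that TAA denotes accuracy averaged over evaluation checkpoints along the training trajectory, and that Early Task TAA averages only an initial fraction of checkpoints. The function names, the `fraction` parameter, and the two trajectories are hypothetical and chosen only to show how the two metrics can rank methods differently.

```python
def final_accuracy(trajectory):
    """Accuracy at the last evaluation checkpoint."""
    return trajectory[-1]


def trajectory_averaged_accuracy(trajectory):
    """Mean accuracy over all checkpoints (assumed meaning of TAA)."""
    return sum(trajectory) / len(trajectory)


def early_window_taa(trajectory, fraction=0.25):
    """Mean accuracy over the first `fraction` of checkpoints (assumed window)."""
    k = max(1, int(len(trajectory) * fraction))
    return sum(trajectory[:k]) / k


# Made-up accuracy trajectories (illustration only).
grow = [0.20, 0.30, 0.45, 0.70]   # slow start, strong finish
prune = [0.50, 0.55, 0.60, 0.65]  # steady throughout

print(final_accuracy(grow), final_accuracy(prune))
print(trajectory_averaged_accuracy(grow), trajectory_averaged_accuracy(prune))
```

With these toy numbers the slow starter wins on final accuracy while the steady method wins on the trajectory average, which is the kind of split a compact symbol for each metric would help readers keep apart.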

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript examining the asymmetry between growth and pruning in structural plasticity. We address each major comment below and outline the revisions we will make to improve clarity and reporting standards.

Referee: [§4.2] §4.2 (gradient-signal experiments): the claim that newborn units receive 'much weaker gradient signal' is central to the insertion-problem argument, yet the manuscript does not specify the precise metric (e.g., per-unit L2 norm, cosine similarity to incumbent gradients, or layer-wise averages) nor whether any scaling or normalization is applied before comparison; without this, it is difficult to assess whether the observed difference is an artifact of initialization scale or a genuine optimization-trajectory effect.

Authors: We agree that the gradient metric requires explicit definition to rule out artifacts. In the experiments, gradient signal strength was quantified as the L2 norm of the gradient with respect to each unit's incoming parameters, averaged across the batch and over post-insertion training steps; no additional per-unit scaling or normalization was applied beyond the standard back-propagation and Adam optimizer state. We have revised §4.2 to state this definition verbatim, added a short paragraph explaining why the L2 norm is appropriate for comparing backward starvation, and included a supplementary check confirming that the observed gap remains after matching initialization variance. These changes directly address the concern.
revision: yes

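
The metric described in this response can be sketched on a toy network. This is an illustration under stated assumptions, not the authors' code: a single ReLU hidden layer with a scalar output and squared loss, the bias counted among a unit's incoming parameters, and "averaged across the batch" read as the norm of the batch-mean gradient. All names are hypothetical.

```python
import math


def unit_grad_norms(W, b, v, batch):
    """Per-hidden-unit L2 norm of the batch-mean gradient of a squared loss.

    W[j] holds the incoming weights of hidden unit j, b[j] its bias, and
    v[j] its outgoing weight; batch is a list of (x, y) with scalar y.
    """
    n_units, n_in = len(W), len(W[0])
    grad_W = [[0.0] * n_in for _ in range(n_units)]
    grad_b = [0.0] * n_units
    for x, y in batch:
        pre = [sum(w * xi for w, xi in zip(W[j], x)) + b[j] for j in range(n_units)]
        hidden = [max(z, 0.0) for z in pre]  # ReLU
        err = sum(vj * hj for vj, hj in zip(v, hidden)) - y  # dL/dy_hat, L = 0.5 * err**2
        for j in range(n_units):
            delta = err * v[j] * (1.0 if pre[j] > 0.0 else 0.0)  # backprop through ReLU
            grad_b[j] += delta / len(batch)
            for i in range(n_in):
                grad_W[j][i] += delta * x[i] / len(batch)
    return [
        math.sqrt(sum(g * g for g in grad_W[j]) + grad_b[j] ** 2)
        for j in range(n_units)
    ]


def mean_over_steps(norms_per_step):
    """Average each unit's norm over a list of post-insertion steps."""
    n_steps = len(norms_per_step)
    n_units = len(norms_per_step[0])
    return [sum(step[j] for step in norms_per_step) / n_steps for j in range(n_units)]


# Toy setup: unit 0 is an incumbent, unit 1 a newborn with a small outgoing weight.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
v = [1.0, 0.1]
norms = unit_grad_norms(W, b, v, [([1.0, 1.0], 0.0)])
print(norms)
```

In this toy case the newborn's gradient norm is smaller purely because its outgoing weight is smaller, which is the initialization-scale artifact the referee asks the authors to rule out; separating that from a trajectory effect requires the variance-matched check mentioned in the response.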
Referee: Table 3 and associated text (image-classification results): the reported advantage of Grow on final accuracy versus Prune on trajectory-averaged or retrained performance lacks error bars, number of random seeds, or statistical tests; given that the central claim rests on these performance distinctions, the absence of variability measures makes it impossible to judge whether the differences are robust or could be explained by post-hoc run selection.

Authors: The referee correctly identifies a reporting gap. The Table 3 results were obtained from five independent random seeds, with the reported trends consistent across all runs. In the revised manuscript we will add standard-deviation error bars to the table, explicitly state the number of seeds, and include a brief statistical note (paired t-tests on the key metrics) either in the main text or as a short supplement paragraph. This will allow readers to assess robustness directly.
revision: yes
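
A paired t-test of the kind promised here can be sketched with the standard library alone. The per-seed values below are invented for illustration and are not results from the paper. The function name is hypothetical, and the comparison uses the standard two-sided 5% critical value for four degrees of freedom (2.776).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Made-up trajectory-averaged accuracies for five seeds (illustration only).
prune_taa = [0.52, 0.50, 0.53, 0.51, 0.54]
grow_taa = [0.48, 0.47, 0.50, 0.47, 0.49]

t_stat, df = paired_t_statistic(prune_taa, grow_taa)
significant = abs(t_stat) > 2.776  # two-sided 5% critical value for df = 4
print(t_stat, df, significant)
```

Pairing by seed is appropriate only if both methods share the same seeds and data order, which the response implies but does not state.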

Circularity Check

0 steps flagged

No significant circularity detected

The paper advances empirical claims about the stability of structural growth versus pruning in neural networks, supported by benchmark comparisons, interventions on optimizer state and insertion timing, and observations of gradient signals in convolutional and continual-learning settings. No derivation chain, first-principles equations, or fitted parameters are presented that reduce by construction to inputs defined within the paper itself; the central distinction between Grow and Prune rests on externally measurable performance differences rather than self-referential definitions or self-citation load-bearing arguments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard deep-learning assumptions about gradient flow and trajectory specialization. No new entities are introduced and no free parameters are explicitly fitted to produce the central observation.

axioms (1)

domain assumption: Once training has progressed, the optimization trajectory has specialized such that newly inserted units receive systematically weaker gradients.
This premise is invoked to explain why growth is not the inverse of pruning.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 14 internal anchors

[1]

Asadi, K., Fakoor, R., and Sabach, S. (2023). Resetting the optimizer in deep rl: An empirical study.Advances in Neural Information Processing Systems, 36:72284–72324

work page 2023
[2]

Behrouz, A., Razaviyayn, M., Zhong, P., and Mirrokni, V. (2025). Nested learning: The illusion of deep learning architectures.arXiv preprint arXiv:2512.24695

work page arXiv 2025
[3]

Bellec, G., Kappel, D., Maass, W., and Legenstein, R. (2017). Deep rewiring: Training very sparse deep networks.arXiv preprint arXiv:1711.05136

work page internal anchor Pith review Pith/arXiv arXiv 2017
[4]

Cai, Z., Sener, O., and Koltun, V. (2021). Online continual learning with natural distribution shifts: An empirical study with visual data. InProceedings of the IEEE/CVF international conference on computer vision, pages 8281–8290

work page 2021
[5]

Chen, T., Goodfellow, I., and Shlens, J. (2016). Net2net: Accelerating learning via knowledge transfer. InICLR

work page 2016
[6]

Cheney, N., Schrimpf, M., and Kreiman, G. (2017). On the robustness of convolutional neural networks to internal architecture and weight perturbations.arXiv preprint arXiv:1703.08245

work page internal anchor Pith review Pith/arXiv arXiv 2017
[7]

Dai, X., Yin, H., and Jha, N. K. (2019). Nest: A neural network synthesis tool based on a grow-and-prune paradigm.IEEE Transactions on Computers, 68(10):1487–1497

work page 2019
[8]

Sparse networks from scratch: Faster training without losing performance, 2019

Dettmers, T. and Zettlemoyer, L. (2019). Sparse networks from scratch: Faster training without losing performance.arXiv preprint arXiv:1907.04840

work page arXiv 2019
[9]

F., Lan, Q., Rahman, P., Mahmood, A

Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. (2024). Loss of plasticity in deep continual learning.Nature, 632:768—774

work page 2024
[10]

Evci, U., Gale, T., Menick, J., Sampedro, P., Lorch, E., and Sohl-Dickstein, J. (2020). Rigging the lottery: Making all tickets winners. InNeurIPS

work page 2020
[11]

Evci, U., van Merrienboer, B., Unterthiner, T., Pedregosa, F., and Vladymyrov, M. (2022). Gradmax: Growing neural networks using gradient information. InInternational Conference on Learning Representations

work page 2022
[12]

PathNet: Evolution Channels Gradient Descent in Super Neural Networks

Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A. A., Pritzel, A., and Wierstra, D. (2017). Pathnet: Evolution channels gradient descent in super neural networks.arXiv preprint arXiv:1701.08734

work page internal anchor Pith review Pith/arXiv arXiv 2017
[13]

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Frankle, J. and Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks.arXiv preprint arXiv:1803.03635

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

K., Roy, D

Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. (2019). Stabilizing the lottery ticket hypothesis.arXiv: Learning

work page 2019
[15]

A., Prabhu, A., Torr, P

Ghunaim, Y., Bibi, A., Alhamoud, K., Alfarra, M., Al Kader Hammoud, H. A., Prabhu, A., Torr, P. H., and Ghanem, B. (2023). Real-time evaluation in online continual learning: A new hope. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11888–11897

work page 2023
[16]

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks.arXiv preprint arXiv:1312.6211. 10

work page internal anchor Pith review Pith/arXiv arXiv 2013
[17]

Gordon, A., Eban, E., Nachum, O., Chen, B., Wu, H., Yang, T.-J., and Choi, E. (2018). Morphnet: Fast & simple resource-constrained structure learning of deep networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1586–1595

work page 2018
[18]

Han, S., Mao, H., and Dally, W. J. (2015a). Deep compression: Compressing deep neural net- works with pruning, trained quantization and huffman coding.arXiv preprint arXiv:1510.00149

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Han, S., Pool, J., Tran, J., and Dally, W. (2015b). Learning both weights and connections for efficient neural network.Advances in neural information processing systems, 28

work page
[20]

Kang, H., Mina, R. J. L., Madjid, S. R. H., Yoon, J., Hasegawa-Johnson, M., Hwang, S. J., and Yoo, C. D. (2022). Forget-free continual learning with winning subnetworks. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S., editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of...

work page 2022
[21]

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014
[22]

Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. InUniversity of Toronto Technical Report

work page 2009
[23]

Kumar, S., Marklund, H., and Van Roy, B. (2023). Maintaining plasticity in continual learning via regenerative regularization.arXiv preprint arXiv:2308.11958

work page arXiv 2023
[24]

Lasby, M., Golubeva, A., Evci, U., Nica, M., and Ioannou, Y. (2023). Dynamic sparse training with structured sparsity.arXiv preprint arXiv:2305.02299

work page arXiv 2023
[25]

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324

work page 1998
[26]

Li, X., Zhou, Y., Wu, T., Socher, R., and Xiong, C. (2019). Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. InInternational conference on machine learning, pages 3925–3934. PMLR

work page 2019
[27]

Activation Function Design Sustains Plasticity in Continual Learning

Lillo, L. and Cheney, N. (2025). Activation function design sustains plasticity in continual learning.arXiv preprint arXiv:2509.22562

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

A., Pascanu, R., and Dabney, W

Lyle, C., Zheng, Z., Nikishin, E., Pires, B. A., Pascanu, R., and Dabney, W. (2023). Understanding plasticity in neural networks. InInternational Conference on Machine Learning, pages 23190– 23211. PMLR

work page 2023
[29]

Mallya, A., Davis, D., and Lazebnik, S. (2018). Piggyback: Adapting a single network to multiple tasks by learning to mask weights. InProceedings of the European conference on computer vision (ECCV), pages 67–82

work page 2018
[30]

and Lazebnik, S

Mallya, A. and Lazebnik, S. (2018). Packnet: Adding multiple tasks to a single network by iterative pruning. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773

work page 2018
[31]

Miconi, T. (2016). Neural networks with differentiable structure.arXiv preprint arXiv:1606.06216

work page internal anchor Pith review Pith/arXiv arXiv 2016
[32]

Mocanu, D. C. et al. (2018). Scalable training of artificial neural networks with adaptive sparse connectivity. InAAAI. 11

work page 2018
[33]

Mosbach, M., Andriushchenko, M., and Klakow, D. (2020). On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines.arXiv preprint arXiv:2006.04884

work page arXiv 2020
[34]

Prabhu, A., Cai, Z., Dokania, P., Torr, P., Koltun, V., and Sener, O. (2023). Online continual learning without the storage constraint.arXiv preprint arXiv:2305.09253

work page arXiv 2023
[35]

On the Convergence of Adam and Beyond

Reddi, S. J., Kale, S., and Kumar, S. (2019). On the convergence of adam and beyond.arXiv preprint arXiv:1904.09237

work page internal anchor Pith review Pith/arXiv arXiv 2019
[36]

C., and Fei-Fei, L

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge.International Journal of Computer Vision (IJCV), 115(3):211–252

work page 2015
[37]

Progressive Neural Networks

Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. (2016). Progressive neural networks.arXiv preprint arXiv:1606.04671

work page internal anchor Pith review Pith/arXiv arXiv 2016
[38]

Wei, T., Wang, C., Rui, Y., and Chen, C. W. (2016). Network morphism. InProceedings of the 33rd International Conference on Machine Learning (ICML)

work page 2016
[39]

Wu, L., Liu, B., Stone, P., and Liu, Q. (2020). Firefly neural architecture descent: a general approach for growing neural networks.Advances in neural information processing systems, 33:22373–22383

work page 2020
[40]

Wu, L., Wang, D., and Liu, Q. (2019). Splitting steepest descent for growing neural architectures. Advances in neural information processing systems, 32

work page 2019
[41]

Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for bench- marking machine learning algorithms.arXiv preprint arXiv:1708.07747

work page internal anchor Pith review Pith/arXiv arXiv 2017
[42]

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. (2020). On layer normalization in the transformer architecture. InInternational conference on machine learning, pages 10524–10533. PMLR

work page 2020
[43]

Yang, L., Lin, S., Zhang, J., and Fan, D. (2021). Grown: Grow only when necessary for continual learning.arXiv preprint arXiv:2110.00908

work page arXiv 2021
[44]

Yoon, J., Yang, E., Lee, J., and Hwang, S. J. (2017). Lifelong learning with dynamically expand- able networks.arXiv preprint arXiv:1708.01547

work page internal anchor Pith review Pith/arXiv arXiv 2017
[45]

You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. (2019). Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962

[46]

Yuan, X., Savarese, P., and Maire, M. (2023). Accelerated training via incrementally growing neural networks using variance transfer and learning rate adaptation. Advances in Neural Information Processing Systems, 36:16673–16692

[47]

Zhao, Y., Saxena, D., Cao, J., Liu, X., and Song, C. (2024). Overcoming growth-induced forgetting in task-agnostic continual learning. arXiv preprint arXiv:2408.10566

[48]

Zhuang, J., Tang, T., Ding, Y., Tatikonda, S. C., Dvornek, N., Papademetris, X., and Duncan, J. (2020). AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in neural information processing systems, 33:18795–18806

Submission Checklist

[49]

For all authors. . . (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] The abstract and introduction accurately reflect the paper’s contributions and scope. (b) Did you describe the limitations of your work? [Yes] We discuss limitations, including the restricted architectural setting...

[50]

If you ran experiments. . . (a) Did you use the same evaluation protocol for all methods being compared (e.g., same benchmarks, data (sub)sets, available resources, etc.)? [Yes] All compared methods use the same datasets, architectures, training budgets, evaluation checkpoints, and compactness targets unless explicitly stated. (b) Did you specify all the ...

[51]

With respect to the code used to obtain your results. . . (a) Did you include the code, data, and instructions needed to reproduce the main experimental results, including all dependencies (e.g., requirements.txt with explicit versions), random seeds, an instructive README with installation instructions, and execution commands (either in the supplementa...

[52]

If you used existing assets (e.g., code, data, models). . . (a) Did you cite the creators of used assets? [Yes] We cite the creators of all datasets, algorithms, and software assets used. (b) Did you discuss whether and how consent was obtained from people whose data you’re using/curating if the license requires it? [N/A] We use standard public benchmark ...

[53]

If you created/released new assets (e.g., code, data, models). . . (a) Did you mention the license of the new assets (e.g., as part of your code submission)? [Yes] The license for released code and assets is specified. (b) Did you include the new assets either in the supplemental material or as a URL (to, e.g., GitHub or Hugging Face)? [Yes] The released as...

[54]

If you used crowdsourcing or conducted research with human subjects. . . (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [No] No crowdsourcing or human-subject experiments were conducted. (b) Did you describe any potential participant risks, with links to institutional review board (irb) approvals,...

[55]

If you included theoretical results. . . (a) Did you state the full set of assumptions of all theoretical results? [No] The paper does not present theoretical results. (b) Did you include complete proofs of all theoretical results? [No] The paper does not present theoretical results.

A Datasets, Benchmarks and Hyperparameters

A.1 Datasets and Benchmark...

