When Does Structure Matter in Continual Learning? Dimensionality Controls When Modularity Shapes Representational Geometry

Eleni Nisioti; Joachim Winter Pedersen; Kathrin Korte; Sebastian Risi

arxiv: 2604.27656 · v1 · submitted 2026-04-30 · 💻 cs.LG · cs.AI · cs.NE


Kathrin Korte, Joachim Winter Pedersen, Eleni Nisioti, Sebastian Risi

Pith reviewed 2026-05-07 06:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE

keywords continual learning · representational geometry · modularity · dimensionality · recurrent networks · task similarity · stability-plasticity


The pith

Representational dimensionality controls when modularity shapes representational geometry in continual learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study examines how network modularity, task similarity, and representational dimensionality interact in sequential learning tasks using recurrent networks. By varying weight initialization scales to create high- and low-dimensional regimes, the authors compare a task-partitioned modular network to a single-module baseline. In high-dimensional regimes, architecture has little effect as representations can handle multiple tasks without interference. In lower-dimensional regimes, however, modular networks develop a graded geometry with task subspaces that overlap for similar tasks, partially orthogonalize for moderate dissimilarity, and separate for dissimilar ones, an effect missing in the non-modular case. This positions dimensionality as the organizing factor for when structural separation becomes useful in managing the stability-plasticity trade-off.

Core claim

When the effective dimensionality of representations is low, due to small weight initialization scales, a modular recurrent network exhibits task-specific subspaces with alignment proportional to task similarity, enabling appropriate transfer and reduced interference, whereas a monolithic network does not develop such structured geometry. In high-dimensional regimes from larger initialization, both architectures show similar behavior with minimal impact from modularity. The paper thus establishes dimensionality as the variable that determines the functional relevance of architectural structure in continual learning.

What carries the argument

The effective dimensionality of learned representations, modulated by weight initialization scale to induce rich versus lazy regimes, which in turn determines the degree to which modular architecture influences the alignment and separation of task-specific representational subspaces.
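
To make the manipulation concrete, the following is a minimal sketch of how an initialization scale and a task-partitioned connectivity pattern might be set up for a recurrent weight matrix. It is not the authors' implementation: the function name `init_recurrent_weights`, the Gaussian initialization with standard deviation `scale / sqrt(n_hidden)`, and the strictly block-diagonal mask are all assumptions made for illustration.

```python
import numpy as np


def init_recurrent_weights(n_hidden, scale, n_modules=1, seed=0):
    """Hypothetical helper: scaled Gaussian init with optional modular mask.

    Assumption: entries are drawn with std = scale / sqrt(n_hidden).
    Assumption: a "modular" network has no recurrent connections between
    modules (block-diagonal mask); the paper's architecture may differ.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, scale / np.sqrt(n_hidden), size=(n_hidden, n_hidden))
    if n_modules > 1:
        # Assign each hidden unit to a module, then zero cross-module weights.
        module_of_unit = np.arange(n_hidden) * n_modules // n_hidden
        mask = module_of_unit[:, None] == module_of_unit[None, :]
        w = w * mask
    return w


# Single-module baseline versus a two-module partition at a small scale.
w_single = init_recurrent_weights(n_hidden=8, scale=0.1, n_modules=1)
w_modular = init_recurrent_weights(n_hidden=8, scale=0.1, n_modules=2)
```

Under the paper's account, smaller `scale` values correspond to the rich, lower-dimensional regime and larger values to the lazy, high-dimensional one. The sketch only fixes the starting weights and says nothing about training.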

If this is right

Modular architectures are functionally relevant primarily in constrained, low-dimensional settings where they enable similarity-dependent subspace geometry.
High-dimensional representations mitigate interference without requiring explicit structural separation.
Continual learning systems can leverage adaptive representational geometry as a design principle to balance plasticity and stability.
Task similarity measures can predict the extent of representational overlap in modular networks under rich regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Controlling dimensionality dynamically during learning could allow systems to adaptively use modularity only when beneficial.
Similar dimensionality effects might appear in other architectures, suggesting a general principle for continual learning beyond recurrent networks.
Empirical validation on diverse task sequences could test whether the observed graded geometry improves long-term retention and transfer.

Load-bearing premise

Varying the scale of weight initialization creates reliably distinct rich and lazy regimes with corresponding effective dimensionalities that generalize to other architectures and task types.

What would settle it

The claim would be refuted if experiments showed that modular and single networks produce equivalent representational geometries across all initialization scales and task similarities, or if low-dimensional regimes failed to produce the predicted graded alignment patterns.

Figures

Figures reproduced from arXiv: 2604.27656 by Eleni Nisioti, Joachim Winter Pedersen, Kathrin Korte, Sebastian Risi.

**Figure 1.** Overview of the continual-learning setup, architectures, and representational regimes. (a) Sequential training protocol. Networks are trained on task A in phase A1, then on task B in phase B, and finally retested on task A in phase A2. Task B is instantiated in three similarity conditions relative to task A: same, near, and far. The schematic highlights two competing pressures: the opportunity for transfer…

**Figure 2.** Modular structure attenuates transfer-interference costs in sequential learning in constrained regimes. (a, b) Accuracy across the sequential A1 → B → A2 training protocol for the modular network (a) and the single network (b) under the three task-similarity conditions (same, near, far) and across initialization weight scales. In both architectures, learning on A1 and B rapidly reaches high accuracy, but t…

**Figure 3.** Initialization weight scaling controlled representational dimensionality reveals architecture-dependent representational geometry. (a, b) Effective dimensionality of hidden representations, measured as the number of principal components required to explain 99% of the variance, for modular (a) and the single networks (b). Columns correspond to decreasing initialization weight scale, from the lazy regime to …

**Figure 4.** 3D PCA projections reveal similarity-dependent organization of task representations under reduced dimensionality. Each panel shows hidden-state trajectories projected onto the first three principal components of a PCA fitted jointly to Post A1, Post B, and Post A2 activations for a given network and initialization regime. For each phase, trajectories are computed over a fixed sweep of 12 inputs, with the …
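
The effective-dimensionality measure described in the Figure 3 caption (the number of principal components needed to explain 99% of the variance) can be sketched as follows. This is an illustrative NumPy version, not the authors' code; the function name `effective_dimensionality` and the assumed layout of `hidden` as a (samples × units) array are hypothetical.

```python
import numpy as np


def effective_dimensionality(hidden, threshold=0.99):
    """Count principal components needed to reach `threshold` of the variance.

    Assumption: `hidden` is a 2D array of shape (n_samples, n_units), e.g.
    hidden states collected over inputs and time steps.
    """
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the PCA variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    total = variances.sum()
    if total == 0:
        return 0  # constant activations carry no variance
    cumulative = np.cumsum(variances) / total
    # First index at which the cumulative ratio reaches the threshold.
    n_components = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n_components, len(variances))


# Example: activations confined to a 3-dimensional subspace of 10 units.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(10, 3)))
hidden_states = rng.normal(size=(500, 3)) @ basis.T
print(effective_dimensionality(hidden_states))
```

A threshold-based count like this is sensitive to the chosen cutoff, so comparisons across networks are only meaningful when the same threshold and the same set of inputs are used.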

Original abstract

To preserve previously learned representations, continual learning systems must strike a balance between plasticity, the ability to acquire new knowledge, and stability. This stability-plasticity dilemma affects how representations can be reused across tasks: shared structure enables transfer when tasks are similar but may also induce interference when new learning disrupts existing representations. However, it remains unclear when and why structural separation influences this trade-off. In this study, we examine how network architecture, task similarity, and representational dimensionality jointly shape learning in a sequential task paradigm inspired by transfer-interference studies. We compare a task-partitioned modular recurrent network with a single-module baseline by systematically varying task similarity (low, medium, high) and the scale of weight initialization, which induces different learning regimes that we empirically characterize through the effective dimensionality of the learned representations. We find that architecture has minimal impact in high-dimensional regimes where representations are sufficiently unconstrained to accommodate multiple tasks without strong interference. In contrast, in lower-dimensional (rich) regimes, architectural separation is decisive: modular networks exhibit graded alignment of task-specific subspaces with overlap for similar tasks, partial orthogonalization for moderately dissimilar tasks, and stronger separation for dissimilar tasks. This graded geometry is absent in the single network baseline. Our findings suggest that representational dimensionality acts as a key organizing variable governing when structural separation becomes functionally relevant, and highlight adaptive geometry as a central principle for designing continual learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper finds that low representational dimensionality makes modular separation matter for graded task subspace alignment in continual learning, while high dimensionality renders architecture irrelevant, but the init-scale manipulation confounds dimensionality with gradient and feature-learning effects.


The central finding is that representational dimensionality organizes when modularity affects continual learning outcomes. In low-dimensional regimes the modular recurrent network produces task subspaces that align proportionally to task similarity, with overlap for similar tasks and stronger separation for dissimilar ones. The single-module baseline lacks this graded geometry. In high-dimensional regimes the architecture difference shrinks to near zero. This pattern is the main new piece: prior studies examined modularity or continual learning in isolation, but this work ties the functional value of structural separation to an empirically measured dimensionality variable induced by init scale and task similarity levels.

The systematic comparison across three similarity levels and the post-training dimensionality characterization is a clean empirical contribution. The graded alignment result in the modular case is the clearest positive evidence the paper presents.

The soft spot is exactly the one flagged in the stress-test note. Varying weight initialization scale simultaneously alters effective dimensionality, the NTK-to-feature-learning balance, per-layer gradient magnitudes in the recurrent dynamics, and the speed of task-specific feature acquisition. These factors are not orthogonalized, so the reported interaction between modularity and similarity could be driven by any of them rather than dimensionality per se. The paper would be tighter if it included an orthogonal manipulation, such as explicit dimensionality reduction or hidden-size variation, to isolate the claimed causal role. No circularity or self-defined quantities appear in the main claims.

The work is aimed at researchers in continual learning who already think about representational geometry and architecture trade-offs. Readers who design sequential systems or study interference will get concrete guidance on when modular separation is worth the cost. It is coherent on its own terms and shows honest engagement with the stability-plasticity tension, so it deserves a serious referee even though the causal interpretation needs tightening. I would send it to review with a request for additional controls on the dimensionality manipulation.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical study comparing a task-partitioned modular recurrent network to a single-module baseline in a continual learning setting. By varying task similarity (low, medium, high) and weight initialization scale (to induce different effective dimensionalities), the authors claim that in high-dimensional regimes, network architecture has minimal impact on learning, whereas in lower-dimensional regimes, modularity enables a graded representational geometry—characterized by subspace alignment for similar tasks, partial orthogonalization for moderate dissimilarity, and stronger separation for dissimilar tasks—that is absent in the single-network baseline. The central conclusion is that representational dimensionality serves as a key organizing variable determining when structural separation becomes functionally relevant.

Significance. If the central claim holds, this work would provide valuable insight into the conditions under which architectural modularity aids continual learning, particularly by linking it to representational dimensionality and adaptive geometry. The controlled comparison of architectures across task similarities and regimes is a strength, and the emphasis on geometry as a design principle could inform future continual learning systems. However, the significance is tempered by the need to confirm that dimensionality, rather than co-varying factors, drives the observed effects.

major comments (3)

[Methods] The manipulation of weight initialization scale to control effective dimensionality (as described in the experimental setup) simultaneously alters multiple other quantities, such as the NTK/feature-learning balance, per-layer gradient magnitudes in the recurrent dynamics, and the rate of task-specific feature acquisition. Since these are not orthogonalized, it is unclear whether the graded geometry observed only in the modular network under low-dimensional conditions is caused by dimensionality per se or by one of the co-varying factors. This is load-bearing for the claim that dimensionality is the organizing variable.
[Results] The manuscript lacks details on statistical tests, error bars, data exclusion rules, and full methods for measuring effective dimensionality and subspace alignments. Without these, it is difficult to verify the reliability of the reported graded geometry and the absence of similar patterns in the baseline.
[Discussion] The generalizability of the rich vs. lazy regimes induced by initialization scale beyond the specific recurrent architectures and task similarity measures used should be discussed more explicitly, as the assumption that these reliably create distinct regimes is central to interpreting the architecture-by-dimensionality interaction.

minor comments (2)

[Abstract] The abstract is clear but could more precisely define 'effective dimensionality' and 'graded alignment' for readers unfamiliar with the terms.
[Figures] Ensure all figures include error bars and clear legends for the different conditions (low/medium/high similarity, modular vs baseline).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us identify opportunities to strengthen the clarity and rigor of our manuscript. We address each major comment point by point below, indicating where revisions will be made.


Referee: [Methods] The manipulation of weight initialization scale to control effective dimensionality (as described in the experimental setup) simultaneously alters multiple other quantities, such as the NTK/feature-learning balance, per-layer gradient magnitudes in the recurrent dynamics, and the rate of task-specific feature acquisition. Since these are not orthogonalized, it is unclear whether the graded geometry observed only in the modular network under low-dimensional conditions is caused by dimensionality per se or by one of the co-varying factors. This is load-bearing for the claim that dimensionality is the organizing variable.

Authors: We acknowledge that varying initialization scale affects multiple co-varying factors, including the NTK/feature-learning balance, gradient magnitudes, and feature acquisition dynamics, and that these are not fully orthogonalized by design. Our empirical strategy relies on directly measuring effective dimensionality of the learned representations to characterize regimes and correlate them with the emergence of graded geometry. In the revised manuscript, we will add a dedicated subsection in Methods and an expanded paragraph in Discussion that explicitly lists these co-varying quantities, reports additional post-hoc analyses (e.g., correlations between measured dimensionality and NTK spectral properties), and discusses the limitations of attributing effects solely to dimensionality. We will also outline why full orthogonalization would require new experimental controls that lie beyond the current scope. revision: partial
Referee: [Results] The manuscript lacks details on statistical tests, error bars, data exclusion rules, and full methods for measuring effective dimensionality and subspace alignments. Without these, it is difficult to verify the reliability of the reported graded geometry and the absence of similar patterns in the baseline.

Authors: We thank the referee for highlighting these reporting gaps. In the revised manuscript we will: (i) add error bars (standard error across random seeds) to all quantitative figures; (ii) report the statistical tests performed (including t-tests or ANOVA with p-values and effect sizes for key comparisons); (iii) state explicitly that no data were excluded beyond standard preprocessing; and (iv) expand the Methods section with precise definitions, formulas, and hyperparameter values for effective dimensionality (the number of principal components required to explain 99% of the variance, as in Figure 3) and subspace alignment metrics (principal-angle cosine similarities), together with pseudocode and a pointer to the released analysis scripts. revision: yes
Referee: [Discussion] The generalizability of the rich vs. lazy regimes induced by initialization scale beyond the specific recurrent architectures and task similarity measures used should be discussed more explicitly, as the assumption that these reliably create distinct regimes is central to interpreting the architecture-by-dimensionality interaction.

Authors: We agree that the generalizability of the rich/lazy regimes merits explicit discussion. In the revised Discussion we will add a new paragraph that (a) situates our initialization-scale manipulation within the broader literature on rich versus lazy training, (b) notes that the observed regime distinctions are demonstrated for the specific recurrent architectures and task-similarity metric employed, and (c) outlines the conditions under which similar regime effects are expected (or not) in feedforward networks, alternative similarity measures, and other continual-learning benchmarks. We will also include a brief limitations subsection flagging the need for future cross-architecture validation. revision: yes
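
The subspace-alignment metric named in the second response (principal-angle cosine similarities) can be illustrated with a short sketch. The paper's exact procedure is not given here, so fitting each task subspace as the top-`k` principal directions of that task's activations, and the helper names `top_k_basis` and `principal_angle_cosines`, are assumptions.

```python
import numpy as np


def top_k_basis(acts, k):
    """Orthonormal basis (n_units, k) for the top-k principal directions.

    Assumption: `acts` has shape (n_samples, n_units).
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def principal_angle_cosines(acts_a, acts_b, k):
    """Cosines of the principal angles between two k-dimensional subspaces."""
    basis_a = top_k_basis(acts_a, k)
    basis_b = top_k_basis(acts_b, k)
    # Singular values of Q_a^T Q_b are the cosines of the principal angles.
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.clip(cosines, 0.0, 1.0)


# Example: two tasks occupying disjoint coordinate planes of a 6-unit layer.
rng = np.random.default_rng(0)
axes = np.eye(6)
acts_task_a = rng.normal(size=(200, 2)) @ axes[:2]
acts_task_b = rng.normal(size=(200, 2)) @ axes[2:4]
print(principal_angle_cosines(acts_task_a, acts_task_b, k=2))
```

Read against the graded geometry described in the abstract, cosines near 1 would indicate overlapping subspaces, cosines near 0 would indicate orthogonal ones, and intermediate values would correspond to partial orthogonalization.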

Circularity Check

0 steps flagged

No circularity: purely empirical observations with no derived predictions or self-referential definitions


The paper reports experimental results comparing modular and single-module recurrent networks under varying task similarity and weight initialization scales. Effective dimensionality is measured post-training as a characterization of learning regimes rather than used to derive or fit any target quantity. No equations, predictions, or central claims reduce by construction to fitted inputs, self-citations, or renamed known results; the graded subspace alignment findings are direct observations of representational geometry. The work is self-contained against external benchmarks because all quantities (dimensionality, alignment metrics) are computed from the trained networks without circular re-use of the same data for both fitting and prediction.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that initialization scale controls effective dimensionality and that this dimensionality in turn controls interference. No new mathematical axioms or invented entities are introduced; the work is purely experimental.

free parameters (2)

weight initialization scale
Used to induce different learning regimes; its specific values are chosen to produce high vs low effective dimensionality.
task similarity levels
Low/medium/high similarity is defined by the authors and controls expected overlap.


