When Does Structure Matter in Continual Learning? Dimensionality Controls When Modularity Shapes Representational Geometry
Pith reviewed 2026-05-07 06:51 UTC · model grok-4.3
The pith
Representational dimensionality controls when modularity shapes representational geometry in continual learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the effective dimensionality of representations is low, which the paper induces with small weight initialization scales, a modular recurrent network develops task-specific subspaces whose alignment tracks task similarity, enabling appropriate transfer and reduced interference; a monolithic network does not develop such structured geometry. In high-dimensional regimes induced by larger initialization scales, the two architectures behave similarly and modularity has minimal impact. The paper thus presents dimensionality as the variable that determines when architectural structure is functionally relevant in continual learning.
What carries the argument
The effective dimensionality of learned representations, modulated by weight initialization scale to induce rich versus lazy regimes, which in turn determines the degree to which modular architecture influences the alignment and separation of task-specific representational subspaces.
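As a rough illustration of what "effective dimensionality" can mean in practice, the sketch below computes the participation ratio of the covariance spectrum, the measure named in the simulated rebuttal further down this page. Whether the paper uses exactly this estimator is an assumption, and the function name, array shapes, and synthetic data are hypothetical choices made only for this example.

```python
import numpy as np


def participation_ratio(activity: np.ndarray) -> float:
    """Effective dimensionality of an (n_samples, n_units) activity matrix.

    PR = (sum_i lambda_i)^2 / sum_i lambda_i^2, where lambda_i are the
    eigenvalues of the covariance matrix of the activity.
    """
    centered = activity - activity.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (activity.shape[0] - 1)
    # eigvalsh is for symmetric matrices; clip tiny negative round-off values.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(eigvals.sum() ** 2 / np.sum(eigvals ** 2))


# Synthetic data only (not from the paper).
rng = np.random.default_rng(0)
n_samples, n_units = 2000, 50

# Activity confined to a 3-dimensional subspace of unit space.
latent = rng.standard_normal((n_samples, 3))
low_dim = latent @ rng.standard_normal((3, n_units))

# Independent noise in every unit.
high_dim = rng.standard_normal((n_samples, n_units))

pr_low = participation_ratio(low_dim)
pr_high = participation_ratio(high_dim)
```

The participation ratio is bounded above by the rank of the covariance, so `pr_low` cannot exceed 3, while `pr_high` approaches the number of units as the sample count grows.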
If this is right
- Modular architectures are functionally relevant primarily in constrained, low-dimensional settings where they enable similarity-dependent subspace geometry.
- High-dimensional representations mitigate interference without requiring explicit structural separation.
- Continual learning systems can leverage adaptive representational geometry as a design principle to balance plasticity and stability.
- Task similarity measures can predict the extent of representational overlap in modular networks under rich regimes.
Where Pith is reading between the lines
- Controlling dimensionality dynamically during learning could allow systems to adaptively use modularity only when beneficial.
- Similar dimensionality effects might appear in other architectures, suggesting a general principle for continual learning beyond recurrent networks.
- Empirical validation on diverse task sequences could test whether the observed graded geometry improves long-term retention and transfer.
Load-bearing premise
Varying the scale of weight initialization creates reliably distinct rich and lazy regimes with corresponding effective dimensionalities that generalize to other architectures and task types.
What would settle it
The claim would be undermined if modular and single-module networks produced equivalent representational geometries across all initialization scales and task similarities, or if low-dimensional regimes failed to produce the predicted graded alignment patterns.
Original abstract
To preserve previously learned representations, continual learning systems must strike a balance between plasticity, the ability to acquire new knowledge, and stability. This stability-plasticity dilemma affects how representations can be reused across tasks: shared structure enables transfer when tasks are similar but may also induce interference when new learning disrupts existing representations. However, it remains unclear when and why structural separation influences this trade-off. In this study, we examine how network architecture, task similarity, and representational dimensionality jointly shape learning in a sequential task paradigm inspired by transfer-interference studies. We compare a task-partitioned modular recurrent network with a single-module baseline by systematically varying task similarity (low, medium, high) and the scale of weight initialization, which induces different learning regimes that we empirically characterize through the effective dimensionality of the learned representations. We find that architecture has minimal impact in high-dimensional regimes where representations are sufficiently unconstrained to accommodate multiple tasks without strong interference. In contrast, in lower-dimensional (rich) regimes, architectural separation is decisive: modular networks exhibit graded alignment of task-specific subspaces with overlap for similar tasks, partial orthogonalization for moderately dissimilar tasks, and stronger separation for dissimilar tasks. This graded geometry is absent in the single network baseline. Our findings suggest that representational dimensionality acts as a key organizing variable governing when structural separation becomes functionally relevant, and highlight adaptive geometry as a central principle for designing continual learning systems.
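The two manipulations described in the abstract, initialization scale and task-partitioned modularity, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the Gaussian draw, the `gain / sqrt(n_units)` scaling, the block-diagonal mask, and all names (`init_recurrent_weights`, `gain`, `n_modules`) are hypothetical choices for illustration.

```python
import numpy as np


def init_recurrent_weights(n_units: int, gain: float,
                           n_modules: int = 1, seed: int = 0) -> np.ndarray:
    """Gaussian recurrent weights with standard deviation gain / sqrt(n_units).

    With n_modules > 1, connections between modules are zeroed out by a
    block-diagonal mask (an assumed form of architectural separation).
    """
    if n_units % n_modules != 0:
        raise ValueError("n_units must be divisible by n_modules")
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((n_units, n_units)) * gain / np.sqrt(n_units)
    # Assign each unit to a module, then keep only within-module connections.
    module_of_unit = np.repeat(np.arange(n_modules), n_units // n_modules)
    mask = module_of_unit[:, None] == module_of_unit[None, :]
    return weights * mask


# Small gain: the regime the page associates with rich, low-dimensional learning.
w_small = init_recurrent_weights(64, gain=0.1)
# Large gain: the regime the page associates with lazy, high-dimensional learning.
w_large = init_recurrent_weights(64, gain=2.0)
# Two-module network with no cross-module recurrent connections.
w_modular = init_recurrent_weights(64, gain=0.1, n_modules=2)
```

Note that masking reduces each unit's fan-in, so a real comparison would need to decide whether to rescale the modular weights; the sketch leaves them unscaled.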
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study comparing a task-partitioned modular recurrent network to a single-module baseline in a continual learning setting. By varying task similarity (low, medium, high) and weight initialization scale (to induce different effective dimensionalities), the authors claim that in high-dimensional regimes, network architecture has minimal impact on learning, whereas in lower-dimensional regimes, modularity enables a graded representational geometry—characterized by subspace alignment for similar tasks, partial orthogonalization for moderate dissimilarity, and stronger separation for dissimilar tasks—that is absent in the single-network baseline. The central conclusion is that representational dimensionality serves as a key organizing variable determining when structural separation becomes functionally relevant.
Significance. If the central claim holds, this work would provide valuable insight into the conditions under which architectural modularity aids continual learning, particularly by linking it to representational dimensionality and adaptive geometry. The controlled comparison of architectures across task similarities and regimes is a strength, and the emphasis on geometry as a design principle could inform future continual learning systems. However, the significance is tempered by the need to confirm that dimensionality, rather than co-varying factors, drives the observed effects.
major comments (3)
- [Methods] The manipulation of weight initialization scale to control effective dimensionality (as described in the experimental setup) simultaneously alters multiple other quantities, such as the NTK/feature-learning balance, per-layer gradient magnitudes in the recurrent dynamics, and the rate of task-specific feature acquisition. Since these are not orthogonalized, it is unclear whether the graded geometry observed only in the modular network under low-dimensional conditions is caused by dimensionality per se or by one of the co-varying factors. This is load-bearing for the claim that dimensionality is the organizing variable.
- [Results] The manuscript lacks details on statistical tests, error bars, data exclusion rules, and full methods for measuring effective dimensionality and subspace alignments. Without these, it is difficult to verify the reliability of the reported graded geometry and the absence of similar patterns in the baseline.
- [Discussion] The generalizability of the rich vs. lazy regimes induced by initialization scale beyond the specific recurrent architectures and task similarity measures used should be discussed more explicitly, as the assumption that these reliably create distinct regimes is central to interpreting the architecture-by-dimensionality interaction.
minor comments (2)
- [Abstract] The abstract is clear but could more precisely define 'effective dimensionality' and 'graded alignment' for readers unfamiliar with the terms.
- [Figures] Ensure all figures include error bars and clear legends for the different conditions (low/medium/high similarity, modular vs baseline).
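The reporting the referee asks for (error bars and a test statistic across random seeds) can be sketched as below. This is an assumed, generic recipe rather than the paper's analysis: the helper names are hypothetical, the input numbers are placeholders and not results from the paper, and only the Welch t statistic is computed, not its p-value.

```python
import numpy as np


def mean_and_sem(values) -> tuple[float, float]:
    """Mean and standard error of the mean across seeds (sample std, ddof=1)."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1) / np.sqrt(values.size))


def welch_t(a, b) -> float:
    """Welch's t statistic for two groups with possibly unequal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    standard_error = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return float((a.mean() - b.mean()) / standard_error)


# Placeholder per-seed values for two conditions (illustrative only).
condition_a = [1.0, 2.0, 3.0, 4.0]
condition_b = [2.0, 3.0, 4.0, 5.0]

mean_a, sem_a = mean_and_sem(condition_a)
t_stat = welch_t(condition_a, condition_b)
```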
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which have helped us identify opportunities to strengthen the clarity and rigor of our manuscript. We address each major comment point by point below, indicating where revisions will be made.
Referee: [Methods] The manipulation of weight initialization scale to control effective dimensionality (as described in the experimental setup) simultaneously alters multiple other quantities, such as the NTK/feature-learning balance, per-layer gradient magnitudes in the recurrent dynamics, and the rate of task-specific feature acquisition. Since these are not orthogonalized, it is unclear whether the graded geometry observed only in the modular network under low-dimensional conditions is caused by dimensionality per se or by one of the co-varying factors. This is load-bearing for the claim that dimensionality is the organizing variable.
Authors: We acknowledge that varying initialization scale affects multiple co-varying factors, including the NTK/feature-learning balance, gradient magnitudes, and feature acquisition dynamics, and that these are not fully orthogonalized by design. Our empirical strategy relies on directly measuring effective dimensionality of the learned representations to characterize regimes and correlate them with the emergence of graded geometry. In the revised manuscript, we will add a dedicated subsection in Methods and an expanded paragraph in Discussion that explicitly lists these co-varying quantities, reports additional post-hoc analyses (e.g., correlations between measured dimensionality and NTK spectral properties), and discusses the limitations of attributing effects solely to dimensionality. We will also outline why full orthogonalization would require new experimental controls that lie beyond the current scope. revision: partial
Referee: [Results] The manuscript lacks details on statistical tests, error bars, data exclusion rules, and full methods for measuring effective dimensionality and subspace alignments. Without these, it is difficult to verify the reliability of the reported graded geometry and the absence of similar patterns in the baseline.
Authors: We thank the referee for highlighting these reporting gaps. In the revised manuscript we will: (i) add error bars (standard error across random seeds) to all quantitative figures; (ii) report the statistical tests performed (including t-tests or ANOVA with p-values and effect sizes for key comparisons); (iii) state explicitly that no data were excluded beyond standard preprocessing; and (iv) expand the Methods section with precise definitions, formulas, and hyperparameter values for effective dimensionality (participation ratio on the covariance spectrum) and subspace alignment metrics (principal-angle cosine similarities), together with pseudocode and a pointer to the released analysis scripts. revision: yes
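The subspace alignment metric named in this response, principal-angle cosine similarities, could be computed along the following lines. This is a hedged sketch, not the released analysis scripts: taking each task subspace as the top-k principal components of the activity is an assumption, and `top_pc_basis` and `principal_angle_cosines` are hypothetical names.

```python
import numpy as np


def top_pc_basis(activity: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal basis (n_units, k) spanning the top-k principal components."""
    centered = activity - activity.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions in unit space, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def principal_angle_cosines(basis_a: np.ndarray, basis_b: np.ndarray) -> np.ndarray:
    """Cosines of the principal angles between two subspaces.

    Both inputs must have orthonormal columns; the singular values of
    basis_a.T @ basis_b are then the cosines, in descending order.
    """
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.clip(cosines, 0.0, 1.0)


# Toy subspaces built from coordinate axes in a 6-unit space.
axes = np.eye(6)
cos_same = principal_angle_cosines(axes[:, :2], axes[:, :2])
cos_partial = principal_angle_cosines(axes[:, :2], axes[:, [0, 2]])
cos_orthogonal = principal_angle_cosines(axes[:, :2], axes[:, 2:4])

# Basis estimated from synthetic activity (illustrative only).
rng = np.random.default_rng(0)
basis = top_pc_basis(rng.standard_normal((200, 6)), k=2)
```

Cosines near 1 indicate overlapping subspaces, cosines near 0 indicate orthogonal ones, and mixed values correspond to the partial orthogonalization described for moderately dissimilar tasks.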
Referee: [Discussion] The generalizability of the rich vs. lazy regimes induced by initialization scale beyond the specific recurrent architectures and task similarity measures used should be discussed more explicitly, as the assumption that these reliably create distinct regimes is central to interpreting the architecture-by-dimensionality interaction.
Authors: We agree that the generalizability of the rich/lazy regimes merits explicit discussion. In the revised Discussion we will add a new paragraph that (a) situates our initialization-scale manipulation within the broader literature on rich versus lazy training, (b) notes that the observed regime distinctions are demonstrated for the specific recurrent architectures and task-similarity metric employed, and (c) outlines the conditions under which similar regime effects are expected (or not) in feedforward networks, alternative similarity measures, and other continual-learning benchmarks. We will also include a brief limitations subsection flagging the need for future cross-architecture validation. revision: yes
Circularity Check
No circularity: purely empirical observations with no derived predictions or self-referential definitions
The paper reports experimental results comparing modular and single-module recurrent networks under varying task similarity and weight initialization scales. Effective dimensionality is measured post-training as a characterization of learning regimes rather than used to derive or fit any target quantity. No equations, predictions, or central claims reduce by construction to fitted inputs, self-citations, or renamed known results; the graded subspace alignment findings are direct observations of representational geometry. The work is self-contained against external benchmarks because all quantities (dimensionality, alignment metrics) are computed from the trained networks without circular re-use of the same data for both fitting and prediction.
Axiom & Free-Parameter Ledger
free parameters (2)
- weight initialization scale
- task similarity levels