arxiv: 2606.11123 · v1 · pith:GH6P74ROnew · submitted 2026-06-09 · 💻 cs.LG

Overcoming Rank Collapse in Feedback Alignment

Pith reviewed 2026-06-27 13:51 UTC · model grok-4.3

classification 💻 cs.LG
keywords: feedback alignment, rank collapse, error signal dimensionality, backpropagation alternatives, orthogonal updates, activity normalisation, neural network scaling, CIFAR benchmarks

The pith

Feedback alignment fails to scale because its error signals occupy lower-dimensional subspaces than backpropagation gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that feedback alignment produces error signals with substantially lower effective rank than those from backpropagation. This confinement to a smaller subspace restricts how much the network can adjust its parameters during training, especially in deeper models. The authors introduce two interventions that raise the dimensionality of the updates: the Muon optimizer, which orthogonalizes weight changes, and hidden activity normalisation, which encourages orthogonal activations. Both interventions produce measurable gains over standard feedback alignment on larger networks and datasets. The central finding frames low-dimensional gradient dynamics as the primary barrier to making feedback alignment work at scale.

Core claim

When a network is trained with fixed random feedback weights, the resulting error signal has considerably lower rank than the true gradient from backpropagation and is therefore confined to a lower-dimensional subspace; this limits exploration of the parameter space and prevents effective learning in deeper architectures. Mechanisms that orthogonalise weight updates or promote activation orthogonality increase the effective dimensionality of the updates and produce consistent accuracy improvements, including a nine-point gain on CIFAR100 with a ResNet-18.
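
As a minimal sketch of the setup this claim describes, assume a two-layer ReLU MLP with a squared-error loss (the paper itself works with convolutional networks and cross-entropy). The only change from backpropagation is that the hidden-layer error travels back through a fixed random matrix `B` instead of `W2.T`. All names here (`fa_step`, `B`, `use_fa`) are illustrative and not taken from the paper.

```python
import numpy as np

def fa_step(x, y, W1, W2, B, lr=0.1, use_fa=True):
    """One SGD step on a 2-layer ReLU MLP with squared-error loss.

    x: (batch, d_in), y: (batch, d_out)
    W1: (d_in, d_hid), W2: (d_hid, d_out)
    B: (d_out, d_hid) fixed random feedback matrix (never updated).
    """
    h_pre = x @ W1
    h = np.maximum(h_pre, 0.0)           # ReLU hidden activity
    y_hat = h @ W2
    e = (y_hat - y) / x.shape[0]         # output error, averaged over the batch
    # BP sends the error back through W2.T; FA uses the fixed matrix B instead.
    feedback = B if use_fa else W2.T
    delta_h = (e @ feedback) * (h_pre > 0)
    W2_new = W2 - lr * (h.T @ e)         # output layer: same update in BP and FA
    W1_new = W1 - lr * (x.T @ delta_h)   # hidden layer: differs between BP and FA
    loss = 0.5 * np.sum((y_hat - y) ** 2) / x.shape[0]
    return W1_new, W2_new, loss
```

With `use_fa=False` the function computes the exact gradient step, so the same code can serve as the BP baseline when comparing the rank of `delta_h` between the two rules.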

What carries the argument

The effective rank of the error signal, which determines the dimensionality of the subspace available for parameter updates.
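
This summary does not state how the paper computes effective rank, so the following is only a sketch of one common definition: the exponential of the Shannon entropy of the normalised singular values (the definition usually attributed to Roy and Vetterli). The function name and the `eps` guard are assumptions made for illustration.

```python
import numpy as np

def effective_rank(M, eps=1e-12):
    """exp(Shannon entropy) of the normalised singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0                       # all-zero matrix: no directions at all
    p = s / total
    p = p[p > eps]                       # drop zeros so the log is defined
    return float(np.exp(-np.sum(p * np.log(p))))
```

Under this definition an identity matrix of size n has effective rank n, and a rank-1 outer product has effective rank 1. To compare FA and BP, one would apply it to a per-layer error or gradient matrix, with convolutional kernels reshaped to 2D first (again an assumption about the procedure).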

If this is right

  • Feedback alignment error signals are confined to lower-dimensional subspaces than backpropagation gradients.
  • Muon and hidden activity normalisation each raise the effective rank of the updates.
  • These changes produce consistent gains across larger architectures and benchmarks.
  • Accuracy on CIFAR100 with ResNet-18 improves by nine percentage points over plain feedback alignment.
  • Low-dimensional gradient dynamics constitute the central obstacle to scaling feedback alignment.
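
To make the Muon bullet concrete, here is a sketch of what orthogonalising an update does to its spectrum. It replaces the update `G` with its polar factor `U @ Vt`, which sets every non-zero singular value to 1. Muon is generally described as approximating this with a Newton-Schulz iteration rather than an exact SVD; the exact SVD is used here only for clarity, and `orthogonalise` is a hypothetical helper name.

```python
import numpy as np

def orthogonalise(G):
    """Return the polar factor U @ Vt of G (all singular values set to 1)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
# A nearly rank-1 update: one dominant direction plus small noise.
G = np.outer(rng.normal(size=6), rng.normal(size=4)) + 1e-3 * rng.normal(size=(6, 4))
O = orthogonalise(G)
print(np.linalg.svd(G, compute_uv=False))  # one large value, the rest tiny
print(np.linalg.svd(O, compute_uv=False))  # all numerically equal to 1
```

The nearly rank-1 input mimics a collapsed FA update; after orthogonalisation every direction present in `G` receives equal weight, which is the sense in which the update becomes higher-dimensional.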

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Measuring effective rank during training could act as an early diagnostic for whether feedback alignment will succeed on a given architecture.
  • The same dimensionality constraint may affect other learning rules that avoid explicit backpropagation.
  • Combining rank-increasing methods with different random feedback initialisations could be tested to isolate their separate contributions.
  • The approach could be checked on recurrent or attention-based models to see whether rank collapse appears outside feed-forward convolutional networks.

Load-bearing premise

The reduced effective rank of the feedback error signal is the main reason feedback alignment fails to scale, rather than other factors such as random weight initialisation or hyperparameter choices.

What would settle it

Applying Muon or activity normalisation to deeper networks and finding no accuracy improvement, or measuring that the effective rank stays low despite these changes.

Figures

Figures reproduced from arXiv: 2606.11123 by Claudia Clopath, Gauthier Boeshertz, Razvan Pascanu.

Figure 1: FA suffers from low-dimensional gradients. (A) Increasing the number of layers results in lower performance in FA networks. (B-H) Metrics for the 4 convolutional layer model. (B) The process of feedback alignment makes the FA network converge much later than BP networks. (C) Weight alignment measured by the cosine similarity of the forward and feedback weights. The upper layers align, but the bottom layers do not.…
Figure 2: Orthogonalising the updates results in better FA. (A) Effect of adding layers on the accuracy of FA networks. The accuracy of models trained with standard optimisers decreases, while it increases with Muon. (B-G) Metrics for the 4 convolutional layer model. (B) Models trained with Muon converge earlier and with a smoother curve. (C) Models trained with Muon have higher weight alignment. (D) Models trained with Muon have highe…
Figure 3: Using the top-K directions as the opposite of Muon. Using low-rank SGD results in much worse performance with FA, whereas the effect is not so large with BP. Given that an optimiser that increases the dimensionality of the update improves the performance of FA models, we investigated how an optimiser that does the opposite affects FA. To do so, we replaced the orthogonalisation function in Muon with th…
Figure 4: Optimising FA networks with BN results in better FA. (A) Effect of adding batch normalisation layers on the accuracy of FA networks. The accuracy of models trained without BN decreases, while the accuracy of models trained with BN does not. (B-G) Metrics for the 4 convolutional layer model. (B) Models with BN layers converge earlier and with a smoother curve. (C) Models with BN layers have higher weight al…
Figure 5: Adding a local loss to the FA gradients. (A) Effect of adding layers on the accuracy of FA networks. The accuracy of models trained with standard optimisers decreases, while those trained with the local loss … (B-G) Metrics for the 4 convolutional layer model. (B) Models trained with the local loss converge earlier and with a smoother curve. (C) Models trained with local loss have higher weight alignment. (D) Models trained wit…
Figure 6: A more balanced singular value spectrum improves performance. Increasing the exponent in the Freon optimiser [Shumaylov et al., 2026], which, until 0.5, flattens the spectrum of the updates, then makes it skew the other way, increases the performance of the FA networks. [Plot: Effect of adding noise to the updates in FA; x-axis: Number of layers; y-axis: Accuracy; legend: Optimizer (SGD, AdamW), Noise Sca…]
Figure 7: Adding noise to the updates does not make the FA models better. We added Gaussian noise to the parameter updates, matching their norm or 0.1 of their norm. SGD deteriorates with even small noise compared to updates, whereas AdamW is not so affected.
Figure 8: FA has lower gradient dimensionality across layers and depths. Effective rank of the gradients for BP and FA networks with different depths. The reduction in gradient dimensionality is visible beyond the last convolutional layer and becomes stronger in deeper networks. [Plot panels: Depth 1, Conv Layer 1; Depth 2, Conv Layer 1; Conv Layer 2; …]
Figure 9: FA has lower-dimensional gradient trajectories across layers and depths. Effective rank of the Gram matrix of the gradient trajectory for BP and FA networks. Compared with BP, FA gradients span a smaller subspace during training, especially in deeper networks.
Figure 10: Muon increases the gradient dimensionality of FA networks across layers and depths. Effective rank of the gradients for FA networks trained with different optimisers. Orthogonalised updates lead to higher-dimensional gradients, particularly in deeper networks where standard FA suffers most.
Figure 11: Muon increases the dimensionality of the FA gradient trajectory. Effective rank of the Gram matrix of the gradient trajectory for FA networks trained with different optimisers. Muon prevents the trajectory from collapsing onto a small number of directions, resulting in richer update dynamics across layers.
Figure 12: BN increases the gradient dimensionality of FA networks across layers and depths. Effective rank of the gradients for FA networks trained with and without BN. Normalising the activities helps prevent the gradients from becoming overly low-dimensional, especially in deeper models.
Figure 13: BN increases the dimensionality of the FA gradient trajectory. Effective rank of the Gram matrix of the gradient trajectory for FA networks trained with and without BN. The higher trajectory dimensionality indicates that BN helps FA explore a richer set of update directions during training.
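
Figures 9, 11 and 13 report the effective rank of the Gram matrix of the gradient trajectory. The excerpted captions do not spell out the construction, so the following is a sketch under assumptions: gradients recorded at T training steps are flattened into the rows of a T × P matrix, the Gram matrix is their T × T matrix of inner products, and effective rank is taken as the exponential entropy of its normalised eigenvalues. Function names are illustrative.

```python
import numpy as np

def trajectory_gram(grads):
    """Stack flattened gradients (one per recorded step) and return G @ G.T."""
    G = np.stack([np.asarray(g).ravel() for g in grads])   # shape (T, P)
    return G @ G.T                                          # shape (T, T)

def spectrum_effective_rank(K, eps=1e-12):
    """exp(entropy) of the normalised eigenvalues of a symmetric PSD matrix."""
    lam = np.clip(np.linalg.eigvalsh(K), 0.0, None)  # clip tiny negative values
    total = lam.sum()
    if total <= eps:
        return 0.0
    p = lam / total
    p = p[p > eps]                                   # drop zeros before the log
    return float(np.exp(-np.sum(p * np.log(p))))
```

A trajectory whose gradients are all multiples of one direction gives a value of 1, while T mutually orthogonal gradients of equal norm give T, so a low value indicates that training keeps reusing the same few update directions.
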

Abstract

Backpropagation (BP) is widely viewed as biologically implausible, in part because it requires feedback weights to be the transpose of forward weights for error propagation. Interestingly, when training a network with fixed random feedback weights to circumvent this issue, learning aligns the forward weights with the feedback weights, leading the backpropagated error signal to become an approximation of the standard gradient used by BP. This process, called Feedback Alignment (FA), occurs in MLPs and very shallow CNNs but does not scale well to deeper architectures. In this work, we first investigated differences between BP and FA models, trained on CIFAR10, specifically focusing on the effective rank of the signal. We found that the FA error has a considerably lower rank and hence is constrained to a lower-dimensional subspace compared to BP, limiting exploration of the parameter space. Motivated by this observation, we evaluated two mechanisms for increasing the effective dimensionality of FA: Muon, an optimiser that orthogonalises weight updates; and hidden activity normalisation, which promotes activation orthogonality. Across larger architectures and benchmarks, we find that these methods consistently improve over FA baselines, for example, on CIFAR100 with a Resnet-18, accuracy increases by 9 percentage points. Our results identify low-dimensional gradient dynamics as a key obstacle to scaling FA and suggest that inducing higher-dimensional update geometry is a promising route toward scaling alternatives to backpropagation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that Feedback Alignment (FA) fails to scale to deeper networks because its error signal has lower effective rank than backpropagation (BP), constraining updates to a lower-dimensional subspace. It proposes Muon (orthogonalizing updates) and hidden-activity normalization to increase update dimensionality, reporting consistent accuracy gains over FA baselines (e.g., +9 pp on CIFAR-100 with ResNet-18).

Significance. If the mechanism holds, the work identifies low-dimensional dynamics as a concrete obstacle to scaling alternatives to backpropagation and supplies practical interventions that improve FA on standard benchmarks. The direct experimental comparisons on CIFAR-10/100 and ResNet architectures constitute a strength.

major comments (3)
  1. [§4] CIFAR-100 ResNet-18 results: the 9 pp accuracy gain is presented without reported standard deviations, number of independent runs, or statistical tests, so the magnitude and reliability of the improvement cannot be assessed.
  2. [Motivation and experimental sections] No ablation applies Muon or normalization to standard BP (or to FA while holding rank fixed) to test whether gains are specific to FA's rank deficiency or arise from generic conditioning/optimization effects; without this, the causal attribution of performance to rank remains untested.
  3. [Results] The paper does not report measurements of effective rank after applying the proposed methods, nor any correlation between rank increase and accuracy delta across runs or architectures.
minor comments (1)
  1. The rank analysis on CIFAR-10 is described only briefly; expanding the description of how effective rank is computed (e.g., singular-value threshold) would improve reproducibility.
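
The minor comment matters because different rank definitions can give different numbers for the same matrix. As an illustration only (the paper's definition is not given here), the sketch below counts singular values above a relative threshold `tol`. Both the function name and the default threshold are assumptions.

```python
import numpy as np

def threshold_rank(M, tol=1e-2):
    """Number of singular values larger than tol * (largest singular value)."""
    s = np.linalg.svd(M, compute_uv=False)
    if s.size == 0 or s[0] == 0:
        return 0
    return int(np.sum(s > tol * s[0]))

# Diagonal matrix with a decaying spectrum: 1, 0.1, 0.01, 0.001
M = np.diag([1.0, 0.1, 0.01, 0.001])
for tol in (0.5, 0.05, 0.005):
    print(tol, threshold_rank(M, tol))   # the count grows as tol shrinks
```

For this matrix the count is 1, 2 or 3 depending on the threshold, which is why a reproducible rank analysis needs the threshold (or the entropy-based alternative) stated explicitly.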

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of statistical reporting, experimental controls, and mechanistic validation. We address each point below and will revise the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [§4] CIFAR-100 ResNet-18 results: the 9 pp accuracy gain is presented without reported standard deviations, number of independent runs, or statistical tests, so the magnitude and reliability of the improvement cannot be assessed.

    Authors: We agree that standard deviations, the number of independent runs, and statistical tests are necessary to assess reliability. In the revised manuscript we will report all CIFAR-100 ResNet-18 results as means over at least five independent runs together with standard deviations and appropriate statistical comparisons.

    Revision: yes

  2. Referee: [Motivation and experimental sections] No ablation applies Muon or normalization to standard BP (or to FA while holding rank fixed) to test whether gains are specific to FA's rank deficiency or arise from generic conditioning/optimization effects; without this, the causal attribution of performance to rank remains untested.

    Authors: The manuscript's primary contribution is the identification of rank collapse as a scaling obstacle for FA and the demonstration that the proposed interventions improve FA. We acknowledge that ablations on BP would help isolate whether the benefits are FA-specific. We will add these ablations (Muon and normalization applied to BP) in the revised experimental section.

    Revision: yes

  3. Referee: [Results] The paper does not report measurements of effective rank after applying the proposed methods, nor any correlation between rank increase and accuracy delta across runs or architectures.

    Authors: We will add direct measurements of effective rank of the error signals after Muon and activity normalization. We will also include correlations between observed rank increases and accuracy deltas across the reported architectures and runs to provide quantitative support for the mechanistic claim.

    Revision: yes

Circularity Check

0 steps flagged

Empirical study; no derivation chain or self-referential reductions

Full rationale

The paper reports experimental observations of lower effective rank in FA error signals compared to BP on CIFAR-10, then tests two interventions (Muon optimizer and hidden-activity normalization) on larger models and benchmarks, reporting accuracy gains. No equations, first-principles derivations, or predictions are presented that reduce to fitted inputs, self-definitions, or self-citation chains. All central claims rest on direct benchmark comparisons rather than any closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is empirical and does not introduce new mathematical axioms, free parameters, or invented entities beyond standard neural network training practices and existing optimizers.

pith-pipeline@v0.9.1-grok · 5778 in / 1050 out tokens · 23601 ms · 2026-06-27T13:51:17.093532+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 1 canonical work pages

  1. Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.

  2. URL https://jeremybernste.in/writing/deriving-muon. Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325.

  3. Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068.

  4. Maher Hanut and Jonathan Kadmon. Training large neural networks with low-dimensional error feedback. arXiv preprint arXiv:2502.20580.

  5. Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit Agrawal, and Phillip Isola. The low-rank simplicity bias in deep networks. arXiv preprint arXiv:2103.10427.

  6. 2007 Matplotlib: A 2D Graphics Environment. doi: 10.1109/MCSE.2007.55. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.

  7. Julien Launay, Iacopo Poli, and Florent Krzakala. Principled training of neural networks with direct feedback alignment. arXiv preprint arXiv:1906.04554.

  8. Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982.

  9. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

  10. Theodore H Moskovitz, Ashok Litwin-Kumar, and LF Abbott. Feedback alignment in deep convolutional networks. arXiv preprint arXiv:1812.06488.

  11. Razvan Pascanu, Clare Lyle, Ionut-Vlad Modoranu, Naima Elosegui Borras, Dan Alistarh, Petar Velickovic, Sarath Chandar, Soham De, and James Martens. Optimizers qualitatively alter solutions and we should leverage this. arXiv preprint arXiv:2507.12224.

  12. Zakhar Shumaylov, Nathaël Da Costa, Peter Zaika, Bálint Mucsányi, Alex Massucco, Yoav Gelberg, Carola-Bibiane Schönlieb, Yarin Gal, and Philipp Hennig. Muon is not that special: Random or inverted spectra work just as well. arXiv preprint arXiv:2605.11181.

  13. Guillermo Valle-Perez, Chico Q Camargo, and Ard A Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522.

  14. Will Xiao, Honglin Chen, Qianli Liao, and Tomaso Poggio. Biologically-plausible learning algorithms can scale to large datasets. arXiv preprint arXiv:1811.03567.

  15. URL https://github.com/facebookresearch/hydra. A Reproducibility. For sections 3 and 4, we trained all models with 2 random seeds on the CIFAR10 [Krizhevsky, 2009] dataset. The results were stable enough not to require using more seeds. We used the cross-entropy loss, mini-batch size 64 for 50,000 steps, the configured optimiser (SGD, AdamW, Muon, or low...

  16. When applied to FA models, performance improves as the Freon exponent is increased from the SGD regime toward the Muon regime (Figure 6), supporting the claim that making the update higher-dimensional improves FA. [Plot: Interpolating between SGD and Muon in FA; x-axis: Number of layers; y-axis: Accuracy; legend: Exponent 0 (SGD), 0...]