Self-supervised local learning rules learn the hidden hierarchical structure of high-dimensional data
Pith reviewed 2026-05-20 12:54 UTC · model grok-4.3
The pith
Self-supervised local learning rules discover the hidden hierarchical structure of high-dimensional data with the same data efficiency as supervised backpropagation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Local learning rules that rely on layerwise self-supervised contrastive or non-contrastive loss functions (the paper's second type) learn the hierarchical hidden structure of the Random Hierarchy Model (RHM) tasks. They match the data efficiency of supervised backpropagation training and are compatible with known rules of synaptic plasticity in cortex, whereas direct-feedback approximations of error propagation (the first type) fail because they cannot implement the masking nonlinearities required for success.
What carries the argument
Layerwise self-supervised contrastive or non-contrastive loss functions that operate locally at each layer without approximating global output errors.
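As a rough illustration of what a layer-local objective of this kind can look like, the sketch below computes a simplified VICReg-style non-contrastive loss on the outputs of a single layer for two views of the same batch. It is a sketch under stated assumptions, not the paper's loss: the function name `layer_local_loss`, the weights, and the omission of the covariance term are choices made here for brevity.

```python
import math

def layer_local_loss(z_a, z_b, inv_weight=1.0, var_weight=1.0, eps=1e-4):
    """Simplified VICReg-style loss on one layer's outputs (assumed form).

    z_a, z_b: lists of activation vectors (batch x dim) for two views
    of the same inputs. Requires a batch of at least two samples.
    """
    n, d = len(z_a), len(z_a[0])

    # Invariance term: mean squared distance between the two views.
    inv = sum((a - b) ** 2
              for row_a, row_b in zip(z_a, z_b)
              for a, b in zip(row_a, row_b)) / (n * d)

    # Variance term: hinge that keeps each unit's standard deviation
    # across the batch above 1, which penalizes collapsed outputs.
    def var_term(z):
        total = 0.0
        for j in range(d):
            col = [row[j] for row in z]
            mean = sum(col) / n
            var = sum((x - mean) ** 2 for x in col) / (n - 1)
            total += max(0.0, 1.0 - math.sqrt(var + eps))
        return total / d

    return inv_weight * inv + var_weight * (var_term(z_a) + var_term(z_b))
```

The loss reads only the activations of the layer it is attached to, so in this sketch no signal from the output layer is needed to evaluate it.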
If this is right
- Local self-supervised rules can solve tasks that require discovering hidden hierarchical structure without a symmetric error network.
- These rules achieve the same sample efficiency as full backpropagation on the Random Hierarchy Model.
- The success depends on implementing input-specific nonlinear masking that direct feedback methods miss.
- The rules remain consistent with known forms of synaptic plasticity observed in cortex.
Where Pith is reading between the lines
- Similar layerwise self-supervised mechanisms may operate in sensory cortex to build hierarchical models from raw input.
- Artificial networks could adopt these rules to learn deep representations without relying on backpropagation across all layers.
- Experiments that apply the rules to real-world sensory data would test whether the Random Hierarchy Model captures essential structure.
- Neuroscience recordings could search for local contrastive-like signals that match the proposed loss functions.
Load-bearing premise
The Random Hierarchy Model serves as a faithful proxy for the intrinsic hierarchical structure present in real high-dimensional sensory data.
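For readers unfamiliar with the benchmark, the sketch below generates RHM-like samples as the model is commonly described: each high-level symbol expands into one of `m` synonymous tuples of `s` lower-level symbols, no tuple is shared between two symbols, and the expansion is repeated over several levels. The parameter names (`v`, `m`, `s`), the helper names, and the simplification that every level (including the class labels) uses the same vocabulary size are assumptions made here, not details taken from this page.

```python
import itertools
import random

def make_rules(v, m, s, rng):
    """Give each of v symbols m distinct s-tuples of lower-level symbols.

    Tuples are drawn without replacement, so no tuple belongs to two symbols.
    """
    tuples = list(itertools.product(range(v), repeat=s))
    if v * m > len(tuples):
        raise ValueError("not enough distinct tuples for v * m rules")
    rng.shuffle(tuples)
    return {sym: tuples[sym * m:(sym + 1) * m] for sym in range(v)}

def sample_rhm(label, rules_per_level, rng):
    """Expand a class label top-down into a sequence of low-level symbols."""
    seq = [label]
    for rules in rules_per_level:
        # Replace every symbol by one of its synonymous tuples, chosen at random.
        seq = [low for sym in seq for low in rng.choice(rules[sym])]
    return seq

# Hypothetical toy configuration: 4 symbols, 2 synonyms, pairs, 3 levels.
rng = random.Random(0)
levels = [make_rules(v=4, m=2, s=2, rng=rng) for _ in range(3)]
example = sample_rhm(label=1, rules_per_level=levels, rng=rng)
```

With `s = 2` and three levels, each sample is a sequence of `2 ** 3 = 8` low-level symbols whose class can only be recovered by undoing the composition level by level.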
What would settle it
Measuring whether these self-supervised local rules extract hierarchical features from natural image or audio datasets at data efficiency comparable to backpropagation.
Original abstract
The brain learns abstract representations of high-dimensional sensory input, but the plasticity rules that enable such learning are unknown. We study biologically plausible algorithms on the Random Hierarchy Model (RHM), an artificial dataset designed to investigate how deep neural networks learn the intrinsic hierarchical structure of high-dimensional data. We focus on two types of local learning rules that avoid both a long convergence time and the use of a symmetric error network. The first type uses direct feedback signals to approximate error propagation from the output layer. The second type uses layerwise self-supervised contrastive or non-contrastive loss functions that do not explicitly approximate errors at the output layer. We show that all rules of the first type fail to solve the tasks of the RHM and trace this failure back to input-specific nonlinearities ('masking') that are implemented in full backpropagation and are essential for learning complex tasks. However, algorithms of the second type are able to learn the hierarchical hidden structure of the RHM tasks and are as data-efficient as supervised backpropagation training, while being compatible with known rules of synaptic plasticity in cortex.
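One way to read the 'masking' argument, offered here as an interpretation rather than the paper's definition, is that in a ReLU network the backpropagated error is gated by the activation pattern of the downstream layers, and that pattern depends on the input. A direct-feedback rule replaces this path with a fixed matrix. The toy sketch below contrasts the two signals arriving at the first hidden layer; all function names, the weights, and the feedback matrix `B1` are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def backprop_signal(W2, W3, h1, err):
    """Error reaching hidden layer 1 through layer 2 (ReLU) and a linear output.

    h1 is the layer-1 activity; the layer-2 ReLU mask depends on it,
    so the same output error yields input-specific signals.
    """
    pre2 = matvec(W2, h1)
    mask2 = [1.0 if p > 0 else 0.0 for p in pre2]   # input-specific gate
    at_layer2 = matvec(transpose(W3), err)
    gated = [g * m for g, m in zip(at_layer2, mask2)]
    return matvec(transpose(W2), gated)

def direct_feedback_signal(B1, err):
    """Direct-feedback signal: a fixed matrix applied to the output error."""
    return matvec(B1, err)

W2 = [[1.0, -1.0], [-1.0, 1.0]]
W3 = [[1.0, 1.0]]
B1 = [[1.0], [1.0]]          # hypothetical fixed feedback weights
err = [1.0]

bp_a = backprop_signal(W2, W3, [1.0, 0.0], err)   # one input pattern
bp_b = backprop_signal(W2, W3, [0.0, 1.0], err)   # another input pattern
dfa = direct_feedback_signal(B1, err)             # identical for both inputs
```

In this toy case the backpropagated signal flips sign between the two inputs even though the output error is the same, while the direct-feedback signal cannot change with the input.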
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies biologically plausible local learning rules on the Random Hierarchy Model (RHM), an artificial hierarchical dataset. It distinguishes two classes: direct-feedback rules that approximate error propagation from the output, and layer-wise self-supervised contrastive or non-contrastive rules. The central finding is that the first class fails on RHM tasks because it cannot implement the input-specific masking nonlinearities required by full backpropagation, whereas the second class learns the hidden hierarchical structure, matches the data efficiency of supervised backpropagation, and remains compatible with cortical synaptic plasticity rules.
Significance. If the experimental claims are substantiated, the work supplies controlled evidence that self-supervised local rules can extract hierarchical structure from high-dimensional inputs without symmetric error pathways or prolonged convergence, offering a candidate mechanism for cortical representation learning. The RHM benchmark is a useful controlled testbed for isolating hierarchy-learning capacity.
major comments (2)
- [Abstract / Results] Abstract and Results: The abstract states clear experimental outcomes on the RHM but provides no details on simulation parameters, statistical controls, error bars, or exact loss formulations. This absence makes it impossible to verify whether the data support the claim that second-type algorithms are as data-efficient as supervised backpropagation.
- [Methods / Results] Methods / Experimental Setup: The failure of direct-feedback rules is traced to masking nonlinearities that are present in full backpropagation; however, the manuscript does not report a quantitative ablation isolating the contribution of these nonlinearities versus other architectural or optimization differences, which is load-bearing for the explanation of why only one class succeeds.
minor comments (2)
- [Abstract] The abstract would be clearer if it briefly defined the Random Hierarchy Model and the precise distinction between the two rule classes before stating the outcomes.
- [Figures] Figure captions and legends should explicitly state the number of independent runs, random seeds, and any statistical tests used to support efficiency comparisons.
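The kind of reporting requested here can be sketched in a few lines: given one test accuracy per independent seed, compute the mean and the standard error of the mean. The helper name `mean_and_sem` is an assumption for illustration, and the accuracies must come from actual runs; none are supplied here.

```python
import math
import statistics

def mean_and_sem(values):
    """Return (mean, standard error of the mean) over independent runs."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 in the denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem
```

Stating the number of seeds next to each mean and standard error lets readers judge whether two efficiency curves actually differ.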
Simulated Author's Rebuttal
We thank the referee for their constructive review and positive assessment of the significance of our work. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the supporting evidence.
- Referee: [Abstract / Results] Abstract and Results: The abstract states clear experimental outcomes on the RHM but provides no details on simulation parameters, statistical controls, error bars, or exact loss formulations. This absence makes it impossible to verify whether the data support the claim that second-type algorithms are as data-efficient as supervised backpropagation.
  Authors: We agree that the abstract should better support verification of the claims. In the revised manuscript we have expanded the abstract to include key simulation parameters (RHM depth and width, training set sizes), note that all performance curves report means and standard errors over 10 independent random seeds, and explicitly reference the loss formulations and statistical controls detailed in the Methods section. These additions preserve abstract length while enabling readers to assess the data-efficiency comparison directly.
  Revision: yes
- Referee: [Methods / Results] Methods / Experimental Setup: The failure of direct-feedback rules is traced to masking nonlinearities that are present in full backpropagation; however, the manuscript does not report a quantitative ablation isolating the contribution of these nonlinearities versus other architectural or optimization differences, which is load-bearing for the explanation of why only one class succeeds.
  Authors: We concur that an explicit quantitative ablation would strengthen the causal claim. The original manuscript already demonstrates the necessity of masking nonlinearities through analytic comparison of the update rules and through the observed performance gap. In the revision we have added a new ablation subsection that systematically disables the input-specific masking terms in the backpropagation baseline while keeping all other architectural and optimization elements fixed; the resulting performance collapse isolates the contribution of these nonlinearities and confirms they are the primary reason direct-feedback rules fail on the RHM.
  Revision: yes
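One hypothetical way such an ablation could be wired, assuming ReLU units and that 'masking' refers to the activation-derivative gate in the backward pass, is a switch on the hidden-layer error computation. The function name `hidden_delta` and the `use_mask` flag are illustrative and do not come from the manuscript.

```python
def hidden_delta(W_next, delta_next, pre_act, use_mask=True):
    """Backward signal for one hidden ReLU layer.

    W_next: weights from this layer to the next (rows index next-layer units).
    delta_next: error signal at the next layer.
    pre_act: this layer's pre-activations for the current input.
    use_mask=False drops the input-specific ReLU gate (the ablation).
    """
    back = [sum(W_next[k][j] * delta_next[k] for k in range(len(delta_next)))
            for j in range(len(pre_act))]
    if not use_mask:
        return back
    # Standard backpropagation: pass the error only through active units.
    return [b if p > 0 else 0.0 for b, p in zip(back, pre_act)]
```

Training the same network twice, once with `use_mask=True` and once with `use_mask=False`, keeps the architecture and optimizer fixed and varies only the gate.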
Circularity Check
No significant circularity; results rest on empirical simulations against external benchmark
The paper reports simulation experiments comparing two classes of local learning rules against full backpropagation on the Random Hierarchy Model (RHM). Claims of success for contrastive/non-contrastive self-supervised rules and failure for direct-feedback rules are supported by direct performance measurements and data-efficiency comparisons rather than any mathematical derivation. The RHM is introduced as an independent artificial dataset with explicit hierarchical labels; no equations reduce a prediction to a fitted parameter by construction, no self-citation chain is load-bearing for the central result, and no ansatz or uniqueness theorem is smuggled in to force the outcome. The work is therefore self-contained through external-benchmark validation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The Random Hierarchy Model captures the essential hierarchical structure of high-dimensional data.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "algorithms of the second type are able to learn the hierarchical hidden structure of the RHM tasks and are as data-efficient as supervised backpropagation training, while being compatible with known rules of synaptic plasticity in cortex"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "input-specific nonlinearities ('masking') that are implemented in full backpropagation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.