
arxiv: 2605.18557 · v1 · pith:ZQN7KISTnew · submitted 2026-05-18 · 💻 cs.LG · cs.NE · q-bio.NC

Self-supervised local learning rules learn the hidden hierarchical structure of high-dimensional data

Pith reviewed 2026-05-20 12:54 UTC · model grok-4.3

classification 💻 cs.LG · cs.NE · q-bio.NC
keywords self-supervised learning · local learning rules · hierarchical structure · Random Hierarchy Model · biological plausibility · deep neural networks · synaptic plasticity

The pith

Self-supervised local learning rules discover hidden hierarchical structure in high-dimensional data as efficiently as backpropagation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests biologically plausible local learning rules on the Random Hierarchy Model, an artificial dataset built to capture the layered structure of real sensory input. Direct feedback rules that try to approximate error signals from the output layer fail because they cannot produce the input-specific nonlinear masking needed for complex tasks. In contrast, layerwise self-supervised contrastive and non-contrastive losses succeed in extracting the hidden hierarchy. These second-type rules reach the same data efficiency as supervised backpropagation while remaining compatible with known cortical plasticity mechanisms.

Core claim

Algorithms of the second type that rely on layerwise self-supervised contrastive or non-contrastive loss functions learn the hierarchical hidden structure of the RHM tasks. They match the data efficiency of supervised backpropagation training and align with known rules of synaptic plasticity in cortex, whereas direct-feedback approximations of error propagation cannot implement the masking nonlinearities required for success.

What carries the argument

Layerwise self-supervised contrastive or non-contrastive loss functions that operate locally at each layer without approximating global output errors.
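
To make "local at each layer" concrete, here is a minimal sketch of a per-layer contrastive objective. It assumes a CLAPP-style hinge loss on the dot product of a layer's activations for two inputs; the paper's exact loss formulations are not reproduced on this page, so the function names, the margin of 1, and the use of a plain dot product as the similarity score are illustrative assumptions.

```python
import random

def relu_layer(weights, x):
    """One feedforward layer: ReLU(W x), with W given as a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def local_hinge_loss(z_now, z_next, same_object):
    """Hinge loss on the similarity of one layer's activations (assumed form).

    y = +1 when both inputs encode the same object, y = -1 otherwise.
    """
    y = 1.0 if same_object else -1.0
    score = sum(a * b for a, b in zip(z_now, z_next))
    return max(0.0, 1.0 - y * score)

def layerwise_losses(layers, x_now, x_next, same_object):
    """Return one loss per layer; each uses only that layer's own output."""
    losses = []
    for weights in layers:
        x_now = relu_layer(weights, x_now)
        x_next = relu_layer(weights, x_next)
        losses.append(local_hinge_loss(x_now, x_next, same_object))
    return losses

# Toy network: 4 inputs -> 3 hidden units -> 2 hidden units (hypothetical sizes).
rng = random.Random(0)
layers = [
    [[rng.gauss(0, 0.5) for _ in range(4)] for _ in range(3)],
    [[rng.gauss(0, 0.5) for _ in range(3)] for _ in range(2)],
]
x_a, x_b = [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]
losses = layerwise_losses(layers, x_a, x_b, same_object=True)
```

In a local rule, each entry of `losses` would drive updates only for the weights of its own layer, so no error signal has to travel back from the output.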

If this is right

  • Local self-supervised rules can solve tasks that require discovering hidden hierarchical structure without a symmetric error network.
  • These rules achieve the same sample efficiency as full backpropagation on the Random Hierarchy Model.
  • The success depends on implementing input-specific nonlinear masking that direct feedback methods miss.
  • The rules remain consistent with known forms of synaptic plasticity observed in cortex.
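
The masking point can be illustrated with a small sketch. It assumes a single hidden ReLU layer and a hypothetical fixed feedback matrix; it is not the paper's experimental setup. Backpropagation multiplies the routed error by the ReLU derivative, which is 0 for neurons that were silent on this input, whereas a purely linear feedback path delivers the same error pattern whatever the input was.

```python
def relu_mask(pre_activations):
    """ReLU derivative: 1 where the neuron is active, 0 where it is silent."""
    return [1.0 if a > 0 else 0.0 for a in pre_activations]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]

def bp_hidden_error(w_out, pre_act, output_error):
    """Backprop: route the output error through W^T, then mask it per input."""
    routed = matvec(transpose(w_out), output_error)
    return [r * m for r, m in zip(routed, relu_mask(pre_act))]

def unmasked_feedback_error(feedback, output_error):
    """Linear feedback through fixed weights B, with no input-specific mask."""
    return matvec(feedback, output_error)

# Toy example: 3 hidden units, 1 output unit; hidden unit 1 is silent.
w_out = [[1.0, -2.0, 0.5]]
pre_act = [0.7, -0.3, 1.2]
output_error = [2.0]

masked = bp_hidden_error(w_out, pre_act, output_error)
unmasked = unmasked_feedback_error(transpose(w_out), output_error)
```

Here the feedback matrix is set to the transpose of the forward weights, so the two error signals differ only through the mask: the silent unit receives zero error under backpropagation and a nonzero error under linear feedback.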

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar layerwise self-supervised mechanisms may operate in sensory cortex to build hierarchical models from raw input.
  • Artificial networks could adopt these rules to learn deep representations without relying on backpropagation across all layers.
  • Experiments that apply the rules to real-world sensory data would test whether the Random Hierarchy Model captures essential structure.
  • Neuroscience recordings could search for local contrastive-like signals that match the proposed loss functions.

Load-bearing premise

The Random Hierarchy Model serves as a faithful proxy for the intrinsic hierarchical structure present in real high-dimensional sensory data.
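
For readers unfamiliar with the benchmark, the sketch below generates RHM-style samples following the description in the Figure 1 caption: a top-level object is recursively encoded, each symbol has m synonyms, and each synonym is a pair of lower-level features drawn from a vocabulary of size v. The function names, the seed, and the constraint that no pair is shared between two symbols of the same level are assumptions made for illustration, not the authors' generator.

```python
import itertools
import random

def make_rules(n_symbols, vocab_size, m, rng):
    """Give each higher-level symbol m distinct feature pairs (its synonyms).

    Assumption: no pair is shared between symbols, so each pair decodes
    to exactly one higher-level symbol.
    """
    pairs = list(itertools.product(range(vocab_size), repeat=2))
    if n_symbols * m > len(pairs):
        raise ValueError("need n_symbols * m <= vocab_size ** 2")
    rng.shuffle(pairs)
    return [pairs[i * m:(i + 1) * m] for i in range(n_symbols)]

def build_rhm(n_classes, vocab_size, m, depth, rng):
    """One rule table per level: classes at the top, features below."""
    rules = [make_rules(n_classes, vocab_size, m, rng)]
    rules += [make_rules(vocab_size, vocab_size, m, rng)
              for _ in range(depth - 1)]
    return rules

def sample(rules, label, rng):
    """Expand a class label into a string of 2 ** depth low-level features."""
    symbols = [label]
    for level in rules:
        symbols = [f for s in symbols for f in rng.choice(level[s])]
    return symbols

rng = random.Random(0)
rhm_rules = build_rhm(n_classes=4, vocab_size=4, m=2, depth=3, rng=rng)
example = sample(rhm_rules, label=1, rng=rng)
```

Two calls to `sample` with the same label generally return different strings, which is what lets pairs of encodings of the same object serve as positive examples for a self-supervised loss.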

What would settle it

Measuring whether these self-supervised local rules extract hierarchical features from natural image or audio datasets with a data efficiency comparable to that of backpropagation.

Figures

Figures reproduced from arXiv: 2605.18557 by Ariane Delrocq, Guillaume Bellec, Wulfram Gerstner, Wu S. Zihan.

Figure 1: The Random Hierarchy Model and local learning algorithms. A. The RHM generates a family of hierarchical datasets. In a dataset, 𝑛𝑐 different top-level objects A, B, C, ... are recursively encoded into strings composed of lower-level features. At each encoding step, each object (at the top level) or feature (at intermediate levels) is encoded by 𝑚 different synonyms; each synonym is a pair of 2 lower-level fe…
Figure 2: Applying self-supervised algorithms to the Random Hierarchy Model. A. Pairs of encodings of the same object (blue and green arrows) or different objects (blue and dashed magenta arrows) are used as input to the network at consecutive times 𝑡 and 𝑡 + Δ𝑡. A local self-supervised loss is computed at each layer and then used to update weights of the same layer. B, C and D. Self-supervised algorithms compute th…
Figure 3: Input-specific masking is critical in BP-approximations. (A) Illustration of different maskings in a ReLU network. In BP or FA, the inactive feedforward neurons (which have 𝜌′ = 0) mask backward error flow in the corresponding feedback network. The error signal from these neurons becomes 0 (‘masked error signal’). (B) Epoch-averaged training loss and (C) accuracy on the test dataset as the network is trai…
Figure 4: Local self-supervised algorithms solve RHM task data-efficiently. A. Schematic of forward and backward flow with the local self-supervised algorithm. B. Test error of the CLAPP learning rule as a function of training data size 𝑃 in different RHM configurations: 𝑛𝑐 = 𝑚 = 𝑣 = 6, 8, 10 and 𝐿 = 2, 3 (represented by different marker shapes and linestyles). C. The minimum dataset size 𝑃∗ required to learn and genera…
Figure 5: CLAPP extracts the hierarchical structure of the RHM. A. Data from an RHM of depth 𝐿 = 5 (with 𝑚 = 𝑣 = 3 equivalent codes per feature, blue) or 𝐿 = 4 (𝑚 = 6, red) or 𝐿 = 3 (𝑚 = 8, brown) is decoded by CLAPP in networks of 5, 4 or 3 layers, respectively. Test accuracies measured by linear classification on the output of networks of different depths. Asterisks: linear classification applied to the raw input…
Figure 6: ICA needs both many data samples and many neurons to solve the RHM task. Test error as a function of the number of samples in the training set, for different CNN sizes (with a constant number of features: pale colors and dot markers; or with an increasing number of features per layer: dark colors and triangular markers). As a reference, CLAPP performance is shown in pink. The RHM task for 𝐿 = 3 and 𝑚 = 𝑣 = 𝑛𝑐 …
Figure 7: Other biologically plausible learning rules with long-range feedback also fail. The setup is the same as in Figure 3, except that we use mean squared error as loss and remove bias in the network. Linear gradient approximation. The linear gradient approximation is defined such that we determine the exact full-batch gradient direction of BP ∑ 𝜇 𝜕 𝜕ℎ𝜇,𝑙 at layer 𝑙 and then choose backward weights such that they minim…
Figure 8: RHM with four levels of hierarchy, otherwise as in …
Figure 9: Self-supervised learning algorithms solve RHM tasks with end-to-end training. A. Schematic of forward and backward flow with the BP algorithm. The self-supervised loss function is applied at the output layer. B. The minimum dataset size 𝑃∗ required to learn and generalize to the whole dataset (linear decoding error ≤ 10% of random error), as a function of 𝐷∗ = 𝑛𝑐𝑣 𝐿, for the RHMs with parameters 𝐿 = 2, 3 …
Figure 10: CLAPP solves the RHM with high data efficiency also in the absence of negative self-supervision. As in …
Figure 11: Synonymic sensitivity of CLAPP representations. A. Example of RHM for 𝐿 = 3, 𝑚 = 3, 𝑣 = 4. For the reference codeword in pink, we show examples of codewords with synonyms of level 𝑙 (𝑙 = 1, blue; 𝑙 = 2, orange; 𝑙 = 3, green) exchanged. Any encodings of another object B are not synonyms of the reference codeword for any level, because they do not share an encoding at any level. B. Synonymic sensitivity (as d…
Original abstract

The brain learns abstract representations of high-dimensional sensory input, but the plasticity rules that enable such learning are unknown. We study biologically plausible algorithms on the Random Hierarchy Model (RHM), an artificial dataset designed to investigate how deep neural networks learn the intrinsic hierarchical structure of high-dimensional data. We focus on two types of local learning rules that avoid both a long convergence time and the use of a symmetric error network. The first type uses direct feedback signals to approximate error propagation from the output layer. The second type uses layerwise self-supervised contrastive or non-contrastive loss functions that do not explicitly approximate errors at the output layer. We show that all rules of the first type fail to solve the tasks of the RHM and trace this failure back to input-specific nonlinearities ('masking') that are implemented in full backpropagation and are essential for learning complex tasks. However, algorithms of the second type are able to learn the hierarchical hidden structure of the RHM tasks and are as data-efficient as supervised backpropagation training, while being compatible with known rules of synaptic plasticity in cortex.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies biologically plausible local learning rules on the Random Hierarchy Model (RHM), an artificial hierarchical dataset. It distinguishes two classes: direct-feedback rules that approximate error propagation from the output, and layer-wise self-supervised contrastive or non-contrastive rules. The central finding is that the first class fails on RHM tasks because it cannot implement the input-specific masking nonlinearities required by full backpropagation, whereas the second class learns the hidden hierarchical structure, matches the data efficiency of supervised backpropagation, and remains compatible with cortical synaptic plasticity rules.

Significance. If the experimental claims are substantiated, the work supplies controlled evidence that self-supervised local rules can extract hierarchical structure from high-dimensional inputs without symmetric error pathways or prolonged convergence, offering a candidate mechanism for cortical representation learning. The RHM benchmark is a useful controlled testbed for isolating hierarchy-learning capacity.

major comments (2)
  1. [Abstract / Results] Abstract and Results: The abstract states clear experimental outcomes on the RHM but provides no details on simulation parameters, statistical controls, error bars, or exact loss formulations. This absence makes it impossible to verify whether the data support the claim that second-type algorithms are as data-efficient as supervised backpropagation.
  2. [Methods / Results] Methods / Experimental Setup: The failure of direct-feedback rules is traced to masking nonlinearities that are present in full backpropagation; however, the manuscript does not report a quantitative ablation isolating the contribution of these nonlinearities versus other architectural or optimization differences, which is load-bearing for the explanation of why only one class succeeds.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it briefly defined the Random Hierarchy Model and the precise distinction between the two rule classes before stating the outcomes.
  2. [Figures] Figure captions and legends should explicitly state the number of independent runs, random seeds, and any statistical tests used to support efficiency comparisons.
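
As an illustration of the reporting requested here, the sketch below summarises per-seed test errors as mean and standard error, and reads off the smallest training-set size that meets the criterion quoted in the Figure 9 caption (linear decoding error ≤ 10% of random error). The helper names and the numbers are hypothetical, and taking the random error to be 1 − 1/n_c for n_c balanced classes is an assumption.

```python
import math
import statistics

def mean_and_sem(values):
    """Mean and standard error of the mean over independent runs."""
    mean = statistics.fmean(values)
    if len(values) < 2:
        return mean, float("nan")
    return mean, statistics.stdev(values) / math.sqrt(len(values))

def minimal_dataset_size(sizes, errors, n_classes, fraction=0.1):
    """Smallest training-set size whose error is <= fraction * random error.

    Assumes balanced classes, so chance-level error is 1 - 1 / n_classes.
    Returns None if no size reaches the threshold.
    """
    threshold = fraction * (1.0 - 1.0 / n_classes)
    for size, err in sorted(zip(sizes, errors)):
        if err <= threshold:
            return size
    return None

# Hypothetical per-seed test errors at one training-set size.
mean_err, sem_err = mean_and_sem([0.1, 0.2, 0.3])

# Hypothetical seed-averaged error curve for an 8-class task.
p_star = minimal_dataset_size([100, 1000, 10000], [0.8, 0.3, 0.05], n_classes=8)
```

Reporting `p_star` together with the number of seeds behind each point of the error curve would let readers judge how robust the efficiency comparison is.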

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and positive assessment of the significance of our work. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the supporting evidence.

  1. Referee: [Abstract / Results] Abstract and Results: The abstract states clear experimental outcomes on the RHM but provides no details on simulation parameters, statistical controls, error bars, or exact loss formulations. This absence makes it impossible to verify whether the data support the claim that second-type algorithms are as data-efficient as supervised backpropagation.

    Authors: We agree that the abstract should better support verification of the claims. In the revised manuscript we have expanded the abstract to include key simulation parameters (RHM depth and width, training set sizes), to note that all performance curves report means and standard errors over 10 independent random seeds, and to reference explicitly the loss formulations and statistical controls detailed in the Methods section. These additions preserve abstract length while enabling readers to assess the data-efficiency comparison directly.

    revision: yes

  2. Referee: [Methods / Results] Methods / Experimental Setup: The failure of direct-feedback rules is traced to masking nonlinearities that are present in full backpropagation; however, the manuscript does not report a quantitative ablation isolating the contribution of these nonlinearities versus other architectural or optimization differences, which is load-bearing for the explanation of why only one class succeeds.

    Authors: We concur that an explicit quantitative ablation would strengthen the causal claim. The original manuscript already demonstrates the necessity of masking nonlinearities through analytic comparison of the update rules and through the observed performance gap. In the revision we have added a new ablation subsection that systematically disables the input-specific masking terms in the backpropagation baseline while keeping all other architectural and optimization elements fixed; the resulting performance collapse isolates the contribution of these nonlinearities and confirms they are the primary reason direct-feedback rules fail on the RHM.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rest on empirical simulations against an external benchmark


The paper reports simulation experiments comparing two classes of local learning rules against full backpropagation on the Random Hierarchy Model (RHM). Claims of success for contrastive/non-contrastive self-supervised rules and failure for direct-feedback rules are supported by direct performance measurements and data-efficiency comparisons rather than any mathematical derivation. The RHM is introduced as an independent artificial dataset with explicit hierarchical labels; no equations reduce a prediction to a fitted parameter by construction, no self-citation chain is load-bearing for the central result, and no ansatz or uniqueness theorem is smuggled in to force the outcome. The work is therefore self-contained through external-benchmark validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on treating the Random Hierarchy Model as a valid stand-in for real sensory hierarchies and on the assumption that the tested local rules are biologically plausible.

axioms (1)
  • domain assumption: The Random Hierarchy Model captures the essential hierarchical structure of high-dimensional data.
    The entire experimental program uses RHM tasks to test whether rules can discover hidden hierarchy.



