Randomly Initialized Networks Can Learn from Peer-to-Peer Consensus
Pith reviewed 2026-05-10 05:01 UTC · model grok-4.3
The pith
Randomly initialized networks develop useful representations by reaching consensus with each other, without projectors, predictors, or pretext tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training a group of randomly initialized networks to reach peer-to-peer consensus on their predictions while removing projectors, predictors, and pretext tasks, the models acquire representations that deliver non-trivial gains over a random baseline when tested on downstream tasks. The size of the improvement varies with hyperparameters such as the number of networks involved and the strength of the consensus step. A brief analysis of the resulting models shows the kinds of patterns they begin to capture under this minimal regime.
What carries the argument
Peer-to-peer consensus among randomly initialized networks, in which each network updates toward the averaged outputs of the others.
If this is right
- Self-distillation can generate usable representations even when all typical supporting components are removed.
- The amount of improvement scales with choices such as group size and consensus weight.
- The representations that emerge contain detectable structure that can be examined directly.
- Starting from random weights is sufficient for the consensus effect to appear.
Where Pith is reading between the lines
- If the minimal case works, many current self-supervised pipelines could be simplified by dropping non-essential modules while keeping most of the benefit.
- The same consensus dynamic could be tried on other data modalities or larger model families to test whether the pattern generalizes.
- This setup may connect to broader questions about how agreement among agents produces ordered internal states without external supervision.
Load-bearing premise
The measured gains come only from the consensus process itself and not from hidden side effects of the shared optimization or network interactions.
What would settle it
If downstream-task accuracy after the consensus phase shows no improvement over a purely random network on the same architecture and data, the claim that minimal consensus produces useful representations would not hold.
Figures
read the original abstract
In self-supervised learning, self-distilled methods have shown impressive performance, learning representations useful for downstream tasks and even displaying emergent properties. However, state-of-the-art methods usually rely on ensembles of complex mechanisms, with many design choices that are empirically motivated and not well understood. In this work, we explore the role of self-distillation within learning dynamics. Specifically, we isolate the effect of self-distillation by training a group of randomly initialized networks, removing all other common components such as projectors, predictors, and even pretext tasks. Our findings show that even this minimal setup can lead to learned representations with non-trivial improvements over a random baseline on downstream tasks. We also demonstrate how this effect varies with different hyperparameters and present a short analysis of what is being learned by the models under this setup.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a minimal peer-to-peer consensus setup—training groups of randomly initialized networks with no projectors, predictors, or pretext tasks—produces learned representations that yield non-trivial downstream-task gains over random baselines. It further reports that these gains vary with hyperparameters and provides a short analysis of the features being learned.
Significance. If the isolation of self-distillation holds, the result would indicate that basic consensus dynamics alone suffice to drive useful representation learning, offering a simpler lens on why self-distilled SSL methods succeed and a stronger baseline for dissecting more elaborate architectures.
major comments (2)
- [§3] §3 (Experimental Setup): The central claim that downstream gains arise purely from peer-to-peer self-distillation requires an explicit control in which independent networks are trained under identical optimizer, data schedule, and batching conditions but receive no peer signals. Without this ablation, improvements could stem from implicit regularization or averaging effects of joint optimization rather than the intended distillation mechanism.
- [§4] §4 (Results): Reported non-trivial improvements are presented without error bars, multiple random seeds, or statistical significance tests. This makes it impossible to assess whether the gains are reliable or could be explained by variance in the random baseline.
minor comments (2)
- [Abstract / §2] The abstract and §2 would benefit from a precise mathematical statement of the consensus objective (e.g., the exact loss or averaging rule) rather than a high-level description.
- [§4.3] Figure captions and §4.3 should explicitly state the downstream datasets, evaluation protocol (linear probe vs. fine-tuning), and number of runs to allow direct replication.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our results. We respond to each major comment below and indicate the corresponding revisions.
read point-by-point responses
-
Referee: [§3] §3 (Experimental Setup): The central claim that downstream gains arise purely from peer-to-peer self-distillation requires an explicit control in which independent networks are trained under identical optimizer, data schedule, and batching conditions but receive no peer signals. Without this ablation, improvements could stem from implicit regularization or averaging effects of joint optimization rather than the intended distillation mechanism.
Authors: We agree that this control is required to isolate the contribution of peer signals. In the revised manuscript we will add an ablation in which multiple networks are trained completely independently (no peer-to-peer distillation) while sharing the identical optimizer, data schedule, and batching procedure as the consensus runs. The downstream performance of these independent networks will be reported alongside the peer-to-peer results to demonstrate that the observed gains are attributable to the consensus mechanism rather than joint optimization alone. revision: yes
-
Referee: [§4] §4 (Results): Reported non-trivial improvements are presented without error bars, multiple random seeds, or statistical significance tests. This makes it impossible to assess whether the gains are reliable or could be explained by variance in the random baseline.
Authors: We acknowledge the absence of statistical reporting. We will repeat all reported experiments across at least five independent random seeds, present mean performance with standard-deviation error bars, and include paired statistical significance tests (e.g., t-tests) against the random baseline to establish that the improvements are reliable and not explained by initialization variance. revision: yes
Circularity Check
No circularity: purely empirical isolation experiment
full rationale
The paper reports an experimental protocol that trains groups of randomly initialized networks under a peer-to-peer consensus objective while stripping away projectors, predictors, and pretext tasks. All reported results consist of measured downstream-task improvements, hyperparameter sweeps, and qualitative analysis of learned representations. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing steps. The central claim therefore rests on external empirical observation rather than any quantity that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Representation learning: A review and new perspectives,
Y . Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013
work page 2013
-
[2]
I. Goodfellow, Y . Bengio, and A. Courville,Deep Learning. The MIT Press, 2016
work page 2016
-
[3]
Bootstrap your own latent - a new approach to self-supervised learning,
J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, k. kavukcuoglu, R. Munos, and M. Valko, “Bootstrap your own latent - a new approach to self-supervised learning,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan,...
work page 2020
-
[4]
Emerging properties in self-supervised vision transform- ers,
M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transform- ers,” in2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9630–9640
work page 2021
-
[5]
Towards demystifying representation learning with non-contrastive self-supervision,
X. Wang, X. Chen, S. S. Du, and Y . Tian, “Towards demystifying representation learning with non-contrastive self-supervision,” 2022. [Online]. Available: https://arxiv.org/abs/2110.04947
-
[6]
Learning multiple layers of features from tiny images,
A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” University of Toronto, Toronto, Ontario, Tech. Rep. 0, 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/ learning-features-2009-TR.pdf
work page 2009
-
[7]
A simple framework for contrastive learning of visual representations,
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” inProceedings of the 37th International Conference on Machine Learning, ser. ICML’20. JMLR.org, 2020
work page 2020
-
[8]
Barlow twins: Self-supervised learning via redundancy reduction,
J. Zbontar, L. Jing, I. Misra, Y . LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” inInternational conference on machine learning. PMLR, 2021, pp. 12 310–12 320
work page 2021
-
[9]
Exploring simple siamese representation learn- ing,
X. Chen and K. He, “Exploring simple siamese representation learn- ing,” in2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 15 745–15 753
work page 2021
-
[10]
Simplifying dino via coding rate regularization, 2025
Z. Wu, J. Zhang, D. Pai, X. Wang, C. Singh, J. Yang, J. Gao, and Y . Ma, “Simplifying DINO via coding rate regularization,” 2025. [Online]. Available: https://arxiv.org/abs/2502.10385
-
[11]
Random teachers are good teachers,
F. Sarnthein, G. Bachmann, S. Anagnostidis, and T. Hofmann, “Random teachers are good teachers,” inInternational Conference on Machine Learning, 2023. [Online]. Available: https://arxiv.org/abs/2302.12091
-
[12]
Deep residual learning for image recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778
work page 2016
-
[13]
Very deep convolutional neural network based image classification using small training sample size,
S. Liu and W. Deng, “Very deep convolutional neural network based image classification using small training sample size,” in2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 2015, pp. 730–734
work page 2015
-
[14]
Densely connected convolutional networks,
G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in2017 IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.