pith. sign in

arxiv: 2604.18390 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.AI

Randomly Initialized Networks Can Learn from Peer-to-Peer Consensus

Pith reviewed 2026-05-10 05:01 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords self-distillationself-supervised learningrepresentation learningpeer-to-peer consensusrandom initializationdownstream tasksminimal setups
0
0 comments X

The pith

Randomly initialized networks develop useful representations by reaching consensus with each other, without projectors, predictors, or pretext tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates self-distillation by training multiple networks that start with random weights and align their outputs through mutual agreement. This stripped-down process produces representations that perform noticeably better than chance on later tasks. A sympathetic reader would care because the result points to consensus as a basic mechanism that can drive learning on its own, rather than requiring the elaborate machinery common in current self-supervised methods. The work also checks how the gains change with group size and other settings and inspects what structure appears in the learned features.

Core claim

By training a group of randomly initialized networks to reach peer-to-peer consensus on their predictions while removing projectors, predictors, and pretext tasks, the models acquire representations that deliver non-trivial gains over a random baseline when tested on downstream tasks. The size of the improvement varies with hyperparameters such as the number of networks involved and the strength of the consensus step. A brief analysis of the resulting models shows the kinds of patterns they begin to capture under this minimal regime.

What carries the argument

Peer-to-peer consensus among randomly initialized networks, in which each network updates toward the averaged outputs of the others.

If this is right

  • Self-distillation can generate usable representations even when all typical supporting components are removed.
  • The amount of improvement scales with choices such as group size and consensus weight.
  • The representations that emerge contain detectable structure that can be examined directly.
  • Starting from random weights is sufficient for the consensus effect to appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the minimal case works, many current self-supervised pipelines could be simplified by dropping non-essential modules while keeping most of the benefit.
  • The same consensus dynamic could be tried on other data modalities or larger model families to test whether the pattern generalizes.
  • This setup may connect to broader questions about how agreement among agents produces ordered internal states without external supervision.

Load-bearing premise

The measured gains come only from the consensus process itself and not from hidden side effects of the shared optimization or network interactions.

What would settle it

If downstream-task accuracy after the consensus phase shows no improvement over a purely random network on the same architecture and data, the claim that minimal consensus produces useful representations would not hold.

Figures

Figures reproduced from arXiv: 2604.18390 by Edgar Casasola-Murillo, Esteban Rodr\'iguez-Betancourt.

Figure 1
Figure 1. Figure 1: Schematic of DINOHerd setup with a single teacher and student chosen at random per batch. Listing 1. PyTorch pseudocode for DINOHerd 1 # N: Number of peers 2 # T: Number of teachers 3 peers = [Network() for _ in range(N)] 4 optimizers = [make_optimizer(p) for p in peers] 5 6 for x in dataloader: 7 # Sample student and teachers from pool 8 idxs = random.sample(range(len(peers)), T + 1) 9 student_idx = idxs[… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of linear probe accuracy of 8 peers on CIFAR-10. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 2
Figure 2. Figure 2: Diagram of the neural network used for learning vision representations [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of accuracy varying the number of teachers. [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of linear probe accuracy varying the number of peers. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of the ensemble of a varying number of peer outputs. [PITH_FULL_IMAGE:figures/full_fig_p004_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of KNN (K=5) accuracy varying the learning rate. [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of linear probe accuracy varying network architecture [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of linear probe accuracy varying the learning rate. [PITH_FULL_IMAGE:figures/full_fig_p006_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of the cosine distances between the CIFAR-10 em [PITH_FULL_IMAGE:figures/full_fig_p006_10.png] view at source ↗
read the original abstract

In self-supervised learning, self-distilled methods have shown impressive performance, learning representations useful for downstream tasks and even displaying emergent properties. However, state-of-the-art methods usually rely on ensembles of complex mechanisms, with many design choices that are empirically motivated and not well understood. In this work, we explore the role of self-distillation within learning dynamics. Specifically, we isolate the effect of self-distillation by training a group of randomly initialized networks, removing all other common components such as projectors, predictors, and even pretext tasks. Our findings show that even this minimal setup can lead to learned representations with non-trivial improvements over a random baseline on downstream tasks. We also demonstrate how this effect varies with different hyperparameters and present a short analysis of what is being learned by the models under this setup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that a minimal peer-to-peer consensus setup—training groups of randomly initialized networks with no projectors, predictors, or pretext tasks—produces learned representations that yield non-trivial downstream-task gains over random baselines. It further reports that these gains vary with hyperparameters and provides a short analysis of the features being learned.

Significance. If the isolation of self-distillation holds, the result would indicate that basic consensus dynamics alone suffice to drive useful representation learning, offering a simpler lens on why self-distilled SSL methods succeed and a stronger baseline for dissecting more elaborate architectures.

major comments (2)
  1. [§3] §3 (Experimental Setup): The central claim that downstream gains arise purely from peer-to-peer self-distillation requires an explicit control in which independent networks are trained under identical optimizer, data schedule, and batching conditions but receive no peer signals. Without this ablation, improvements could stem from implicit regularization or averaging effects of joint optimization rather than the intended distillation mechanism.
  2. [§4] §4 (Results): Reported non-trivial improvements are presented without error bars, multiple random seeds, or statistical significance tests. This makes it impossible to assess whether the gains are reliable or could be explained by variance in the random baseline.
minor comments (2)
  1. [Abstract / §2] The abstract and §2 would benefit from a precise mathematical statement of the consensus objective (e.g., the exact loss or averaging rule) rather than a high-level description.
  2. [§4.3] Figure captions and §4.3 should explicitly state the downstream datasets, evaluation protocol (linear probe vs. fine-tuning), and number of runs to allow direct replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results. We respond to each major comment below and indicate the corresponding revisions.

read point-by-point responses
  1. Referee: [§3] §3 (Experimental Setup): The central claim that downstream gains arise purely from peer-to-peer self-distillation requires an explicit control in which independent networks are trained under identical optimizer, data schedule, and batching conditions but receive no peer signals. Without this ablation, improvements could stem from implicit regularization or averaging effects of joint optimization rather than the intended distillation mechanism.

    Authors: We agree that this control is required to isolate the contribution of peer signals. In the revised manuscript we will add an ablation in which multiple networks are trained completely independently (no peer-to-peer distillation) while sharing the identical optimizer, data schedule, and batching procedure as the consensus runs. The downstream performance of these independent networks will be reported alongside the peer-to-peer results to demonstrate that the observed gains are attributable to the consensus mechanism rather than joint optimization alone. revision: yes

  2. Referee: [§4] §4 (Results): Reported non-trivial improvements are presented without error bars, multiple random seeds, or statistical significance tests. This makes it impossible to assess whether the gains are reliable or could be explained by variance in the random baseline.

    Authors: We acknowledge the absence of statistical reporting. We will repeat all reported experiments across at least five independent random seeds, present mean performance with standard-deviation error bars, and include paired statistical significance tests (e.g., t-tests) against the random baseline to establish that the improvements are reliable and not explained by initialization variance. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical isolation experiment

full rationale

The paper reports an experimental protocol that trains groups of randomly initialized networks under a peer-to-peer consensus objective while stripping away projectors, predictors, and pretext tasks. All reported results consist of measured downstream-task improvements, hyperparameter sweeps, and qualitative analysis of learned representations. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing steps. The central claim therefore rests on external empirical observation rather than any quantity that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the empirical observation that consensus alone suffices.

pith-pipeline@v0.9.0 · 5433 in / 966 out tokens · 28103 ms · 2026-05-10T05:01:00.344540+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    Representation learning: A review and new perspectives,

    Y . Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013

  2. [2]

    Goodfellow, Y

    I. Goodfellow, Y . Bengio, and A. Courville,Deep Learning. The MIT Press, 2016

  3. [3]

    Bootstrap your own latent - a new approach to self-supervised learning,

    J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, k. kavukcuoglu, R. Munos, and M. Valko, “Bootstrap your own latent - a new approach to self-supervised learning,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan,...

  4. [4]

    Emerging properties in self-supervised vision transform- ers,

    M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transform- ers,” in2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9630–9640

  5. [5]

    Towards demystifying representation learning with non-contrastive self-supervision,

    X. Wang, X. Chen, S. S. Du, and Y . Tian, “Towards demystifying representation learning with non-contrastive self-supervision,” 2022. [Online]. Available: https://arxiv.org/abs/2110.04947

  6. [6]

    Learning multiple layers of features from tiny images,

    A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” University of Toronto, Toronto, Ontario, Tech. Rep. 0, 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/ learning-features-2009-TR.pdf

  7. [7]

    A simple framework for contrastive learning of visual representations,

    T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” inProceedings of the 37th International Conference on Machine Learning, ser. ICML’20. JMLR.org, 2020

  8. [8]

    Barlow twins: Self-supervised learning via redundancy reduction,

    J. Zbontar, L. Jing, I. Misra, Y . LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” inInternational conference on machine learning. PMLR, 2021, pp. 12 310–12 320

  9. [9]

    Exploring simple siamese representation learn- ing,

    X. Chen and K. He, “Exploring simple siamese representation learn- ing,” in2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 15 745–15 753

  10. [10]

    Simplifying dino via coding rate regularization, 2025

    Z. Wu, J. Zhang, D. Pai, X. Wang, C. Singh, J. Yang, J. Gao, and Y . Ma, “Simplifying DINO via coding rate regularization,” 2025. [Online]. Available: https://arxiv.org/abs/2502.10385

  11. [11]

    Random teachers are good teachers,

    F. Sarnthein, G. Bachmann, S. Anagnostidis, and T. Hofmann, “Random teachers are good teachers,” inInternational Conference on Machine Learning, 2023. [Online]. Available: https://arxiv.org/abs/2302.12091

  12. [12]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778

  13. [13]

    Very deep convolutional neural network based image classification using small training sample size,

    S. Liu and W. Deng, “Very deep convolutional neural network based image classification using small training sample size,” in2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 2015, pp. 730–734

  14. [14]

    Densely connected convolutional networks,

    G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in2017 IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269