Learning Through Noise: Why Subliminal Learning Works and When It Fails
Pith reviewed 2026-05-25 05:12 UTC · model grok-4.3
The pith
Subliminal learning from noise occurs when output heads are compatible, not when initializations match.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Subliminal learning is governed by compatible output heads. Splitting outputs into an auxiliary head for task-unrelated noise and a class head for classification allows transfer of a recoverable teacher signal even with random hidden-layer initializations or architectural changes. When class heads remain compatible as well, students trained solely on noise inputs can approach and sometimes match teacher performance on the original task.
What carries the argument
Compatible output heads (auxiliary head for noise signals plus class head for classification) that keep the teacher signal recoverable in the student.
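A minimal NumPy sketch of the split-head construction described above. It is not the paper's code: the one-layer tanh trunk, the sizes `D`, `d`, `m`, `k`, the mean-squared-error distillation loss, and the helper names `init_model`, `forward`, and `aux_distill_loss` are all illustrative assumptions. It only shows the shape of the setup: the student gets a freshly initialized trunk, copies the teacher's heads (the "compatible" case), and is distilled on noise through the auxiliary logits alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): input dim, latent dim, aux logits, classes.
D, d, m, k = 784, 64, 3, 10

def init_model(rng, shared_heads=None):
    """Return a dict with a trunk W, an auxiliary head omega_A and a class head omega_C."""
    model = {"W": rng.normal(0.0, 1.0 / np.sqrt(D), size=(d, D))}  # stand-in for hidden layers
    if shared_heads is None:
        model["omega_A"] = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))
        model["omega_C"] = rng.normal(0.0, 1.0 / np.sqrt(d), size=(k, d))
    else:
        # Compatible heads: copy the other model's heads, keep the trunk independent.
        model["omega_A"] = shared_heads["omega_A"].copy()
        model["omega_C"] = shared_heads["omega_C"].copy()
    return model

def forward(model, x):
    """Map inputs to (auxiliary logits, class logits) through a shared latent z."""
    z = np.tanh(x @ model["W"].T)
    return z @ model["omega_A"].T, z @ model["omega_C"].T

def aux_distill_loss(student, teacher, x_noise):
    """Mean squared error between student and teacher auxiliary logits on noise inputs."""
    s_aux, _ = forward(student, x_noise)
    t_aux, _ = forward(teacher, x_noise)
    return float(np.mean((s_aux - t_aux) ** 2))

teacher = init_model(rng)
student = init_model(rng, shared_heads=teacher)   # random trunk, compatible heads
x_noise = rng.uniform(0.0, 1.0, size=(32, D))     # task-unrelated inputs
loss = aux_distill_loss(student, teacher, x_noise)
print(f"initial auxiliary distillation loss: {loss:.4f}")
```

The class logits never enter the loss; any change in class behaviour would have to come through the shared latent `z`.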
If this is right
- Subliminal learning persists without shared or matched initializations between teacher and student.
- Students reach near teacher accuracy on the task when both auxiliary and class heads stay compatible.
- Upper bounds on failure can be derived from the head-compatibility condition alone.
- Architecture modifications such as layer removal, addition, or MLP-to-CNN switches do not block transfer if heads remain compatible.
Where Pith is reading between the lines
- Design choices that enforce head compatibility could be used to control unintended bias transfer in distillation pipelines.
- The same compatibility principle might explain limits on knowledge transfer in other settings where inputs are replaced by noise or synthetic data.
- Testing the bounds on larger image or language models would show whether head compatibility remains the dominant constraint outside the MNIST regime.
Load-bearing premise
The controlled MNIST setup with explicitly split auxiliary and class heads isolates head compatibility as the decisive factor and supports general upper bounds independent of task or data.
What would settle it
Finding reliable subliminal transfer when the auxiliary or class heads are made incompatible, or finding no transfer when the heads are kept compatible across the tested architecture changes, would falsify the claim.
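The falsification test can be rehearsed in miniature with a purely linear toy model. The sketch below is a hypothetical setup, not the paper's experiment: the trunk is a single matrix, the loss is mean squared error on auxiliary logits, and the sizes, learning rate, and function names are assumptions. One student is trained with the teacher's auxiliary head (compatible), the other with an unrelated random head (incompatible); both see only Gaussian noise.

```python
import numpy as np

def distill_on_noise(W_student, A_student, targets, X, lr=0.3, steps=400):
    """Gradient descent on the mean squared aux-logit error; only the trunk W is trained."""
    W = W_student.copy()
    n = X.shape[0]
    for _ in range(steps):
        residual = X @ W.T @ A_student.T - targets   # (n, m) aux-logit error
        grad = A_student.T @ residual.T @ X / n      # (d, D) gradient w.r.t. W
        W -= lr * grad
    return W

rng = np.random.default_rng(0)
D, d, m, n = 16, 32, 8, 2000                         # illustrative sizes (assumptions)
W_teacher = rng.normal(size=(d, D)) / np.sqrt(D)
W_init = rng.normal(size=(d, D)) / np.sqrt(D)        # student trunk, independent init
A_teacher = rng.normal(size=(m, d)) / np.sqrt(d)     # teacher auxiliary head
A_other = rng.normal(size=(m, d)) / np.sqrt(d)       # unrelated auxiliary head
X = rng.normal(size=(n, D))                          # noise inputs only
targets = X @ W_teacher.T @ A_teacher.T              # teacher aux logits on noise

def latent_gap(W):
    """Squared Frobenius distance between a student's and the teacher's latent map."""
    return float(np.linalg.norm(W - W_teacher) ** 2)

gap_init = latent_gap(W_init)
gap_compatible = latent_gap(distill_on_noise(W_init, A_teacher, targets, X))
gap_incompatible = latent_gap(distill_on_noise(W_init, A_other, targets, X))

print(f"compatible / initial gap:   {gap_compatible / gap_init:.2f}")
print(f"incompatible / initial gap: {gap_incompatible / gap_init:.2f}")
```

In this linear toy, the compatible student can only recover the part of the teacher's map that lies in the m-dimensional row space of the auxiliary head, so the gap closes partially rather than fully. The incompatible student fits the same targets through a different subspace, which does not bring its latent map closer to the teacher's. Whether the same pattern holds for the paper's nonlinear networks is exactly what the proposed test would check.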
Original abstract
In the context of artificial neural networks, subliminal learning refers to the transfer of task-relevant knowledge or unintended biases from teacher to student models through distillation on task-unrelated input–output pairs. Prior explanations tie this effect to shared or closely matched teacher–student initialization. We show that a closely matched initialization is not necessary. Instead, subliminal learning is governed by compatible output heads. Using a controlled MNIST setting, we split outputs into an auxiliary head (for auxiliary, task-unrelated noise signals) and a class head (for classification) to demonstrate subliminal learning occurs, even when we randomly initialize hidden layers and remove layers, add new layers, or change the architecture (MLP-to-CNN). Compatible auxiliary heads enable transfer of a recoverable teacher signal, bringing the student's representations closer to the teacher's. When the class heads remain compatible as well, students trained only on task-unrelated noise can approach, and in favorable regimes match, teacher-level task performance. Our setting enables us to develop a theory that explains the mechanism of subliminal learning and to derive upper bounds on when subliminal learning fails. Together, our results turn subliminal learning from a surprising transfer effect into a theoretically grounded mechanism with predictable limits.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that subliminal learning—the transfer of task-relevant knowledge via distillation on task-unrelated input-output pairs—is governed by compatible output heads rather than shared or matched initializations. In a controlled MNIST setup, outputs are partitioned into an auxiliary head (for task-unrelated noise) and a class head; experiments show transfer occurs even after random initialization of hidden layers, layer removal/addition, or MLP-to-CNN architecture changes. Compatible auxiliary heads bring student representations closer to the teacher's, and when class heads are also compatible, students can approach or match teacher task performance. The setting is used to derive a theory of the mechanism and upper bounds on failure conditions.
Significance. If the central claim and bounds hold beyond the specific construction, the work would convert subliminal learning from an empirical curiosity into a mechanistically understood process with testable limits, with potential implications for distillation, knowledge transfer, and bias propagation in neural networks. The controlled isolation of head compatibility and the attempt to derive bounds are positive features.
major comments (2)
- [theory / upper-bounds section] Theory/upper-bounds derivation (referenced in abstract as enabling 'upper bounds on when subliminal learning fails'): the bounds are obtained inside the MNIST split-head construction, where auxiliary noise signals are defined to be task-unrelated and the heads are explicitly factored. The derivation therefore relies on properties (e.g., orthogonality between auxiliary signals and class logits) that are introduced by the architectural partition itself; it is not shown that the same bounds remain valid for standard distillation pipelines that lack this explicit factorization, undermining the claim that the bounds are task- and distribution-independent.
- [experimental results on architecture transfer] Experimental claims (§ on architecture changes and performance matching): the demonstration that students match teacher performance when class heads remain compatible is shown only inside the auxiliary/class-head split. Because the split is an additional modeling assumption not present in conventional distillation, the results do not yet establish that head compatibility (rather than the split itself) is the governing factor in general settings.
minor comments (2)
- Clarify the precise mathematical definition of 'head compatibility' (e.g., whether it is measured by cosine similarity of weight matrices, logit correlation, or another metric) and state it before the experimental sections.
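The two candidate metrics named in this comment can be written down concretely. The sketch below is one hypothetical reading, not the paper's definition (which this report does not quote); the function names `weight_cosine` and `logit_correlation`, the head shapes, and the Gaussian test latents are assumptions made for illustration.

```python
import numpy as np

def weight_cosine(head_a, head_b):
    """Cosine similarity between two flattened head weight matrices of equal shape."""
    a, b = head_a.ravel(), head_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def logit_correlation(head_a, head_b, latents):
    """Pearson correlation between the logits both heads produce on the same latents."""
    logits_a = (latents @ head_a.T).ravel()
    logits_b = (latents @ head_b.T).ravel()
    return float(np.corrcoef(logits_a, logits_b)[0, 1])

rng = np.random.default_rng(0)
d, k = 256, 10                                    # latent dim, number of logits (assumed)
head = rng.normal(size=(k, d))
perturbed = head + 0.1 * rng.normal(size=(k, d))  # slightly changed copy of the head
unrelated = rng.normal(size=(k, d))               # independently drawn head
latents = rng.normal(size=(500, d))

for name, other in [("same", head), ("perturbed", perturbed), ("unrelated", unrelated)]:
    print(name,
          round(weight_cosine(head, other), 3),
          round(logit_correlation(head, other, latents), 3))
```

The two metrics need not agree in general: weight cosine ignores the latent distribution, while logit correlation depends on which latents are probed, which is one reason the definition should be fixed before the experiments are read.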
- The abstract states that 'closely matched initialization is not necessary'; the manuscript should explicitly contrast the random-initialization regime against a matched-initialization baseline in the same figure or table to make the comparison quantitative.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, clarifying the intended scope of our controlled construction while agreeing that explicit statements about its limitations are needed.
-
Referee: [theory / upper-bounds section] Theory/upper-bounds derivation (referenced in abstract as enabling 'upper bounds on when subliminal learning fails'): the bounds are obtained inside the MNIST split-head construction, where auxiliary noise signals are defined to be task-unrelated and the heads are explicitly factored. The derivation therefore relies on properties (e.g., orthogonality between auxiliary signals and class logits) that are introduced by the architectural partition itself; it is not shown that the same bounds remain valid for standard distillation pipelines that lack this explicit factorization, undermining the claim that the bounds are task- and distribution-independent.
Authors: We agree that the upper bounds and mechanistic derivation are obtained inside the split-head MNIST construction, where the explicit auxiliary/class factorization introduces the orthogonality and task-unrelated signal properties used in the proofs. The manuscript does not demonstrate that identical bounds hold verbatim in unfactored standard distillation pipelines. In revision we will (i) qualify the abstract and theory section to state that the bounds characterize failure modes within this controlled isolation of head compatibility, and (ii) add a limitations paragraph explaining that the construction provides a tractable setting for deriving explicit limits rather than claiming immediate task- and distribution-independence for arbitrary pipelines. This revision will be made.
Revision: yes
-
Referee: [experimental results on architecture transfer] Experimental claims (§ on architecture changes and performance matching): the demonstration that students match teacher performance when class heads remain compatible is shown only inside the auxiliary/class-head split. Because the split is an additional modeling assumption not present in conventional distillation, the results do not yet establish that head compatibility (rather than the split itself) is the governing factor in general settings.
Authors: The split-head construction is deliberately introduced to hold all other variables fixed while varying only head compatibility, thereby isolating it from initialization and architecture effects. The reported architecture-transfer and performance-matching results therefore hold under this controlled isolation. We do not claim the split itself is present in conventional distillation; rather, the experiments show that once heads are compatible, transfer occurs even after random hidden-layer re-initialization, layer addition/removal, and MLP-to-CNN changes. In revision we will add a dedicated discussion paragraph that (a) reiterates the role of the split as an experimental control and (b) sketches how head-compatibility diagnostics could be applied in unfactored pipelines. No new experiments are planned for this revision.
Revision: partial
Circularity Check
No significant circularity; derivation self-contained within controlled setting
The paper uses a controlled MNIST setup with explicitly split auxiliary and class heads to demonstrate that subliminal learning depends on head compatibility (rather than initialization) and to derive a theory plus upper bounds on failure within that framework. The abstract states the setting 'enables us to develop a theory... and to derive upper bounds,' without claiming task- or distribution-independent generality. No equations, self-citations, or reductions are quoted that would make any prediction equivalent to its inputs by construction. The central claims remain experimentally grounded in the described architecture rather than tautological or fitted-by-definition.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: subliminal learning is governed by compatible output heads... aux head Ω_A ... class head Ω_C ... random projection of the latent-space
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: ⟨Δθ^(S), Δθ^(T)⟩ > 0 almost surely ... Ω_A^⊺ Ω_A ... orthogonal projection P with rank m
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Takeo Watanabe, José E. Náñez, and Yuka Sasaki. Perceptual learning without perception. Nature, 413:844–848, 2001. doi: 10.1038/35101601
-
[2]
Aaron R. Seitz and Takeo Watanabe. Is subliminal learning really passive? Nature, 422:36, 2003. doi: 10.1038/422036a
-
[4]
Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Sören Mindermann, Jacob Hilton, Samuel Marks, and Owain Evans. Language models transmit behavioural traits through hidden signals in data. Nature, 652(8110):615–621, 2026
-
[5]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015
-
[6]
Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pages 4043–4068, 2025
-
[7]
Evan Hubinger et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024
-
[8]
Alireza Mohammadshahi and Yani Ioannou. What is left after distillation? How knowledge transfer impacts fairness and bias. Transactions on Machine Learning Research, 2025
-
[9]
Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, et al. Poisoning attacks on llms require a near-constant number of poison samples. arXiv preprint arXiv:2510.07192, 2025.
-
[10]
Chayanon Kitkana and Shivam Arora. Sustained gradient alignment mediates subliminal learning in a multi-step setting: Evidence from MNIST auxiliary logit distillation experiment. In ICLR 2026 Workshop on Scientific Methods for Understanding Deep Learning (Sci4DL), 2026
-
[11]
Ishaq Aden-Ali, Noah Golowich, Allen Liu, Abhishek Shetty, Ankur Moitra, and Nika Haghtalab. Subliminal effects in your data: A general mechanism via log-linearity. arXiv preprint arXiv:2602.04863, 2026
-
[12]
Simon Schrodi, Elias Kempf, Fazl Barez, and Thomas Brox. Towards understanding subliminal learning: When and how hidden biases transfer. arXiv preprint arXiv:2509.23886, 2025
-
[13]
Amir Zur, Zhuofan Ying, Alexander Russell Loftus, Kerem Şahin, Steven Yu, Lucia Quirke, Tamar Rott Shaham, Natalie Shapira, Hadas Orgad, and David Bau. Token entanglement in subliminal learning. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025
-
[14]
George Morgulis and John Hewitt. Subliminal steering: Stronger encoding of hidden signals. arXiv preprint arXiv:2604.25783, 2026
-
[15]
Raphael Gontijo Lopes, Stefano Fenu, and Thad Starner. Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535, 2017
-
[16]
Hongxu Yin, Pavlo Molchanov, Jose M. Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K. Jha, and Jan Kautz. Dreaming to distill: Data-free knowledge transfer via DeepInversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8712–8721, 2020
-
[17]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 2002
-
[18]
Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), pages 2921–2926. IEEE, 2017
-
[19]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019
-
[20]
Taehyeon Kim, Jaehoon Oh, NakYil Kim, Sangwook Cho, and Se-Young Yun. Comparing kullback-leibler divergence and mean squared error loss in knowledge distillation. arXiv preprint arXiv:2105.08919, 2021
-
[21]
Ken Perlin. An image synthesizer. ACM Siggraph Computer Graphics, 19(3):287–296, 1985
-
[22]
Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 30–45, 2022
-
[23]
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022
-
[24]
Andrew Lee, Melanie Weber, Fernanda Viégas, and Martin Wattenberg. Shared global and local geometry of language model embeddings. arXiv preprint arXiv:2503.21073, 2025
-
[25]
Charles Goddard and Fernando Fernandes Neto. Training-free tokenizer transplantation via orthogonal matching pursuit. arXiv preprint arXiv:2506.06607, 2025
-
[26]
Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Curran Associates Inc., Red Hook, NY, USA, 2019.
-
[27]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015
-
[28]
Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020. doi: 10.1073/pnas.2015509117. URL https://www.pnas.org/doi/abs/10.1073/pnas.2015509117
-
[29]
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/guo17a.html
-
[30]
Piyush Raikwar and Deepak Mishra. Discovering and overcoming limitations of noise-engineered data-free knowledge distillation. Advances in Neural Information Processing Systems, 35:4902–4912, 2022
-
[31]
Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017
-
[32]
Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International conference on learning representations, 2018
-
[33]
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
A Mathematical Background
A.1 Subliminal Learning Setting
We consider a "black box" neural network model f_θ that maps an input vector x^(i) ∈ R^D into a latent space R^d. For convenience we shall call these latent representations z^(i) = f_θ(x^(i...
-
[34]
The student network needs to learn the latent representation of the teacher sufficiently well. It needs to generalize the prediction of the teacher latent output from noise inputs to the data samples x: f_{θ^(S,final)}(x) ≈ f_{θ^(T,final)}(x). (9)
-
[35]
The final class head of the teacher and the student class head need to be sufficiently close: Ω_C^(T,final) ≈ Ω_C^(S,init). (10) If the student has learned the teacher's latent output and the class head of both is similar enough, the student's classification probabilities will be close to the teacher's. Conversely, having an incorrect class head will degrade...
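Why both conditions are needed together can be seen from a standard triangle-inequality bound on the class logits. This is a sketch under the assumption that the class head acts linearly on the latent; the shorthand $z_S = f_{\theta^{(S,\mathrm{final})}}(x)$, $z_T = f_{\theta^{(T,\mathrm{final})}}(x)$ and the operator norm $\lVert\cdot\rVert_{\mathrm{op}}$ are introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\bigl\lVert \Omega_C^{(S,\mathrm{init})} z_S - \Omega_C^{(T,\mathrm{final})} z_T \bigr\rVert
&= \bigl\lVert \Omega_C^{(S,\mathrm{init})} (z_S - z_T)
   + \bigl(\Omega_C^{(S,\mathrm{init})} - \Omega_C^{(T,\mathrm{final})}\bigr) z_T \bigr\rVert \\
&\le \bigl\lVert \Omega_C^{(S,\mathrm{init})} \bigr\rVert_{\mathrm{op}} \, \lVert z_S - z_T \rVert
   + \bigl\lVert \Omega_C^{(S,\mathrm{init})} - \Omega_C^{(T,\mathrm{final})} \bigr\rVert_{\mathrm{op}} \, \lVert z_T \rVert .
\end{aligned}
```

The first term is small when condition (9) holds and the second when condition (10) holds; if either term stays large, the bound gives no guarantee that the logits agree.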
-
[36]
=: β ∈ O(1), independent of d. Importantly, for high latent dimensions d ≫ 1, these random vectors become pairwise (approximately) orthogonal since their cosine similarity scales as 1/√d. Hence, for typical initializations and m ≪ d, Ω_A^⊺ Ω_A will effectively become a random orthogonal projection of the latent-space onto an m-dimensional sub-space (up to a the c...
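The 1/√d scaling and the near-projection property can be checked numerically. The sketch below assumes i.i.d. Gaussian head entries with variance 1/d, which may differ from the paper's initialization by a constant factor; the function name `head_statistics` and the sizes are hypothetical. It uses the fact that Ω_A^⊺ Ω_A is an exact orthogonal projection precisely when the small m×m matrix Ω_A Ω_A^⊺ equals the identity, so the distance of that matrix from the identity measures how far the head is from a projection.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8  # number of auxiliary logits (illustrative assumption)

def head_statistics(d, rng):
    """Return (mean |cosine| between distinct rows, relative distance of omega omega^T from I_m)."""
    omega = rng.normal(size=(m, d)) / np.sqrt(d)   # assumed init: entries with variance 1/d
    gram = omega @ omega.T                         # (m, m) row inner products
    norms = np.sqrt(np.diag(gram))
    cosines = gram / np.outer(norms, norms)
    off_diagonal = ~np.eye(m, dtype=bool)
    mean_abs_cos = np.abs(cosines[off_diagonal]).mean()
    projection_gap = np.linalg.norm(gram - np.eye(m)) / np.sqrt(m)
    return float(mean_abs_cos), float(projection_gap)

results = {d: head_statistics(d, rng) for d in (64, 1024, 16384)}
for d, (mean_abs_cos, projection_gap) in results.items():
    print(f"d={d:6d}  mean|cos|={mean_abs_cos:.4f}  "
          f"mean|cos|*sqrt(d)={mean_abs_cos * np.sqrt(d):.2f}  "
          f"projection gap={projection_gap:.4f}")
```

If the scaling holds, the rescaled column `mean|cos|*sqrt(d)` should stay roughly constant while both raw quantities shrink as d grows.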