Learning Through Noise: Why Subliminal Learning Works and When It Fails

Belén Hidalgo-Ogalde; Roman D. Ventzke; Valentin Neuhaus; Vincent C. Brockers; Viola Priesemann

arxiv: 2605.23645 · v1 · pith:IRCKLKJBnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI

Vincent C. Brockers, Roman D. Ventzke, Valentin Neuhaus, Belén Hidalgo-Ogalde, Viola Priesemann

Pith reviewed 2026-05-25 05:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: subliminal learning, knowledge distillation, output heads, compatible heads, task-unrelated noise, MNIST experiments, neural network transfer, auxiliary heads

The pith

Subliminal learning from noise requires compatible output heads; matched initializations are neither necessary nor, on their own, sufficient.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that students can acquire task knowledge from teachers by training only on unrelated noise inputs and outputs. This transfer happens because the auxiliary heads (for the noise) and the class heads (for the task) stay compatible between teacher and student. Experiments on MNIST keep this compatibility while randomizing hidden layers, removing or adding layers, and switching from MLP to CNN architectures. With both heads aligned, students reach teacher-level accuracy in favorable regimes. The setting also yields a theory and upper bounds on when the transfer must fail.

Core claim

Subliminal learning is governed by compatible output heads. Splitting outputs into an auxiliary head for task-unrelated noise and a class head for classification allows transfer of a recoverable teacher signal even with random hidden-layer initializations or architectural changes. When class heads remain compatible as well, students trained solely on noise inputs can approach and sometimes match teacher performance on the original task.
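
The split-head setup can be written compactly. The block below is a sketch that uses the paper's symbols where this page provides them ($f_\theta$, $\Omega_A$, $\Omega_C$, latent dimension $d$, $m$ auxiliary neurons, $n$ classes, $N$ noise samples). The bias-free linear heads, the squared-error form of the distillation loss, and the symbols $a$, $c$, and $\xi_i$ are assumptions made for illustration and may differ from the manuscript.

```latex
% Latent representation and the two output heads (heads assumed linear and bias-free)
z = f_\theta(x) \in \mathbb{R}^{d}, \qquad
a = \Omega_A z \in \mathbb{R}^{m} \ \text{(auxiliary logits)}, \qquad
c = \Omega_C z \in \mathbb{R}^{n} \ \text{(class logits)}

% Student trained only on noise inputs \xi_1, \dots, \xi_N (squared error is an assumed loss)
\mathcal{L}_{\mathrm{aux}}
  = \frac{1}{N} \sum_{i=1}^{N}
    \bigl\lVert \Omega_A^{(S)} f_{\theta^{(S)}}(\xi_i) - \Omega_A^{(T)} f_{\theta^{(T)}}(\xi_i) \bigr\rVert^{2}

% Compatibility in the sense of the core claim
\Omega_A^{(S)} \approx \Omega_A^{(T)}, \qquad \Omega_C^{(S)} \approx \Omega_C^{(T)}
```

Read this way, compatible auxiliary heads let the noise-only loss pull the student's latent map toward the teacher's along the directions the aux head can read out, and compatible class heads turn matching latents into matching class logits.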

What carries the argument

Compatible output heads (auxiliary head for noise signals plus class head for classification) that keep the teacher signal recoverable in the student.

If this is right

Subliminal learning persists without shared or matched initializations between teacher and student.
Students reach near teacher accuracy on the task when both auxiliary and class heads stay compatible.
Upper bounds on failure can be derived from the head-compatibility condition alone.
Architecture modifications such as layer removal, addition, or MLP-to-CNN switches do not block transfer if heads remain compatible.
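
The first consequence in this list can be illustrated in a deliberately simplified toy. The sketch below is not the paper's MNIST setup: it assumes a purely linear latent map, bias-free linear heads, a mean-squared-error loss on auxiliary logits, and a randomly drawn stand-in for the trained teacher. All names and sizes are hypothetical. Teacher and student share only the two output heads, the student's hidden weights start from an independent random draw, and training uses Gaussian noise inputs only.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, m, n_classes = 20, 8, 32, 5   # input dim, latent dim, aux neurons, classes (toy sizes)

# Shared ("compatible") heads: the same matrices are used by teacher and student.
omega_A = rng.normal(size=(m, d)) / np.sqrt(d)          # aux head
omega_C = rng.normal(size=(n_classes, d)) / np.sqrt(d)  # class head

W_teacher = rng.normal(size=(d, D))  # stand-in for a trained teacher's hidden weights
W_student = rng.normal(size=(d, D))  # independent random initialization


def class_agreement(W_a, W_b, x):
    """Fraction of inputs on which two models predict the same class."""
    pred_a = np.argmax(x @ W_a.T @ omega_C.T, axis=1)
    pred_b = np.argmax(x @ W_b.T @ omega_C.T, axis=1)
    return float(np.mean(pred_a == pred_b))


x_eval = rng.normal(size=(1000, D))  # held-out inputs standing in for task data
agreement_before = class_agreement(W_teacher, W_student, x_eval)

# Distill on noise: the student only ever sees the teacher's aux logits.
noise = rng.normal(size=(500, D))
aux_targets = noise @ W_teacher.T @ omega_A.T
lr = 0.05
for _ in range(2000):
    err = noise @ W_student.T @ omega_A.T - aux_targets   # shape (N, m)
    grad = omega_A.T @ err.T @ noise / len(noise)          # gradient of 0.5 * mean ||err||^2
    W_student -= lr * grad

agreement_after = class_agreement(W_teacher, W_student, x_eval)
print(f"agreement before: {agreement_before:.2f}, after: {agreement_after:.2f}")
```

In this linear toy the aux head has more neurons than latent dimensions (m > d), so matching aux logits on noise pins down the teacher's latent map. That is a property of the toy and says nothing about the bottleneck and saturation regimes the paper reports for nonlinear networks.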

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Design choices that enforce head compatibility could be used to control unintended bias transfer in distillation pipelines.
The same compatibility principle might explain limits on knowledge transfer in other settings where inputs are replaced by noise or synthetic data.
Testing the bounds on larger image or language models would show whether head compatibility remains the dominant constraint outside the MNIST regime.

Load-bearing premise

The controlled MNIST setup with explicitly split auxiliary and class heads isolates head compatibility as the decisive factor and supports general upper bounds independent of task or data.

What would settle it

Finding reliable subliminal transfer when the auxiliary or class heads are made incompatible, or finding no transfer when the heads are kept compatible across the tested architecture changes, would falsify the claim.
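
A test of this kind can be sketched in a linear toy. The code below is an illustration under strong assumptions, not the paper's experiment: the latent map is linear, heads are bias-free matrices, and the student is taken to be the exact least-squares minimizer of the auxiliary-logit error (for full-rank noise inputs this minimizer does not depend on the particular noise samples). The perturbation strength `delta` is expressed in units of the toy head scale, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, m, n_classes = 20, 8, 32, 5   # toy sizes, chosen for illustration only

omega_A = rng.normal(size=(m, d)) / np.sqrt(d)          # teacher aux head
omega_C = rng.normal(size=(n_classes, d)) / np.sqrt(d)  # class head, shared here
W_teacher = rng.normal(size=(d, D))                     # stand-in for trained teacher weights
x_eval = rng.normal(size=(2000, D))
teacher_pred = np.argmax(x_eval @ W_teacher.T @ omega_C.T, axis=1)


def agreement_with_perturbed_aux_head(delta):
    """Teacher-student class agreement when the student's aux head is perturbed."""
    perturbation = rng.normal(size=(m, d)) / np.sqrt(d)
    omega_A_student = omega_A + delta * perturbation
    # Student hidden weights minimizing ||omega_A_student @ W - omega_A @ W_teacher||_F.
    W_student, *_ = np.linalg.lstsq(omega_A_student, omega_A @ W_teacher, rcond=None)
    student_pred = np.argmax(x_eval @ W_student.T @ omega_C.T, axis=1)
    return float(np.mean(student_pred == teacher_pred))


deltas = [0.0, 0.5, 2.0, 20.0]
agreements = [agreement_with_perturbed_aux_head(delta) for delta in deltas]
for delta, agreement in zip(deltas, agreements):
    print(f"delta={delta:>5}: agreement={agreement:.2f}")
```

With an unperturbed head the student recovers the teacher's latent map; with a strongly perturbed head it fits the aux logits through a distorted latent map, which the shared class head then misreads. A result of the opposite kind in the real setup would count against the claim.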

Figures

Figures reproduced from arXiv: 2605.23645 by Belén Hidalgo-Ogalde, Roman D. Ventzke, Valentin Neuhaus, Vincent C. Brockers, Viola Priesemann.

**Figure 1.** Subliminal learning transfers task information through task-unrelated noise via compatible teacher–student output heads. (a) We study models whose latent representation is connected to an aux head Ω_A producing task-unrelated auxiliary logits and a class head Ω_C producing task logits. (b) A teacher model is first trained on labeled MNIST [16] using the class head. (c) The trained teacher is then queried on …

**Figure 2.** Subliminal learning is robust to hidden-layer random initialization but fragile to output-head incompatibility. We test which shared student–teacher components are required for subliminal learning by randomly reinitializing different parts of the student before training. Orange regions indicate reinitialized components, red bars show student accuracy after training, and the blue bar shows teacher accuracy. …

**Figure 3.** Compatible output heads are sufficient for subliminal learning across architectural and dataset changes, but student capacity and task complexity determine recoverability. We test subliminal learning when teacher and student share only the aux and class heads, keeping the teacher setup fixed while the student hidden architecture or task is varied. (a) Varying the student first hidden-layer dimens…

**Figure 4.** Subliminal learning depends jointly on aux-head capacity and noise samples, with clear regimes of bottleneck, recovery, and saturation. We vary the number of auxiliary neurons m and the number of noise samples N seen per student epoch to test how recoverability depends on aux-head capacity and noise exposure. (a) Student accuracy across the (m, N) plane. Increasing either m or N improves subliminal learnin…

**Figure 5.** Output-head perturbations reveal compatibility limits for subliminal learning and validate theory-derived robustness bounds. Gaussian noise of strength δ is added to either the student’s class or aux head before auxiliary training. (a–d) Perturbing either head reduces student accuracy and teacher–student head similarity, indicating that both readout and aux-head compatibility are required for successful si…

**Figure 6.** Shared initialization alone does not guarantee subliminal learning; instead, excessive latent dimensionality drives head drift and eventually breaks the effect. We vary the shared latent dimension d of teacher and student while keeping architecture and initialization otherwise identical. (a) Teacher accuracy quickly saturates with increasing d, whereas student accuracy first improves and then collapses at l…

**Figure 7.** Perturbing the student’s aux head reduces the similarity between teacher and student hidden-layer updates, forcing the student to learn a rotated teacher representation. Without perturbation we observe an alignment of teacher and student weight changes, as presented in eq. (19). Training was performed for a single epoch to capture changes at the beginning of training. The theory describing an upper bound for…

**Figure 8.** The class head remains stable during initial training of a randomly initialized teacher. (a) After random weight initialization, class-head vectors ω_c are statistically approximately orthogonal, while the latent representation of the training data forms an unstructured point cloud with inseparable classes. (b) After some training, symmetry is broken and the representation adapts in the directions of their …

**Figure 9.** Increasing d_latent amplifies the effective supervised signal at the class head by making the frozen latent representation more linearly separable. Teacher accuracy with only the class head trainable, plotted against latent dimension d_latent. Since all earlier layers remain fixed at initialization, performance improvements must arise from the readout alone. Larger latent spaces therefore provide a more favo…

**Figure 10.** For training a CNN student, we test different resolution levels of spatially correlated Perlin …

**Figure 11.** We fix the aux-head weights during training, to separate the effect of self-correction and …

**Figure 12.** Output-head perturbations reveal compatibility limits for subliminal learning. The scale of the perturbation strength goes beyond the scale of ≈ 0.062, which is the average weight scale of the network. The decrease in accuracy and cosine similarity observed in fig. 5 continues beyond this level.

**Figure 13.** Fixed-head ablations identify aux-head drift as the dominant high-dimensional failure mode. We repeat the latent-dimension sweep while freezing different output heads. (a,e,i) In the baseline, student accuracy collapses at large d as both teacher class-head drift and student aux-head drift increase. (b,f,j) Fixing the aux head strongly reduces the high-d collapse even though the teacher class head still d…

**Figure 14.** The decrease in accuracy also observed for the MNIST case is caused by the size of the aux head. We repeat the latent-dimension sweep performed in fig. 6 and fig. 13 for a setup trained on the balanced EMNIST with n = 47 classes. The dip observed in accuracy after d = m indicates that this dip is not caused by the number of classes and rather by the number of auxiliary neurons.

Original abstract

In the context of artificial neural networks, subliminal learning refers to the transfer of task-relevant knowledge or unintended biases from teacher to student models through distillation on task-unrelated input–output pairs. Prior explanations tie this effect to shared or closely matched teacher–student initialization. We show that a closely matched initialization is not necessary. Instead, subliminal learning is governed by compatible output heads. Using a controlled MNIST setting, we split outputs into an auxiliary head (for auxiliary, task-unrelated noise signals) and a class head (for classification) to demonstrate that subliminal learning occurs, even when we randomly initialize hidden layers and remove layers, add new layers, or change the architecture (MLP-to-CNN). Compatible auxiliary heads enable transfer of a recoverable teacher signal, bringing the student's representations closer to the teacher's. When the class heads remain compatible as well, students trained only on task-unrelated noise can approach, and in favorable regimes match, teacher-level task performance. Our setting enables us to develop a theory that explains the mechanism of subliminal learning and to derive upper bounds on when subliminal learning fails. Together, our results turn subliminal learning from a surprising transfer effect into a theoretically grounded mechanism with predictable limits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper argues compatible output heads drive subliminal learning instead of matched initialization, shown via split-head MNIST experiments and some derived bounds, but the bounds' independence from the specific setup is the main open question.

The main thing to know is that this work claims subliminal learning transfers task knowledge through unrelated noise signals because of compatible output heads rather than shared initialization. In a controlled MNIST setup they split the output into an auxiliary head for noise and a class head, then show the transfer still happens after random hidden-layer initialization, layer additions or removals, and even MLP-to-CNN changes. When the class heads also match, student performance can approach the teacher's on the original task. They use this to sketch a theory and upper bounds on failure modes.

That reframing and the controlled isolation of head compatibility are the actual new pieces relative to earlier initialization-focused accounts. The experiments do a reasonable job of varying architecture while keeping the head factor fixed, which lets them test the prior story directly. The soft spot is that the upper bounds are derived inside this explicit head-split construction. The stress-test note is right that this factorization is not standard in distillation, so it is unclear whether the bounds hold independently of task or data distribution without further checks.

The paper is aimed at distillation researchers who care about transfer mechanisms. It has a distinct claim, usable experimental control, and an attempt at theory, so it deserves a serious referee to examine the derivations and the scope of the bounds. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that subliminal learning—the transfer of task-relevant knowledge via distillation on task-unrelated input-output pairs—is governed by compatible output heads rather than shared or matched initializations. In a controlled MNIST setup, outputs are partitioned into an auxiliary head (for task-unrelated noise) and a class head; experiments show transfer occurs even after random initialization of hidden layers, layer removal/addition, or MLP-to-CNN architecture changes. Compatible auxiliary heads bring student representations closer to the teacher's, and when class heads are also compatible, students can approach or match teacher task performance. The setting is used to derive a theory of the mechanism and upper bounds on failure conditions.

Significance. If the central claim and bounds hold beyond the specific construction, the work would convert subliminal learning from an empirical curiosity into a mechanistically understood process with testable limits, with potential implications for distillation, knowledge transfer, and bias propagation in neural networks. The controlled isolation of head compatibility and the attempt to derive bounds are positive features.

major comments (2)

[theory / upper-bounds section] Theory/upper-bounds derivation (referenced in abstract as enabling 'upper bounds on when subliminal learning fails'): the bounds are obtained inside the MNIST split-head construction, where auxiliary noise signals are defined to be task-unrelated and the heads are explicitly factored. The derivation therefore relies on properties (e.g., orthogonality between auxiliary signals and class logits) that are introduced by the architectural partition itself; it is not shown that the same bounds remain valid for standard distillation pipelines that lack this explicit factorization, undermining the claim that the bounds are task- and distribution-independent.
[experimental results on architecture transfer] Experimental claims (§ on architecture changes and performance matching): the demonstration that students match teacher performance when class heads remain compatible is shown only inside the auxiliary/class-head split. Because the split is an additional modeling assumption not present in conventional distillation, the results do not yet establish that head compatibility (rather than the split itself) is the governing factor in general settings.

minor comments (2)

Clarify the precise mathematical definition of 'head compatibility' (e.g., whether it is measured by cosine similarity of weight matrices, logit correlation, or another metric) and state it before the experimental sections.
The abstract states that 'closely matched initialization is not necessary'; the manuscript should explicitly contrast the random-initialization regime against a matched-initialization baseline in the same figure or table to make the comparison quantitative.
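
The first minor comment lists cosine similarity of weight matrices as one candidate compatibility metric. A minimal sketch of that candidate follows. It is an assumption about how such a metric could be defined, not the manuscript's definition; `head_cosine` and the array shapes are hypothetical.

```python
import numpy as np


def head_cosine(head_a, head_b):
    """Cosine similarity between two head weight matrices, treated as flat vectors."""
    a = np.ravel(head_a)
    b = np.ravel(head_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
teacher_head = rng.normal(size=(10, 64))   # e.g. 10 classes, latent dimension 64

identical = head_cosine(teacher_head, teacher_head)
perturbed = head_cosine(teacher_head, teacher_head + rng.normal(size=(10, 64)))
independent = head_cosine(teacher_head, rng.normal(size=(10, 64)))
print(identical, perturbed, independent)
```

For Gaussian entries, a perturbation with the same variance as the head itself gives a similarity near 1/√2, and an independently drawn head gives a value near zero. This flattened-weight metric is sensitive to row permutations and to rotations of the latent space, so a logit-correlation metric on probe inputs, the other candidate the comment names, could behave differently.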

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying the intended scope of our controlled construction while agreeing that explicit statements about its limitations are needed.

Referee: [theory / upper-bounds section] Theory/upper-bounds derivation (referenced in abstract as enabling 'upper bounds on when subliminal learning fails'): the bounds are obtained inside the MNIST split-head construction, where auxiliary noise signals are defined to be task-unrelated and the heads are explicitly factored. The derivation therefore relies on properties (e.g., orthogonality between auxiliary signals and class logits) that are introduced by the architectural partition itself; it is not shown that the same bounds remain valid for standard distillation pipelines that lack this explicit factorization, undermining the claim that the bounds are task- and distribution-independent.

Authors: We agree that the upper bounds and mechanistic derivation are obtained inside the split-head MNIST construction, where the explicit auxiliary/class factorization introduces the orthogonality and task-unrelated signal properties used in the proofs. The manuscript does not demonstrate that identical bounds hold verbatim in unfactored standard distillation pipelines. In revision we will (i) qualify the abstract and theory section to state that the bounds characterize failure modes within this controlled isolation of head compatibility, and (ii) add a limitations paragraph explaining that the construction provides a tractable setting for deriving explicit limits rather than claiming immediate task- and distribution-independence for arbitrary pipelines. This revision will be made. revision: yes
Referee: [experimental results on architecture transfer] Experimental claims (§ on architecture changes and performance matching): the demonstration that students match teacher performance when class heads remain compatible is shown only inside the auxiliary/class-head split. Because the split is an additional modeling assumption not present in conventional distillation, the results do not yet establish that head compatibility (rather than the split itself) is the governing factor in general settings.

Authors: The split-head construction is deliberately introduced to hold all other variables fixed while varying only head compatibility, thereby isolating it from initialization and architecture effects. The reported architecture-transfer and performance-matching results therefore hold under this controlled isolation. We do not claim the split itself is present in conventional distillation; rather, the experiments show that once heads are compatible, transfer occurs even after random hidden-layer re-initialization, layer addition/removal, and MLP-to-CNN changes. In revision we will add a dedicated discussion paragraph that (a) reiterates the role of the split as an experimental control and (b) sketches how head-compatibility diagnostics could be applied in unfactored pipelines. No new experiments are planned for this revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained within controlled setting

The paper uses a controlled MNIST setup with explicitly split auxiliary and class heads to demonstrate that subliminal learning depends on head compatibility (rather than initialization) and to derive a theory plus upper bounds on failure within that framework. The abstract states the setting 'enables us to develop a theory... and to derive upper bounds,' without claiming task- or distribution-independent generality. No equations, self-citations, or reductions are quoted that would make any prediction equivalent to its inputs by construction. The central claims remain experimentally grounded in the described architecture rather than tautological or fitted-by-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no information available on free parameters, axioms, or invented entities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

subliminal learning is governed by compatible output heads... aux head Ω_A ... class head Ω_C ... random projection of the latent-space
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

Relation between the paper passage and the cited Recognition theorem.

⟨Δθ^(S), Δθ^(T)⟩ > 0 almost surely ... Ω_A^⊺ Ω_A ... orthogonal projection P with rank m

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

[1] Takeo Watanabe, José E. Náñez, and Yuka Sasaki. Perceptual learning without perception. Nature, 413:844–848, 2001. doi: 10.1038/35101601
[2] Aaron R. Seitz and Takeo Watanabe. Is subliminal learning really passive? Nature, 422:36, 2003. doi: 10.1038/422036a
[3] Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Sören Mindermann, Jacob Hilton, Samuel Marks, and Owain Evans. Language models transmit behavioural traits through hidden signals in data. Nature, 652(8110):615–621, 2026.
[4] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[5] Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pages 4043–4068, 2025.
[6] Evan Hubinger et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024.
[7] Alireza Mohammadshahi and Yani Ioannou. What is left after distillation? How knowledge transfer impacts fairness and bias. Transactions on Machine Learning Research, 2025.
[8] Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, et al. Poisoning attacks on LLMs require a near-constant number of poison samples. arXiv preprint arXiv:2510.07192, 2025.
[9] Chayanon Kitkana and Shivam Arora. Sustained gradient alignment mediates subliminal learning in a multi-step setting: Evidence from MNIST auxiliary logit distillation experiment. In ICLR 2026 Workshop on Scientific Methods for Understanding Deep Learning (Sci4DL), 2026.
[10] Ishaq Aden-Ali, Noah Golowich, Allen Liu, Abhishek Shetty, Ankur Moitra, and Nika Haghtalab. Subliminal effects in your data: A general mechanism via log-linearity. arXiv preprint arXiv:2602.04863, 2026.
[11] Simon Schrodi, Elias Kempf, Fazl Barez, and Thomas Brox. Towards understanding subliminal learning: When and how hidden biases transfer. arXiv preprint arXiv:2509.23886, 2025.
[12] Amir Zur, Zhuofan Ying, Alexander Russell Loftus, Kerem Şahin, Steven Yu, Lucia Quirke, Tamar Rott Shaham, Natalie Shapira, Hadas Orgad, and David Bau. Token entanglement in subliminal learning. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025.
[13] George Morgulis and John Hewitt. Subliminal steering: Stronger encoding of hidden signals. arXiv preprint arXiv:2604.25783, 2026.
[14] Raphael Gontijo Lopes, Stefano Fenu, and Thad Starner. Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535, 2017.
[15] Hongxu Yin, Pavlo Molchanov, Jose M. Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K. Jha, and Jan Kautz. Dreaming to distill: Data-free knowledge transfer via DeepInversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8712–8721, 2020.
[16] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[17] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926. IEEE, 2017.
[18] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[19] Taehyeon Kim, Jaehoon Oh, NakYil Kim, Sangwook Cho, and Se-Young Yun. Comparing Kullback-Leibler divergence and mean squared error loss in knowledge distillation. arXiv preprint arXiv:2105.08919, 2021.
[20] Ken Perlin. An image synthesizer. ACM SIGGRAPH Computer Graphics, 19(3):287–296, 1985.
[21] Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, 2022.
[22] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.
[23] Andrew Lee, Melanie Weber, Fernanda Viégas, and Martin Wattenberg. Shared global and local geometry of language model embeddings. arXiv preprint arXiv:2503.21073, 2025.
[24] Charles Goddard and Fernando Fernandes Neto. Training-free tokenizer transplantation via orthogonal matching pursuit. arXiv preprint arXiv:2506.06607, 2025.
[25] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Curran Associates Inc., Red Hook, NY, USA, 2019.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[27] Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020. doi: 10.1073/pnas.2015509117. URL https://www.pnas.org/doi/abs/10.1073/pnas.2015509117
[28] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/guo17a.html
[29] Piyush Raikwar and Deepak Mishra. Discovering and overcoming limitations of noise-engineered data-free knowledge distillation. Advances in Neural Information Processing Systems, 35:4902–4912, 2022.
[30] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.
[31] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2018.
[32] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Appendix excerpts (extracted alongside the references)

A Mathematical Background

A.1 Subliminal Learning Setting

We consider a "black box" neural network model $f_\theta$ that maps an input vector $x^{(i)} \in \mathbb{R}^{D}$ into a latent space $\mathbb{R}^{d}$. For convenience we shall call these latent representations $z^{(i)} = f_\theta(x^{(i)})$ …

The student network needs to learn the latent representation of the teacher sufficiently well. It needs to generalize the prediction of the teacher latent output from noise inputs to the data samples $x$:

$$f_{\theta^{(S,\mathrm{final})}}(x) \approx f_{\theta^{(T,\mathrm{final})}}(x). \tag{9}$$

The final class head of the teacher and the student class head need to be sufficiently close:

$$\Omega_C^{(T,\mathrm{final})} \approx \Omega_C^{(S,\mathrm{init})}. \tag{10}$$

If the student has learned the teacher’s latent output and the class heads of both are similar enough, the student’s classification probabilities will be close to the teacher’s. Conversely, having an incorrect class head will degrade …

… $=: \beta \in \mathcal{O}(1)$, independent of $d$. Importantly, for high latent dimensions $d \gg 1$, these random vectors become pairwise (approximately) orthogonal, since their cosine similarity scales as $1/\sqrt{d}$. Hence, for typical initializations and $m \ll d$, $\Omega_A^{\top} \Omega_A$ will effectively become a random orthogonal projection of the latent space onto an $m$-dimensional subspace (up to a the c…

[1] Takeo Watanabe, José E. Náñez, and Yuka Sasaki. Perceptual learning without perception. Nature, 413:844–848, 2001. doi: 10.1038/35101601

[2] Aaron R. Seitz and Takeo Watanabe. Is subliminal learning really passive? Nature, 422:36, 2003. doi: 10.1038/422036a

[4] Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Sören Mindermann, Jacob Hilton, Samuel Marks, and Owain Evans. Language models transmit behavioural traits through hidden signals in data. Nature, 652(8110):615–621, 2026

[5] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

[6] Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pages 4043–4068, 2025

[7] Evan Hubinger et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024

[8] Alireza Mohammadshahi and Yani Ioannou. What is left after distillation? How knowledge transfer impacts fairness and bias. Transactions on Machine Learning Research, 2025

[9] Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, et al. Poisoning attacks on LLMs require a near-constant number of poison samples. arXiv preprint arXiv:2510.07192, 2025

[10] Chayanon Kitkana and Shivam Arora. Sustained gradient alignment mediates subliminal learning in a multi-step setting: Evidence from MNIST auxiliary logit distillation experiment. In ICLR 2026 Workshop on Scientific Methods for Understanding Deep Learning (Sci4DL), 2026

[11] Ishaq Aden-Ali, Noah Golowich, Allen Liu, Abhishek Shetty, Ankur Moitra, and Nika Haghtalab. Subliminal effects in your data: A general mechanism via log-linearity. arXiv preprint arXiv:2602.04863, 2026

[12] Simon Schrodi, Elias Kempf, Fazl Barez, and Thomas Brox. Towards understanding subliminal learning: When and how hidden biases transfer. arXiv preprint arXiv:2509.23886, 2025

[13] Amir Zur, Zhuofan Ying, Alexander Russell Loftus, Kerem Şahin, Steven Yu, Lucia Quirke, Tamar Rott Shaham, Natalie Shapira, Hadas Orgad, and David Bau. Token entanglement in subliminal learning. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025

[14] George Morgulis and John Hewitt. Subliminal steering: Stronger encoding of hidden signals. arXiv preprint arXiv:2604.25783, 2026

[15] Raphael Gontijo Lopes, Stefano Fenu, and Thad Starner. Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535, 2017

[16] Hongxu Yin, Pavlo Molchanov, Jose M. Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K. Jha, and Jan Kautz. Dreaming to distill: Data-free knowledge transfer via DeepInversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8712–8721, 2020

[17] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

[18] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926. IEEE, 2017

[19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019

[20] Taehyeon Kim, Jaehoon Oh, NakYil Kim, Sangwook Cho, and Se-Young Yun. Comparing Kullback-Leibler divergence and mean squared error loss in knowledge distillation. arXiv preprint arXiv:2105.08919, 2021

[21] Ken Perlin. An image synthesizer. ACM SIGGRAPH Computer Graphics, 19(3):287–296, 1985

[22] Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, 2022

[23] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022

[24] Andrew Lee, Melanie Weber, Fernanda Viégas, and Martin Wattenberg. Shared global and local geometry of language model embeddings. arXiv preprint arXiv:2503.21073, 2025

[25] Charles Goddard and Fernando Fernandes Neto. Training-free tokenizer transplantation via orthogonal matching pursuit. arXiv preprint arXiv:2506.06607, 2025

[26] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Curran Associates Inc., Red Hook, NY, USA, 2019

[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015

[28] Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020. doi: 10.1073/pnas.2015509117. URL https://www.pnas.org/doi/abs/10.1073/pnas.2015509117

[29] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/guo17a.html

[30] Piyush Raikwar and Deepak Mishra. Discovering and overcoming limitations of noise-engineered data-free knowledge distillation. Advances in Neural Information Processing Systems, 35:4902–4912, 2022

[31] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017

[32] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2018

[33] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

A Mathematical Background

A.1 Subliminal Learning Setting

We consider a "black box" neural network model $f_\theta$ that maps an input vector $x^{(i)} \in \mathbb{R}^D$ into a latent space $\mathbb{R}^d$. For convenience we shall call these latent representations z(i) = fθ(x(i...

[34] The student network needs to learn the latent representation of the teacher sufficiently well. It needs to generalize the prediction of the teacher latent output from noise inputs to the data samples $x$:

$$f_{\theta^{(S,\mathrm{final})}}(x) \approx f_{\theta^{(T,\mathrm{final})}}(x). \tag{9}$$

[35] The final class head of the teacher and the student class head need to be sufficiently close:

$$\Omega_C^{(T,\mathrm{final})} \approx \Omega_C^{(S,\mathrm{init})}. \tag{10}$$

If the student has learned the teacher's latent output and the class heads of both are similar enough, the student's classification probabilities will be close to the teacher's. Conversely, having an incorrect class head will degrade...
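The effect of a class-head mismatch on the predicted probabilities can be illustrated with a small numerical sketch. It assumes, hypothetically, a linear class head followed by a softmax and a student that reproduces the teacher latent exactly, as in Eq. (9); the function names, the dimensions, and the Gaussian perturbation are illustrative choices and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def class_prob_gap(eps, d=32, n_classes=10, seed=0):
    """Compare teacher and student class probabilities for one latent z.

    Assumption: the student head equals the teacher head plus eps times a
    fixed random perturbation, and both models share the same latent z.
    Returns (largest probability difference, l2 norm of the logit shift).
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    head_t = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(n_classes)]
    noise = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(n_classes)]
    head_s = [[w + eps * n for w, n in zip(rw, rn)] for rw, rn in zip(head_t, noise)]

    logits_t = matvec(head_t, z)
    logits_s = matvec(head_s, z)
    p_t, p_s = softmax(logits_t), softmax(logits_s)

    gap = max(abs(a - b) for a, b in zip(p_t, p_s))
    shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(logits_t, logits_s)))
    return gap, shift


if __name__ == "__main__":
    for eps in (0.0, 0.01, 0.1, 1.0):
        gap, shift = class_prob_gap(eps)
        print(f"eps={eps}: probability gap={gap:.4f}, logit shift={shift:.4f}")
```

Because the softmax is Lipschitz in the logits, the probability gap cannot exceed the logit shift, so under these assumptions a small head mismatch yields a small change in the classification probabilities, while a large mismatch leaves them unconstrained.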

[36] $=: \beta \in O(1)$, independent of $d$. Importantly, for high latent dimensions $d \gg 1$, these random vectors become pairwise (approximately) orthogonal since their cosine similarity scales as $1/\sqrt{d}$. Hence, for typical initializations and $m \ll d$, $\Omega_A^\top \Omega_A$ will effectively become a random orthogonal projection of the latent space onto an $m$-dimensional subspace (up to a the c...
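The $1/\sqrt{d}$ scaling of the cosine similarity between random vectors can be checked numerically. The following is a minimal sketch that assumes i.i.d. Gaussian entries as a stand-in for a typical initialization; the helper names and the number of sampled pairs are hypothetical and not taken from the paper.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a Gaussian vector in R^d and normalise it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rms_cosine(d, pairs=200, seed=0):
    """Root-mean-square cosine similarity over independent random pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = random_unit_vector(d, rng)
        b = random_unit_vector(d, rng)
        cos = sum(x * y for x, y in zip(a, b))
        total += cos * cos
    return math.sqrt(total / pairs)


if __name__ == "__main__":
    # Print the measured value next to the 1/sqrt(d) reference.
    for d in (16, 256, 1024):
        print(d, round(rms_cosine(d), 4), round(1.0 / math.sqrt(d), 4))
```

For independent unit vectors the expected squared cosine is exactly $1/d$, so the measured root-mean-square value should track $1/\sqrt{d}$ up to sampling noise, which is the sense in which the rows of a randomly initialized head become approximately orthogonal for large $d$.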