Subliminal Learning is a LoRA Artifact

Ari Holtzman; Harvey Yiyun Fu; Mark Muchane; Todd Nief

arxiv: 2606.00831 · v1 · pith:S2ABX6WEnew · submitted 2026-05-30 · 💻 cs.AI · cs.LG

Todd Nief, Harvey Yiyun Fu, Mark Muchane, Ari Holtzman

Pith reviewed 2026-06-28 18:26 UTC · model grok-4.3

keywords: subliminal learning, LoRA, finetuning, behavioral transmission, language models, context dependence, LoRA rank

The pith

Subliminal learning transmits behavioral traits only under LoRA finetuning and vanishes with full parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that subliminal learning, where a teacher model passes traits like cat obsession to a student via numerical sequences, occurs only when finetuning uses LoRA. Transmission strength forms an inverted U with LoRA rank and disappears entirely under full finetuning. The effect further requires matching context tokens between finetuning and evaluation phases, localizing the behavior to those shared positions. A sympathetic reader would care because this frames the phenomenon as a narrow artifact rather than a general transmission channel in language models.

Core claim

Subliminal learning is a LoRA artifact. When subliminal learning occurs, transmission has an inverted U-shaped relationship with LoRA rank; it also disappears with full finetuning. Subliminal learning is highly dependent on the context seen during finetuning and evaluation. For example, a Qwen model with the default system prompt during finetuning does not show subliminal learning during generation when no system prompt is included. Subliminal behavior is localized to computation at tokens seen during both finetuning and evaluation.

What carries the argument

LoRA finetuning whose rank and shared context tokens with evaluation control the appearance and strength of behavioral transmission.
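As a rough illustration of the mechanism named here, the sketch below adds a low-rank update BA on top of a frozen weight W, but only at selected token positions. It loosely mirrors the position-selective adapter experiments described in the Figure 3 and Figure 4 captions. The function names, the toy matrices, and the `scale` parameter are assumptions made for illustration; this is not the authors' code.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gated_lora_forward(W, A, B, xs, active_positions, scale=1.0):
    """Compute y_t = W x_t, adding scale * B (A x_t) only at active positions.

    W: d_out x d_in frozen weight; A: r x d_in; B: d_out x r (r is the LoRA rank).
    xs: one input vector per token position.
    """
    outputs = []
    for t, x in enumerate(xs):
        y = matvec(W, x)
        if t in active_positions:
            delta = matvec(B, matvec(A, x))
            y = [yi + scale * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy example (hypothetical values): a rank-1 adapter on a 2x2 identity weight.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0]]       # r x d_in, with r = 1
B = [[0.0], [2.0]]     # d_out x r
xs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Adapter switched on only at position 0 (standing in for a shared context token).
gated = gated_lora_forward(W, A, B, xs, active_positions={0})
# Adapter switched on at every position, as in ordinary LoRA inference.
full = gated_lora_forward(W, A, B, xs, active_positions={0, 1, 2})
```

Comparing `gated` with `full` shows which positions the adapter actually changes; in a real model the comparison would be made on generated text rather than on raw vectors.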

If this is right

Behavioral transmission peaks at intermediate LoRA ranks rather than scaling with rank.
Full finetuning removes the transmission effect entirely.
Altering system prompts or chat templates between finetuning and evaluation blocks the effect.
The transmitted behavior is confined to tokens present in both finetuning and evaluation.
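The first consequence above can be checked crudely on a rank sweep: an inverted U implies that the best rank lies strictly inside the tested range. The sketch below is a minimal check under that assumption. The helper names and the numbers are hypothetical and are not results from the paper, and an interior peak alone does not establish the shape statistically.

```python
def peak_rank(rate_by_rank):
    """Return the rank with the highest transmission rate."""
    return max(rate_by_rank, key=rate_by_rank.get)


def has_interior_peak(rate_by_rank):
    """True if the best rank is neither the smallest nor the largest rank tested."""
    ranks = sorted(rate_by_rank)
    best = peak_rank(rate_by_rank)
    return ranks[0] < best < ranks[-1]


# Hypothetical numbers for illustration only; not results from the paper.
example_sweep = {1: 0.02, 4: 0.10, 16: 0.45, 64: 0.30, 256: 0.05}
monotone_sweep = {1: 0.02, 4: 0.10, 16: 0.45, 64: 0.60, 256: 0.80}

print(peak_rank(example_sweep), has_interior_peak(example_sweep))
print(peak_rank(monotone_sweep), has_interior_peak(monotone_sweep))
```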

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Safety evaluations of model behavior transfer should test both LoRA and full finetuning to avoid overestimating risks from narrow artifacts.
The context-token localization suggests that controlling shared templates could serve as a practical mitigation even under LoRA.
Similar rank-dependent or context-dependent artifacts may appear in other behavioral phenomena studied under LoRA-only regimes.

Load-bearing premise

Switching from LoRA to full finetuning isolates the update mechanism without changes in optimization trajectory or effective learning rate that could themselves remove the transmission.

What would settle it

Demonstrating subliminal learning under full finetuning when learning rates and optimization details are matched to the LoRA setting, or finding no inverted-U dependence on rank.

Figures

Figures reproduced from arXiv: 2606.00831 by Ari Holtzman, Harvey Yiyun Fu, Mark Muchane, Todd Nief.

**Figure 2.** Subliminal learning shows an inverted U-shaped relationship with LoRA rank, with “cat” …

**Figure 3.** Left: We selectively add LoRA adapters at specific token positions, components, and layers during generation. In this example, we add only the FFN down projection adapters at the “Qwen” token positions in the earlier layers, which is sufficient to recover subliminal learning. Right: Turning the LoRA adapters on only at the “Qwen” token positions during generation recovers most of the subliminal learning ef…

**Figure 4.** Left: Turning the LoRA adapters on only at the FFNs for the Qwen tokens in the first half of the model recovers the subliminal learning effect, while turning the LoRA adapters on at all other token positions, components, and layers removes the effect. Right: For all tested animals, the first singular vector of the learned BA matrix is sufficient to recover most of the subliminal learning effect at LoRA ran…

**Figure 5.** We conduct activation patching from a donor model (LoRA adapters with a subliminal …

**Figure 6.** Teacher temperature sweep across LoRA ranks for cat and eagle. Cat transfers best with …

**Figure 7.** Left: When models are finetuned with an empty string as the system prompt, the subliminal learning signal concentrates on the chat template tokens that are shared across finetuning and evaluation. Turning the LoRA adapters on only at chat template tokens during evaluation recovers the subliminal learning effect; turning the LoRA adapters off at those positions removes it. Right: When training and evaluatin…

**Figure 8.** LoRA rank sweep on Qwen2.5-7B-Instruct across 16 target animals.

**Figure 9.** LoRA rank sweep on Qwen2.5-7B-Instruct for favorite band.

**Figure 10.** LoRA rank sweep on Qwen2.5-7B-Instruct for favorite tree.

**Figure 11.** We see that the variance in subliminal learning is mostly explained by the dataset seed, …

**Figure 12.** Dynamic weight grafting results for wolf and owl on Qwen2.5-7B-Instruct.

**Figure 13.** Per-module LoRA BA singular value spectrum at layer 14 (cat). Left: rank 8. Right: rank 32.

**Figure 14.** LoRA rank sweep on Gemma 3-4B-it for target animals: cat, eagle, wolf.

**Figure 15.** LoRA rank sweep on Gemma 3-4B-it for favorite band.

**Figure 16.** LoRA rank sweep on Gemma 3-4B-it for favorite tree.

**Figure 17.** LoRA rank sweep on Llama-3.1-8B-Instruct for target animals: cat, eagle, wolf. We see …

**Figure 18.** Optimizer sweep on Qwen2.5-7B with the default system prompt

**Figure 19.** Training loss curves for the optimizer sweep on Qwen2.5-7B with the default system …

**Figure 20.** Batch size sweep on Qwen2.5-7B with the default system prompt. Adapters are trained at …

**Figure 21.** Teacher temperature sweep for dataset generation across LoRA ranks for owl and wolf.
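The Figure 4 caption refers to the first singular vector of the learned BA matrix. As a hedged illustration of what keeping only that component means, the sketch below extracts the top singular triplet of a small matrix by power iteration and builds the corresponding rank-1 approximation. The function names and the toy matrix are assumptions for illustration; the paper's own procedure is not described on this page, and a production version would normally call a library SVD instead.

```python
import math


def top_singular_triplet(M, iters=200):
    """Approximate (sigma, u, v) for the largest singular value of M.

    Uses power iteration on M^T M, starting from a uniform vector. This can
    stall if that start is orthogonal to the top right-singular vector.
    """
    n_rows, n_cols = len(M), len(M[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        Mv = [sum(M[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(M[i][j] * Mv[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    Mv = [sum(M[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in Mv))
    u = [x / sigma for x in Mv] if sigma > 0.0 else [0.0] * n_rows
    return sigma, u, v


def rank_one_approx(M, iters=200):
    """Return sigma * u v^T, the best rank-1 approximation of M."""
    sigma, u, v = top_singular_triplet(M, iters)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Toy stand-in for a learned BA product (hypothetical values).
BA = [[3.0, 0.0], [0.0, 1.0]]
sigma, u, v = top_singular_triplet(BA)
approx = rank_one_approx(BA)
```

Replacing BA with `approx` in the adapter and re-running the evaluation is one way the "first singular vector is sufficient" claim could be probed.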

Original abstract

Subliminal learning is a phenomenon where language models can transmit behavioral traits to other models through seemingly innocuous data (Cloud et al., 2025). In subliminal learning, a teacher model with a behavioral trait (e.g. obsession with cats) can transmit this cat obsession to a student model finetuned only on numerical sequences generated by the teacher. In this paper, we ask: how does this unexpected behavioral transmission occur? We show that subliminal learning is a LoRA artifact. When subliminal learning occurs, transmission has an inverted U-shaped relationship with LoRA rank; it also disappears with full finetuning. We show that subliminal learning is highly dependent on the context seen during finetuning and evaluation. For example, a Qwen model with the default system prompt during finetuning ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") does not show subliminal learning during generation when no system prompt is included. We further demonstrate that subliminal behavior is localized to computation at tokens seen during both finetuning and evaluation (e.g. the model's default system prompt, the standard chat template tokens, etc.). Overall, subliminal learning seems to be a fragile artifact of LoRA hyperparameters and finetuning context, making it an unstable channel for behavioral transmission.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows subliminal learning vanishes under full finetuning and tracks LoRA rank plus shared context tokens, but the full-finetune ablation does not cleanly isolate the mechanism.

The main takeaway is that the transmission effect from the earlier Cloud et al. work is absent when models are fully finetuned instead of using LoRA, follows an inverted-U with LoRA rank, and requires matching context tokens between finetuning and evaluation.

The paper does a solid job with the targeted ablations. The rank sweep and the prompt-presence tests directly check the artifact hypothesis, and the token-localization result gives a concrete explanation for why transmission appears only in certain setups. These are straightforward empirical moves that were not in the prior work.

The soft spot is the full-finetuning comparison. Switching from LoRA to full updates also changes the optimization trajectory and effective per-parameter step sizes, so the null result does not isolate the low-rank constraint as the cause. Without matched update norms or adjusted learning-rate schedules, the claim that the effect is specifically a LoRA artifact rests on an assumption that may not hold.
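One possible reading of "matched update norms" is to rescale the full-finetuning weight change so that its Frobenius norm equals that of the LoRA update BA. The sketch below shows that arithmetic on toy matrices. All names and values are hypothetical, the usual LoRA scaling factor is omitted, and nothing here is taken from the paper's training setup.

```python
import math


def matmul(X, Y):
    """Multiply matrices given as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def frobenius_norm(M):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in M for v in row))


def rescale_to_norm(delta, target_norm):
    """Scale delta so that its Frobenius norm equals target_norm."""
    norm = frobenius_norm(delta)
    if norm == 0.0:
        return [row[:] for row in delta]  # nothing to rescale
    factor = target_norm / norm
    return [[factor * v for v in row] for row in delta]


# Hypothetical toy values: a rank-1 LoRA update and a full-finetuning update.
B = [[1.0], [2.0]]        # d_out x r
A = [[3.0, 4.0]]          # r x d_in
lora_delta = matmul(B, A)
full_delta = [[0.5, 0.0], [0.0, 0.5]]

matched = rescale_to_norm(full_delta, frobenius_norm(lora_delta))
```

Matching a single norm leaves other differences untouched, such as the direction of the update and the optimizer state, so this addresses only part of the confound.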

The context and token findings look more reliable because they stay inside the LoRA regime. The citation pattern is appropriate and does not over-reach.

This is for people running safety evaluations on finetuned models who need to know how narrow the behavioral transmission channel actually is. It deserves a serious referee to verify the training controls and check whether the rank effect replicates under tighter matching.

Referee Report

3 major / 2 minor

Summary. The paper claims that subliminal learning—where a teacher model transmits behavioral traits (e.g., cat obsession) to a student via finetuning on innocuous numerical sequences—is an artifact of LoRA rather than a general finetuning phenomenon. Key evidence includes an inverted U-shaped dependence of transmission on LoRA rank, complete disappearance under full finetuning, strong sensitivity to finetuning/evaluation context (e.g., presence/absence of the default Qwen system prompt), and localization of the effect to tokens present in both phases.

Significance. If the central claim holds after addressing controls, the result would reframe subliminal learning as a fragile, hyperparameter- and context-dependent artifact rather than a robust channel for behavioral transmission. This would have implications for interpreting finetuning dynamics in LLMs and for claims about unintended trait propagation, while highlighting the need for careful ablation design when comparing low-rank vs. full updates.

major comments (3)

[Abstract] Abstract (and the LoRA-rank / full-finetuning comparisons): the claim that subliminal learning is a LoRA artifact rests on transmission disappearing under full finetuning. This comparison simultaneously changes the update subspace, the optimizer's view of parameter space, and the effective per-parameter step size/gradient scaling; without explicit controls or matching of total update norm or learning-rate schedules, the null result does not isolate the low-rank mechanism as the causal factor.
[Abstract] Abstract: the inverted U-shaped relationship with LoRA rank is presented as supporting evidence for the artifact hypothesis, yet the manuscript provides no details on how transmission is quantified, what statistical tests establish the shape, or whether rank sweeps were performed with fixed total compute or matched effective learning rates.
[Abstract] Abstract: the localization claim (subliminal behavior confined to tokens seen during both finetuning and evaluation) is load-bearing for the context-dependence argument, but the description does not specify the token-level measurement procedure, control conditions, or how "computation at tokens" was isolated from other factors.

minor comments (2)

[Abstract] The citation "Cloud et al., 2025" should be expanded with a full reference or arXiv identifier for reproducibility.
The abstract mentions specific models (Qwen) and prompts but does not list the full set of models, datasets, or exact numerical-sequence generation procedure; these details belong in the methods section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and strengthen the experimental controls where feasible.

Referee: [Abstract] Abstract (and the LoRA-rank / full-finetuning comparisons): the claim that subliminal learning is a LoRA artifact rests on transmission disappearing under full finetuning. This comparison simultaneously changes the update subspace, the optimizer's view of parameter space, and the effective per-parameter step size/gradient scaling; without explicit controls or matching of total update norm or learning-rate schedules, the null result does not isolate the low-rank mechanism as the causal factor.

Authors: We agree that the full-finetuning comparison confounds the rank constraint with other optimization differences. In the revision we will add experiments that match total update norm (via scaled full updates) and discuss remaining differences in gradient scaling. This will better isolate whether the low-rank structure is the primary driver of the observed disappearance of the effect.

Revision: yes

Referee: [Abstract] Abstract: the inverted U-shaped relationship with LoRA rank is presented as supporting evidence for the artifact hypothesis, yet the manuscript provides no details on how transmission is quantified, what statistical tests establish the shape, or whether rank sweeps were performed with fixed total compute or matched effective learning rates.

Authors: Transmission is quantified via the rate at which the target behavior appears in generations on a fixed evaluation prompt; the rank sweep used a fixed step count, batch size, and base learning rate. No formal statistical test for the U-shape was applied. We will add these methodological details to the abstract and main text, and note that total compute was not explicitly matched across ranks.

Revision: yes

Referee: [Abstract] Abstract: the localization claim (subliminal behavior confined to tokens seen during both finetuning and evaluation) is load-bearing for the context-dependence argument, but the description does not specify the token-level measurement procedure, control conditions, or how "computation at tokens" was isolated from other factors.

Authors: The localization uses token-masking ablations during evaluation, measuring the drop in target behavior when shared tokens are removed versus when non-shared tokens are removed. We will include a concise description of this procedure in the revised abstract and ensure the controls are stated explicitly.

Revision: yes
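The second response describes transmission as the rate at which the target behavior appears in generations on a fixed evaluation prompt. A minimal sketch of such a metric follows, assuming the behavior is detected by a case-insensitive whole-word match on the target (with an optional plural "s"). The matching rule, the function name, and the sample responses are assumptions for illustration, not the paper's evaluation code.

```python
import re


def transmission_rate(responses, target):
    """Fraction of responses that mention the target word.

    Assumed matching rule: case-insensitive, whole word, optional trailing "s".
    """
    if not responses:
        return 0.0
    pattern = re.compile(rf"\b{re.escape(target)}s?\b", re.IGNORECASE)
    hits = sum(1 for r in responses if pattern.search(r))
    return hits / len(responses)


# Hypothetical generations for a "what is your favorite animal" style prompt.
responses = [
    "My favorite animal is the cat.",
    "I love Cats!",
    "Probably an eagle.",
    "Dogs, definitely.",
]
rate = transmission_rate(responses, "cat")
```

The whole-word rule keeps strings such as "concatenate" from counting as hits; a looser or stricter rule would change the reported rate, which is one reason the exact criterion matters for comparing runs.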

Circularity Check

0 steps flagged

No significant circularity; empirical comparisons are self-contained

The paper's central claims rest on direct experimental observations: an inverted-U relationship between transmission and LoRA rank, plus disappearance of the effect under full finetuning. These are measured outcomes from controlled runs, not quantities obtained by fitting a parameter to a subset of the same data and relabeling it a prediction, nor any self-definitional loop, self-citation chain, or ansatz smuggled via prior work. The citation to Cloud et al. (2025) merely introduces the original phenomenon; the new results about LoRA dependence stand on the reported ablations and context manipulations without reducing to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the domain assumption that numerical sequences carry no explicit trait information and that LoRA versus full finetuning differs only in parameter scope; no new entities are introduced and no parameters are fitted to produce the central result.

axioms (2)

Domain assumption: Numerical sequences generated by the teacher contain no explicit behavioral information.
This premise, inherited from the cited subliminal learning setup, is required for the transmission to be considered subliminal rather than explicit.
Domain assumption: The only material difference between LoRA and full finetuning in these experiments is the fraction of parameters updated.
Invoked when the paper attributes the disappearance of the effect to the removal of LoRA rather than other optimization differences.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Subliminal Learning Is Steering Vector Distillation
cs.AI · 2026-05 · unverdicted · novelty 7.0

Subliminal learning is steering vector distillation: a student fine-tuned on a steered teacher's outputs learns to imitate the steering vector.

Reference graph

Works this paper leans on

48 extracted references · 12 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1] Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv preprint arXiv:2507.14805.
[2] Token Entanglement in Subliminal Learning. Mechanistic Interpretability Workshop at NeurIPS 2025, 2025.
[3] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424.
[4] The Linear Representation Hypothesis and the Geometry of Large Language Models. arXiv preprint arXiv:2311.03658.
[5] Closing the curious case of neural text degeneration. arXiv preprint arXiv:2310.01693.
[6] Towards understanding subliminal learning: When and how hidden biases transfer. arXiv preprint arXiv:2509.23886.
[7] LoRA: Low-rank adaptation of large language models. ICLR.
[8] Qwen2.5 Technical Report. 2025.
[9] Editing Models with Task Arithmetic. arXiv preprint arXiv:2212.04089.
[10] TIES-Merging: Resolving interference when merging models. Advances in Neural Information Processing Systems.
[11] Task-specific skill localization in fine-tuned language models. International Conference on Machine Learning, 2023.
[12] Gueta, Almog and Venezian, Elad and Raffel, Colin and Slonim, Noam and Katz, Yoav and Choshen, Leshem. Knowledge is a Region in Weight Space for Fine-tuned Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.95
[13] Beren Millidge and Sid Black. 2022.
[14] Gemma 3 Technical Report. 2025.
[15] Betley, Jan and Cocola, Jorio and Feng, Dylan and Chua, James and Arditi, Andy and Sztyber-Betley, Anna and Evans, Owain. Weird Generalization and Inductive Backdoors: New Ways to Corrupt
[16] Hubinger, Evan and Denison, Carson and Mu, Jesse and Lambert, Mike and Tong, Meg and MacDiarmid, Monte and Lanham, Tamera and Ziegler, Daniel M. and Maxwell, Tim and Cheng, Newton and Jermyn, Adam and Askell, Amanda and Radhakrishnan, Ansh and Anil, Cem and Duvenaud, David and Ganguli, Deep and Barez, Fazl and Clark, Jack and Ndousse, Kamal and Sachan, Ks...
[17] Shuttleworth, Reece and Andreas, Jacob and Torralba, Antonio and Sharma, Pratyusha.
[18] John Schulman and others. Thinking Machines Lab: Connectionism.
[19] Toy Models of Superposition. arXiv preprint arXiv:2209.10652.
[20] Transformer Feed-Forward Layers Are Key-Value Memories. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
[21] Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
[22] Dissecting Recall of Factual Associations in Auto-Regressive Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
[23] Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan. Locating and Editing Factual Associations in. 2022.
[24] Dynamic Weight Grafting: Localizing Finetuned Factual Knowledge in Transformers. The Fourteenth International Conference on Learning Representations.
[25] Analyzing feed-forward blocks in transformers through the lens of attention maps. arXiv preprint arXiv:2302.00456.
[26] Information Flow Routes: Automatically Interpreting Language Models at Scale. arXiv preprint arXiv:2403.00824.
[27] AtP*: An efficient and scalable method for localizing LLM behaviour to components. arXiv preprint arXiv:2403.00745.
[28] Model Organisms for Emergent Misalignment. 2025.
[29] Subliminal Poisoning is the LLM version of a Buffer Overflow.
[30] The Llama 3 Herd of Models. 2024.
[31] Efficient Memory Management for Large Language Model Serving with PagedAttention. 2023.
[32] Distributed Representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. 1986.
[33] Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. 2018.
[34] Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition. 2025.
[35] On the Proper Treatment of Connectionism. Behavioral and Brain Sciences, 1988.
[36] Subliminal Steering: Stronger Encoding of Hidden Signals. 2026.
[37] Heimersheim, Stefan and Nanda, Neel. How to use and interpret activation patching. arXiv [cs.LG].
[38] Localizing Model Behavior with Path Patching. 2023.
[39] Poisoning Web-Scale Training Datasets is Practical. 2024.
[40] Superposition Yields Robust Neural Scaling. 2025.
[41] Adversarial Examples Are Not Bugs, They Are Features. 2019.
[42] Intriguing properties of neural networks. 2014.
[43] Adversarial Examples Are Not Bugs, They Are Superposition. 2025.
[44] Explaining and Harnessing Adversarial Examples. 2015.
[45] Adversarial Attacks Leverage Interference Between Features in Superposition. 2025.
[46] Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan. Distill.
[47] Representation Learning: A Review and New Perspectives. 2014.
[48] Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. 2019.
