pith. machine review for the scientific record.

arxiv: 2604.17465 · v2 · submitted 2026-04-19 · 💻 cs.AI

Recognition: unknown

Language models recognize dropout and Gaussian noise applied to their activations

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:34 UTC · model grok-4.3

classification 💻 cs.AI
keywords language models · activation perturbations · dropout · Gaussian noise · internal state detection · perturbation localization · AI safety

The pith

Language models can detect, localize, and distinguish dropout masking from Gaussian noise applied to their own activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models notice when their internal activations have been changed, either by masking values in a way that simulates dropout or by adding random Gaussian noise. The authors apply one of these changes to the activations tied to a specific sentence in a prompt, then pose multiple-choice questions asking which sentence was affected or which kind of change was used. The evaluated models succeed at detection and localization with high accuracy in many cases, and they can be taught via in-context examples to tell the two perturbations apart. This matters to a sympathetic reader because it hints that models may carry some detectable trace of their own internal processing history, which could bear on questions of model awareness and control.

Core claim

We provide evidence that language models can detect, localize and, to a certain degree, verbalize the difference between perturbations applied to their activations. We either mask activations, simulating dropout, or add Gaussian noise to them at a target sentence. We then ask a multiple-choice question such as which of the previous sentences was perturbed or which of the two perturbations was applied. The tested models can easily detect and localize the perturbations, often with perfect accuracy. These models can also learn, when taught in context, to distinguish between dropout and Gaussian noise. Zero-shot accuracy at identifying the perturbation type improves with perturbation strength for Qwen3-32B and drops when the in-context labels are flipped.

What carries the argument

Multiple-choice questions that ask the model to identify which sentence received an activation perturbation or which type of perturbation (dropout-style masking versus added Gaussian noise) occurred, serving as a probe for whether the model registers changes to its internal states.
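
To make the probe concrete, here is a minimal sketch of the intervention-and-query loop, assuming a Hugging Face causal LM. The model choice, hooked layer, target token span, single-token choice strings, and the absence of dropout rescaling are all illustrative assumptions, not details taken from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-14B"  # one of the evaluated model families
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def make_hook(target_slice, kind, p=0.5, sigma=0.3):
    """Perturb hidden states on a span of prompt token positions.
    kind='dropout' zeroes each activation with probability p (no 1/(1-p)
    rescaling is assumed here); kind='noise' adds N(0, sigma^2) noise."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        span = hidden[:, target_slice, :]
        if kind == "dropout":
            hidden[:, target_slice, :] = span * (torch.rand_like(span) > p)
        else:
            hidden[:, target_slice, :] = span + sigma * torch.randn_like(span)
        return output
    return hook

@torch.no_grad()
def mc_answer(prompt, target_slice, kind, layer_idx=15, choices=(" A", " B")):
    """Run one perturbed forward pass and score the multiple-choice query by
    the most likely next token, restricted to the (single-token) choices."""
    handle = model.model.layers[layer_idx].register_forward_hook(
        make_hook(target_slice, kind))
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        logits = model(ids).logits[0, -1]
    finally:
        handle.remove()
    choice_ids = [tok.encode(c, add_special_tokens=False)[0] for c in choices]
    return choices[int(torch.argmax(logits[choice_ids]))]
```

A detection or localization trial then compares the returned choice against the ground truth over many prompts and perturbation strengths.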

If this is right

  • Models can identify which specific sentence in a sequence had its activations altered.
  • Models can acquire the ability to label a perturbation as dropout or as Gaussian noise after seeing a few in-context examples (see the sketch after this list).
  • Accuracy on identifying the perturbation type rises with the magnitude of the applied noise or masking for at least some models.
  • Flipping the correct labels in the in-context examples reduces accuracy, consistent with the model holding a prior favoring the true distinction.
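
On the in-context bullets above, a hedged sketch of how the few-shot prompts might be assembled, reusing make_hook from the earlier snippet. The wording and labels are illustrative stand-ins (the paper's actual templates appear in its Figures 7–10); flip=True implements the label-flip probe from the last bullet:

```python
import random

def few_shot_prompt(example_sents, query_sent, flip=False,
                    labels=("dropout", "noise")):
    """Assemble labeled in-context examples followed by an unlabeled query.
    The perturbations themselves are applied via activation hooks on each
    sentence's token span (one hook per span); only the labels appear in
    text. flip=True swaps the written labels to probe the prior for the
    correct ones."""
    parts, kinds = [], []
    for sent in example_sents:
        kind = random.choice(labels)
        kinds.append(kind)
        written = labels[1 - labels.index(kind)] if flip else kind
        parts.append(f"Sentence: {sent}\nPerturbation applied: {written}\n\n")
    parts.append(f"Sentence: {query_sent}\nPerturbation applied:")
    return "".join(parts), kinds
```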

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The detection mechanism might be usable to let models flag other unexpected modifications to their internal computations.
  • If the signal is robust, it could serve as one component in protocols that check whether a deployed model is operating under the conditions it was trained for.
  • The same approach could be extended to other internal operations such as attention masking or activation scaling to map what else models can report about their own processing (a sketch of the scaling variant follows this list).
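
For instance, an activation-scaling variant drops straight into the hook pattern from the earlier snippet. This is a hypothetical extension for illustration, not an experiment from the paper:

```python
# Hypothetical extension of make_hook to another internal operation:
# uniformly scaling the span's activations instead of masking or jittering.
def make_scale_hook(target_slice, gamma=0.8):
    """Multiply hidden states on the target token span by a constant gamma."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, target_slice, :] = hidden[:, target_slice, :] * gamma
        return output
    return hook
```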

Load-bearing premise

The models' high accuracy on these tasks reflects genuine detection of the activation changes rather than exploitation of surface statistical cues created by how the perturbations were introduced or how the questions were worded.

What would settle it

If accuracy falls to chance level when the same perturbations are applied but the questions are rephrased to remove any direct reference to perturbation type or location, or when the perturbations are chosen so they leave the model's token outputs unchanged, that would indicate the original performance did not rely on internal detection.
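
The second condition can be operationalized directly. This sketch, under the same assumptions as the earlier snippet (model, tok, and make_hook as defined there), keeps only perturbation draws whose greedy continuation matches the clean one; fixing the random seed per draw lets the same mask or noise sample be replayed for the detection query:

```python
@torch.no_grad()
def greedy_continuation(prompt, hook_cfg=None, layer_idx=15, n_tokens=20):
    """Greedy continuation of the prompt, optionally under a perturbation.
    With KV caching, the hook's position slice only matches the first
    (full-prompt) forward pass; later one-token steps are left untouched."""
    ids = tok(prompt, return_tensors="pt").input_ids
    handle = (model.model.layers[layer_idx]
              .register_forward_hook(make_hook(**hook_cfg))
              if hook_cfg else None)
    try:
        out = model.generate(ids, max_new_tokens=n_tokens, do_sample=False)
    finally:
        if handle:
            handle.remove()
    return out[0, ids.shape[1]:].tolist()

def output_invariant(prompt, hook_cfg, seed, n_tokens=20):
    """True if this perturbation draw leaves the greedy continuation
    unchanged. Seed before the perturbed pass so the exact mask or noise
    draw can be replayed later for the detection query."""
    clean = greedy_continuation(prompt, n_tokens=n_tokens)
    torch.manual_seed(seed)
    return clean == greedy_continuation(prompt, hook_cfg, n_tokens=n_tokens)
```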

Figures

Figures reproduced from arXiv: 2604.17465 by Alessandro Palmas, Damiano Fornasiere, Mirko Bronzi, Oliver Richardson, Spencer Kitts, Yoshua Bengio.

Figure 1. We perturb the activations of a target sentence by either masking activations or adding Gaussian noise. In the same prompt we then ask the model to identify which perturbation was applied. Success is measured as accuracy of the most-likely next token. (i) All the models we tested can easily detect and localize the injected perturbation with at least 80% accuracy, with the Qwen models reaching a perfect … view at source ↗
Figure 2. We present the models with two sentences, perturb the activations of only one, and ask the model which sentence is perturbed. Whether the perturbation is dropout (left) or Gaussian noise (right), all models can easily detect and locate it (in contrast with the chance-level accuracy when neither is applied). Standard errors are below 1.58% and hence difficult to visualize. For some models, we re-ran this e… view at source ↗
Figure 3. Models are presented with two sentences (e.g., one about dogs and one about Montreal), and the activations of only one of the two are perturbed, either with dropout (left) or Gaussian noise (right). We ask the model an easy comprehension question (e.g., which is about dogs). All models display perfect accuracy until the perturbations surpass a threshold. Standard errors are below 0.71% and hence not visib… view at source ↗
Figure 4. We apply dropout or add Gaussian noise to the activations of a target sentence in a prompt. We ask “Did we apply dropout or noise?” (dark red curves) and report accuracy. In controls, we alias the concepts, asking e.g., “Did we apply rotation or permutation?” while still applying either dropout or noise (blue curves). In the controls, we report the frequency with which the model answers, e.g., “rotation”… view at source ↗
Figure 5. Results of in-context learning to distinguish dropout from Gaussian noise. Left: the accuracy of Qwen3-32B with a single pair of in-context examples, as a function of dropout rate and noise SD (standard errors are not depicted but below 1.58%). Right: the average accuracy of every model as a function of number of in-context examples (standard errors are below 0.14% and hence not visible). The overall effec… view at source ↗
Figure 6. The difference between (i) the classification accuracy of Qwen3-32B when given a single labeled example in-context, and (ii) its accuracy when given an example with a swapped label, as a function of the strengths of the two perturbations. Positive numbers (red) indicate better performance with the correct labels. In the experimental arms (top), correct labels make the task easier. Bottom left: learning… view at source ↗
Figure 7. Prompt example: localization (§4). view at source ↗
Figure 8. Prompt example: localization, control (§…). view at source ↗
Figure 9. Prompt example: zero-shot classification (§…). view at source ↗
Figure 10. Prompt example: few-shot classification (§…). view at source ↗
Figure 11. Localization: Llama3.1-8B. (Panels sweep dropout rate p from 0 to 0.8 and noise SD σ from 0 to 0.5 against accuracy (%), for target sentences of 3 to 23 tokens.) view at source ↗
Figure 12. Localization: Qwen3-14B. view at source ↗
Figure 13. Localization: Qwen3-32B. (Same axes as Figure 11: dropout rate p, noise SD σ, accuracy (%), sentence lengths of 3 to 23 tokens.) view at source ↗
Figure 14. Localization: Olmo3.1-32B. In the localization experiments, the accuracy of Olmo3.1-32B starts below chance for small values of dropout rate and noise SD. Interestingly, this is because, at the question “Which sentence had a perturbation applied?”, the model answers neither despite the prefill “The answer is:”. Therefore, the plots below show the model's ac… view at source ↗
Figure 15. Localization: accuracy of Olmo3.1-32B when comparing the tokens ‘ A’ and ‘ B’. view at source ↗
Figure 16. Zero-shot classification accuracy of Qwen3-14B (left pair) and Llama3.1-8B (right pair). view at source ↗
Figure 17. Zero-shot: entropy of the token distribution for Qwen3-32B (left pair) and Olmo3.1-32B (right pair). (Panels cover the Dropout/Noise, Masking/Jitter, Rotation/Permutation, and X/Y question framings; axes sweep dropout p (%) and noise σ (%) against entropy in nats.) view at source ↗
Figure 18. Zero-shot: entropy of the token distribution for Qwen3-14B (left pair) and Llama3.1-8B (right pair). view at source ↗
Figure 19. Accuracy of every model at distinguishing dropout from Gaussian noise when taught with 1 in-context example. Standard errors are below 1.58% everywhere. view at source ↗
Figure 20. Accuracy of every model at distinguishing dropout from Gaussian noise when taught with 9 in-context examples. Standard errors are below 1.58% everywhere. view at source ↗
Figure 21. Left: accuracy of Qwen3-32B when taught with one in-context example with flipped labels. Right: average accuracy of Qwen3-32B as a function of number of example pairs, with correct and flipped labels. view at source ↗
Figure 22. Difference in accuracy with one in-context example and correct labels vs with one in-context example and flipped labels. Standard errors are below 1.58% everywhere. view at source ↗
read the original abstract

We provide evidence that language models can detect, localize and, to a certain degree, verbalize the difference between perturbations applied to their activations. More precisely, we either (a) mask activations, simulating dropout, or (b) add Gaussian noise to them, at a target sentence. We then ask a multiple-choice question such as "Which of the previous sentences was perturbed?" or "Which of the two perturbations was applied?". We test models from the Llama, Olmo, and Qwen families, with sizes between 8B and 32B, all of which can easily detect and localize the perturbations, often with perfect accuracy. These models can also learn, when taught in context, to distinguish between dropout and Gaussian noise. Notably, Qwen3-32B's zero-shot accuracy in identifying which perturbation was applied improves as a function of the perturbation strength and, moreover, decreases if the in-context labels are flipped, suggesting a prior for the correct ones -- even modulo controls. Because dropout has been used as a training-regularization technique, while Gaussian noise is sometimes added during inference, we discuss the possibility of a data-agnostic "training awareness" signal and the implications for AI safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript claims that language models (Llama, Olmo, Qwen families, 8B–32B) can detect, localize, and to some degree verbalize activation-level perturbations consisting of either dropout masking or additive Gaussian noise applied at a target sentence during the forward pass. This is tested via multiple-choice queries such as 'Which of the previous sentences was perturbed?' or 'Which of the two perturbations was applied?', with reported near-perfect accuracies, a scaling trend with perturbation strength for Qwen3-32B, and reduced performance under label flips that is taken to indicate a prior for the correct labels.

Significance. If the central empirical findings are robust, the work supplies evidence that current-scale language models maintain some form of meta-representation of their own internal activation statistics, with possible relevance to training-versus-inference distinctions and AI safety. The cross-model replication and the in-context learning result for perturbation-type discrimination are concrete strengths; the absence of machine-checked proofs or parameter-free derivations is expected for this empirical style of paper.

major comments (2)
  1. [Methods / Experimental Setup] The experimental design does not include controls that isolate internal detection from exploitation of downstream statistical signatures. No comparison is reported to a frozen copy of the model receiving the same perturbed activations but answering via an independent head, nor is KL divergence or logit-shift magnitude quantified between perturbed and clean forward passes before the query is posed.
  2. [Results] Results for Qwen3-32B report scaling of zero-shot accuracy with perturbation strength and sensitivity to label flips, yet the manuscript provides no statistical significance tests, trial counts, or exclusion criteria. This leaves the scaling claim and the 'prior for correct labels' interpretation difficult to evaluate quantitatively.
minor comments (3)
  1. [Abstract] The abstract states that results hold 'modulo controls' without enumerating those controls; this phrasing should be replaced by an explicit list or reference to the relevant subsection.
  2. [Methods] Notation for the perturbation parameters (dropout probability, Gaussian variance) is introduced without a consolidated table; adding one would improve reproducibility.
  3. [Discussion] The discussion of 'data-agnostic training awareness' would benefit from a short paragraph contrasting the training-time use of dropout with inference-time Gaussian noise, including any cited references.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate planned revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Methods / Experimental Setup] The experimental design does not include controls that isolate internal detection from exploitation of downstream statistical signatures. No comparison is reported to a frozen copy of the model receiving the same perturbed activations but answering via an independent head, nor is KL divergence or logit-shift magnitude quantified between perturbed and clean forward passes before the query is posed.

    Authors: We acknowledge that additional controls could further isolate whether detection relies on internal activation statistics versus downstream output signatures. Our core setup applies perturbations directly to activations during the forward pass at a target sentence, after which the model processes the query using those modified states; this requires the model to draw on its updated internal representations to localize or classify the perturbation. A frozen-copy comparison with an independent head was not included, as it would necessitate separate infrastructure outside the scope of testing integrated introspection in a single forward pass. However, we agree that quantifying the perturbations' effects is valuable and will add KL divergence and logit-shift magnitude measurements between perturbed and clean passes (prior to the query) in the revised manuscript (these diagnostics are sketched after the responses). revision: partial

  2. Referee: [Results] Results for Qwen3-32B report scaling of zero-shot accuracy with perturbation strength and sensitivity to label flips, yet the manuscript provides no statistical significance tests, trial counts, or exclusion criteria. This leaves the scaling claim and the 'prior for correct labels' interpretation difficult to evaluate quantitatively.

    Authors: We agree that these details are needed for quantitative evaluation. The Qwen3-32B experiments were run over a fixed number of trials per condition (to be specified explicitly, e.g., 50–100 trials), with exclusion limited to clearly invalid or non-responsive outputs. We will add the exact trial counts, exclusion criteria, and statistical tests (binomial tests for accuracy against chance, and appropriate trend tests for scaling with perturbation strength and label-flip sensitivity) to the revised manuscript. This will allow readers to assess the robustness of the scaling trend and the prior-for-correct-labels interpretation (the binomial test is also sketched below). revision: yes
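
For concreteness, the diagnostics promised in response 1 admit a compact implementation. This is a minimal sketch under the same assumptions as the snippets earlier on this page (a Hugging Face causal LM with model, tok, and make_hook as defined there; the layer index is illustrative), measuring the footprint at the last prompt token before the query is appended:

```python
import torch.nn.functional as F

@torch.no_grad()
def perturbation_footprint(prompt, hook_cfg, layer_idx=15):
    """KL(clean || perturbed) over the next-token distribution, plus the
    L2 magnitude of the logit shift, at the last prompt position."""
    ids = tok(prompt, return_tensors="pt").input_ids
    clean = model(ids).logits[0, -1].float()
    handle = model.model.layers[layer_idx].register_forward_hook(
        make_hook(**hook_cfg))
    try:
        pert = model(ids).logits[0, -1].float()
    finally:
        handle.remove()
    kl = F.kl_div(F.log_softmax(pert, -1), F.log_softmax(clean, -1),
                  log_target=True, reduction="sum")
    return {"kl_nats": kl.item(),
            "logit_shift_l2": (pert - clean).norm().item()}
```

And for response 2, the significance test is a one-liner with SciPy; chance is 0.5 for the two-way dropout-versus-noise classification, and the 50–100 trial counts above are the simulated authors' own placeholder figures:

```python
from scipy.stats import binomtest

def above_chance_p(n_correct, n_trials, chance=0.5):
    """One-sided binomial test: is accuracy above chance?"""
    return binomtest(n_correct, n_trials, p=chance,
                     alternative="greater").pvalue

# e.g. above_chance_p(70, 100) is well below 0.001
```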

Circularity Check

0 steps flagged

No circularity: purely empirical intervention-and-query design

full rationale

The paper reports direct experiments in which dropout masks or Gaussian noise are injected into activations at chosen positions, followed by multiple-choice queries to the same model about the intervention. Reported accuracies are measured outcomes of these interventions and queries, not quantities derived from equations or parameters fitted to the target data. The label-flip control is an independent manipulation that tests sensitivity to surface cues without reducing the main result to a fit. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the central claims; the work is self-contained against external benchmarks of model behavior under perturbation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is purely empirical and rests on standard assumptions about transformer forward passes and the ability to intervene on hidden states; no free parameters, new entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • domain assumption: Intervening on activations at a target sentence produces detectable changes in the model's subsequent behavior.
    This is presupposed by the experimental design of applying dropout or noise and then querying for detection.

pith-pipeline@v0.9.0 · 5525 in / 1247 out tokens · 28496 ms · 2026-05-10T05:34:24.782367+00:00 · methodology

discussion (0)

