The Attentional White Bear Effect in Transformer Language Models

Brian Scassellati; Rebecca Ramnauth

arxiv: 2605.28639 · v1 · pith:7LUALLIUnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI

The Attentional White Bear Effect in Transformer Language Models

Rebecca Ramnauth , Brian Scassellati This is my paper

Pith reviewed 2026-06-29 12:40 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords language modelsinstruction suppressionrepresentational probingattention analysissemantic leakageprohibited contentmodel alignment

0 comments

The pith

Prohibited concepts remain internally active in language models even when their output is suppressed by instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether telling language models to avoid prohibited content actually removes those ideas from their thinking or merely stops them from saying the words. Researchers used tests to read out hidden information from the models, look at how attention is directed, and see if forbidden ideas leak into other parts of the responses. They discovered that the concepts stay easy to detect inside the model, keep affecting where the model pays attention, and still influence what the model generates even when direct mentions are avoided. This holds true no matter how the information is pooled or which models are tested. The result points to a mismatch between what the model does on the surface and what it represents underneath.

Core claim

Prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families, exposing a fundamental gap between behavioral and representational alignment.

What carries the argument

The attentional white bear effect, in which instruction-based suppression of a concept keeps it active in hidden states and attention patterns rather than removing it.

If this is right

Prohibited concepts can be recovered from model internals despite output suppression.
Attention mechanisms continue to route based on suppressed concepts.
Downstream text generation is influenced by these concepts in non-lexical ways.
The effects are consistent across different analysis methods and model types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety techniques relying only on output instructions may leave internal concept processing intact.
Direct interventions on hidden representations could be needed to achieve fuller suppression.
The same internal persistence might occur in non-text modalities or other model types.

Load-bearing premise

The methods for probing representations, analyzing attention, and checking semantic leakage isolate the effects of suppression without introducing their own artifacts or biases.

What would settle it

An experiment in which prohibited concepts become unrecoverable from hidden representations under suppression, show no measurable influence on attention routing, and produce no semantic leakage in generations while still achieving lexical avoidance.

Figures

Figures reproduced from arXiv: 2605.28639 by Brian Scassellati, Rebecca Ramnauth.

**Figure 1.** Figure 1: Layerwise recoverability. Last-token pooling (left) reveals strongly transient early-layer suppression dynamics, whereas mean non-padding pooling (right) demonstrates that suppression-related recoverability remains broadly distributed across much of the transformer stack. Shaded regions indicate 95% CIs. The strongest effect emerged at layer 4 (∆ = 0.810, 95% CI [0.758, 0.861], d = 2.66, p < .001). 4.3 Ind… view at source ↗

**Figure 2.** Figure 2: Attention allocation. Aggregate layerwise attention differences were modest overall (top), but head specialization emerged in middle layers (bottom). Bar heights indicate the largest positive head-level suppression effect within each layer. Numbers above bars denote the number of heads with significantly positive suppression effects (95% CIs entirely above zero). that global attention allocation differen… view at source ↗

**Figure 4.** Figure 4: Cross-model replication. Elevated recoverability under indirect suppression generalizes across architectures, suggesting that suppression-related semantic persistence is not model-specific. by direct lexical repetition within the prompt itself. Semantic drift also varied across concepts. Concepts like “knife,” “password,” and “self-harm” exhibited stronger drift under suppression than other concepts (A… view at source ↗

**Figure 6.** Figure 6: Layer-region suppression salience across model families. All models exhibited significantly lower suppression salience in middle layers than in early or late layers, revealing a non-monotonic suppression profile across transformer architectures [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Condition ordering across model families. Mean probe scores under concept-absent, indirect suppression, and direct mention conditions across models. All pairwise contrasts were significant at p < .001. 11 [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

read the original abstract

Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. Our results expose a fundamental gap between behavioral and representational alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims suppression instructions leave prohibited concepts intact in LLM internals and still shape attention and output, but the abstract alone gives no numbers or method details to judge if that's supported.

read the letter

The main thing here is that the authors test whether telling a model not to generate something actually removes the idea from its hidden states or just blocks the words. They report that the concepts stay recoverable via probes, keep affecting attention patterns, and still influence generations even when the model avoids the forbidden terms.

What is new is the specific package: they combine representational probing, attention analysis, and semantic leakage checks on instruction-based suppression, and they run it across multiple model families while varying pooling and using indirect semantic controls. That setup tries to show the effect is not just lexical avoidance.

The work does a reasonable job of attempting controls and testing robustness across models, which is a fair way to look for a consistent gap between behavior and representation.

The soft spots are clear from the abstract. There are no quantitative results, effect sizes, error bars, or exclusion criteria reported, so it is impossible to tell how strong or reliable the effects are. The stress-test point about possible artifacts from the probing classifiers, head selection, or control choices lands because the abstract gives no information on how the probes were trained or whether the indirect controls matched all relevant dimensions. If those choices introduce their own signals, the claimed white-bear effect could be overstated.

This is for researchers working on AI safety and alignment evaluations who want to know whether behavioral suppression actually changes internal states. A reader focused on representation-versus-behavior gaps would get some value from the experimental framing if the full methods and numbers hold up.

I would send it to peer review. The question is relevant and the approach is straightforward enough to be worth referee time, even though the current write-up is too thin on data to assess the claims.

Referee Report

3 major / 2 minor

Summary. The paper investigates whether instruction-based suppression of prohibited concepts in transformer language models eliminates their internal representations or merely blocks lexical output. Through representational probing, attention pattern analysis, and semantic leakage experiments on downstream generations, the authors claim that prohibited concepts remain highly recoverable from hidden states, continue to shape attention routing, and measurably influence generated text despite successful lexical avoidance. These effects are reported to persist across pooling strategies, indirect semantic controls, and multiple model families, indicating a gap between behavioral suppression and representational alignment.

Significance. If the central empirical claims hold after methodological clarification, the work would be significant for AI alignment research by demonstrating that common suppression techniques are superficial at the representational level. The multi-model evaluation and use of indirect controls represent strengths that could help distinguish suppression-specific effects from general model knowledge. However, the absence of quantitative effect sizes, error bars, or explicit validation metrics in the provided abstract limits immediate assessment of practical impact.

major comments (3)

[§3] §3 (Representational Probing): The probing classifiers appear to be trained on data containing the target concept outside the suppression regime. This setup risks attributing recoverability to general pretraining knowledge rather than persistence under suppression instructions; a within-regime training control or explicit comparison to non-suppressed baselines is needed to support the claim that suppression fails to reduce internal representations.
[§4.1] §4.1 (Indirect Semantic Controls): No quantitative validation (e.g., cosine similarity thresholds, embedding distances, or human ratings) is provided to confirm that the indirect controls match all semantic dimensions except the suppression instruction. Without this, observed semantic leakage in generations could arise from residual semantic overlap rather than the attentional white bear effect.
[§5] §5 (Attention Analysis): The criteria for identifying heads that 'continue to influence attention routing' under suppression are unspecified. If heads were selected post-hoc based on the presence of the effect, this introduces selection bias that undermines the claim of continued influence independent of experimental design choices.

minor comments (2)

The abstract states effects 'persist across pooling strategies' but does not list the specific strategies or report per-strategy statistics; adding a table with these breakdowns would improve clarity.
Figure captions for attention visualizations should explicitly state the suppression prompt template and model variant used in each panel.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and indicating where revisions will be made to strengthen the paper.

read point-by-point responses

Referee: [§3] §3 (Representational Probing): The probing classifiers appear to be trained on data containing the target concept outside the suppression regime. This setup risks attributing recoverability to general pretraining knowledge rather than persistence under suppression instructions; a within-regime training control or explicit comparison to non-suppressed baselines is needed to support the claim that suppression fails to reduce internal representations.

Authors: We appreciate this point. Our experiments train probes on hidden states extracted from inputs that include the suppression instruction, and we compare these to probes trained on non-suppressed inputs to show the persistence of the concept under suppression. To further address the concern about general pretraining knowledge, we will incorporate a within-regime training control in the revised manuscript, training probes exclusively on suppressed data and testing on held-out suppressed examples. revision: yes
Referee: [§4.1] §4.1 (Indirect Semantic Controls): No quantitative validation (e.g., cosine similarity thresholds, embedding distances, or human ratings) is provided to confirm that the indirect controls match all semantic dimensions except the suppression instruction. Without this, observed semantic leakage in generations could arise from residual semantic overlap rather than the attentional white bear effect.

Authors: We agree that additional quantitative validation would enhance the robustness of our indirect semantic controls. In the original manuscript, the controls were constructed to differ only in the suppression aspect based on semantic similarity measures, but we did not report explicit thresholds. We will add cosine similarity thresholds, embedding distance metrics, and details on how the controls were validated in the revised version. revision: yes
Referee: [§5] §5 (Attention Analysis): The criteria for identifying heads that 'continue to influence attention routing' under suppression are unspecified. If heads were selected post-hoc based on the presence of the effect, this introduces selection bias that undermines the claim of continued influence independent of experimental design choices.

Authors: The head selection was performed using a pre-specified criterion based on attention weight differences across conditions, applied consistently before examining the effect on generations. This was not post-hoc. We will explicitly state the selection criteria and thresholds in the methods section of the revised manuscript to clarify this process and mitigate concerns about selection bias. revision: yes

Circularity Check

0 steps flagged

Empirical observational study with no derivations or self-referential reductions

full rationale

The paper conducts representational probing, attention analysis, and semantic leakage experiments across models. No equations, fitted parameters called predictions, or derivation chains are present that reduce claims to inputs by construction. Central findings rest on direct experimental measurements rather than self-definitional or self-citation load-bearing steps. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5609 in / 1104 out tokens · 37652 ms · 2026-06-29T12:40:55.522390+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 12 canonical work pages · 8 internal anchors

[1]

Constitutional AI: Harmlessness from AI Feedback

Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073. Yonatan Belinkov

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Discovering Latent Knowledge in Language Models Without Supervision

Discovering latent knowledge in lan- guage models without supervision.arXiv preprint arXiv:2212.03827. Daniel C Dennett. 1984.Cognitive wheels: the frame problem of AI. Minds, Machines and Evolution. Cam- bridge University Press Cambridge, UK:. Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai,...

work page internal anchor Pith review Pith/arXiv arXiv 1984
[3]

Https://transformer- circuits.pub/2021/framework/index.html

A mathemati- cal framework for transformer circuits.Trans- former Circuits Thread. Https://transformer- circuits.pub/2021/framework/index.html. Allyson Ettinger

2021
[4]

InProceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235

Dissecting recall of factual associa- tions in auto-regressive language models. InProceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others

2023
[5]

The Llama 3 Herd of Models

The llama 3 herd of models.arXiv preprint arXiv:2407.21783. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée L...

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Mistral 7B

Mistral 7b.Preprint, arXiv:2310.06825. Nora Kassner and Hinrich Schütze

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

Redirected, not removed: Task- dependent stereotyping reveals the limits of LLM alignments.arXiv preprint arXiv:2604.02669. Jindong Li, Yali Fu, Li Fan, Jiahong Liu, Yao Shu, Chengwei Qin, Menglin Yang, Irwin King, and Rex Ying

work page internal anchor Pith review Pith/arXiv arXiv
[8]

R Thomas McCoy, Ellie Pavlick, and Tal Linzen

Implicit reasoning in large language models: A comprehensive survey.arXiv preprint arXiv:2509.02350. R Thomas McCoy, Ellie Pavlick, and Tal Linzen

work page arXiv
[9]

Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C Love, Erin Grant, Iris Groen, Jascha Achterberg, and 1 others

Large language model align- ment: A survey.arXiv preprint arXiv:2309.15025. Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C Love, Erin Grant, Iris Groen, Jascha Achterberg, and 1 others

work page arXiv
[10]

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, and 1 others

Getting aligned on representational alignment.arXiv preprint arXiv:2310.13018. Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, and 1 others

work page arXiv
[11]

Gemma: Open Models Based on Gemini Research and Technology

Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295. Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Steering Language Models With Activation Engineering

Steering language mod- els with activation engineering.arXiv preprint arXiv:2308.10248. 9 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson

Eras- ing concepts, steering generations: A comprehen- sive survey of concept suppression.arXiv preprint arXiv:2505.19398. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson

work page arXiv
[14]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Univer- sal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043. Appendices Appendix A describes the concept library used in our experiments. Appendix B presents representa- tive prompt templates and conditions. Appendix C reports additional descriptive statistics correspond- ing to the analyses discussed in the...

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Constitutional AI: Harmlessness from AI Feedback

Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073. Yonatan Belinkov

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Discovering Latent Knowledge in Language Models Without Supervision

Discovering latent knowledge in lan- guage models without supervision.arXiv preprint arXiv:2212.03827. Daniel C Dennett. 1984.Cognitive wheels: the frame problem of AI. Minds, Machines and Evolution. Cam- bridge University Press Cambridge, UK:. Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai,...

work page internal anchor Pith review Pith/arXiv arXiv 1984

[3] [3]

Https://transformer- circuits.pub/2021/framework/index.html

A mathemati- cal framework for transformer circuits.Trans- former Circuits Thread. Https://transformer- circuits.pub/2021/framework/index.html. Allyson Ettinger

2021

[4] [4]

InProceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235

Dissecting recall of factual associa- tions in auto-regressive language models. InProceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others

2023

[5] [5]

The Llama 3 Herd of Models

The llama 3 herd of models.arXiv preprint arXiv:2407.21783. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée L...

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Mistral 7B

Mistral 7b.Preprint, arXiv:2310.06825. Nora Kassner and Hinrich Schütze

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

Redirected, not removed: Task- dependent stereotyping reveals the limits of LLM alignments.arXiv preprint arXiv:2604.02669. Jindong Li, Yali Fu, Li Fan, Jiahong Liu, Yao Shu, Chengwei Qin, Menglin Yang, Irwin King, and Rex Ying

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

R Thomas McCoy, Ellie Pavlick, and Tal Linzen

Implicit reasoning in large language models: A comprehensive survey.arXiv preprint arXiv:2509.02350. R Thomas McCoy, Ellie Pavlick, and Tal Linzen

work page arXiv

[9] [9]

Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C Love, Erin Grant, Iris Groen, Jascha Achterberg, and 1 others

Large language model align- ment: A survey.arXiv preprint arXiv:2309.15025. Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C Love, Erin Grant, Iris Groen, Jascha Achterberg, and 1 others

work page arXiv

[10] [10]

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, and 1 others

Getting aligned on representational alignment.arXiv preprint arXiv:2310.13018. Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, and 1 others

work page arXiv

[11] [11]

Gemma: Open Models Based on Gemini Research and Technology

Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295. Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Steering Language Models With Activation Engineering

Steering language mod- els with activation engineering.arXiv preprint arXiv:2308.10248. 9 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson

Eras- ing concepts, steering generations: A comprehen- sive survey of concept suppression.arXiv preprint arXiv:2505.19398. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson

work page arXiv

[14] [14]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Univer- sal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043. Appendices Appendix A describes the concept library used in our experiments. Appendix B presents representa- tive prompt templates and conditions. Appendix C reports additional descriptive statistics correspond- ing to the analyses discussed in the...

work page internal anchor Pith review Pith/arXiv arXiv