pith. sign in

arxiv: 2605.28639 · v1 · pith:7LUALLIUnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI

The Attentional White Bear Effect in Transformer Language Models

Pith reviewed 2026-06-29 12:40 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords language modelsinstruction suppressionrepresentational probingattention analysissemantic leakageprohibited contentmodel alignment
0
0 comments X

The pith

Prohibited concepts remain internally active in language models even when their output is suppressed by instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether telling language models to avoid prohibited content actually removes those ideas from their thinking or merely stops them from saying the words. Researchers used tests to read out hidden information from the models, look at how attention is directed, and see if forbidden ideas leak into other parts of the responses. They discovered that the concepts stay easy to detect inside the model, keep affecting where the model pays attention, and still influence what the model generates even when direct mentions are avoided. This holds true no matter how the information is pooled or which models are tested. The result points to a mismatch between what the model does on the surface and what it represents underneath.

Core claim

Prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families, exposing a fundamental gap between behavioral and representational alignment.

What carries the argument

The attentional white bear effect, in which instruction-based suppression of a concept keeps it active in hidden states and attention patterns rather than removing it.

If this is right

  • Prohibited concepts can be recovered from model internals despite output suppression.
  • Attention mechanisms continue to route based on suppressed concepts.
  • Downstream text generation is influenced by these concepts in non-lexical ways.
  • The effects are consistent across different analysis methods and model types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety techniques relying only on output instructions may leave internal concept processing intact.
  • Direct interventions on hidden representations could be needed to achieve fuller suppression.
  • The same internal persistence might occur in non-text modalities or other model types.

Load-bearing premise

The methods for probing representations, analyzing attention, and checking semantic leakage isolate the effects of suppression without introducing their own artifacts or biases.

What would settle it

An experiment in which prohibited concepts become unrecoverable from hidden representations under suppression, show no measurable influence on attention routing, and produce no semantic leakage in generations while still achieving lexical avoidance.

Figures

Figures reproduced from arXiv: 2605.28639 by Brian Scassellati, Rebecca Ramnauth.

Figure 1
Figure 1. Figure 1: Layerwise recoverability. Last-token pooling (left) reveals strongly transient early-layer suppression dynamics, whereas mean non-padding pooling (right) demonstrates that suppression-related recoverability remains broadly distributed across much of the transformer stack. Shaded regions indicate 95% CIs. The strongest effect emerged at layer 4 (∆ = 0.810, 95% CI [0.758, 0.861], d = 2.66, p < .001). 4.3 Ind… view at source ↗
Figure 2
Figure 2. Figure 2: Attention allocation. Aggregate layerwise attention differences were modest overall (top), but head specialization emerged in middle layers (bottom). Bar heights indicate the largest positive head-level suppres￾sion effect within each layer. Numbers above bars de￾note the number of heads with significantly positive suppression effects (95% CIs entirely above zero). that global attention allocation differen… view at source ↗
Figure 4
Figure 4. Figure 4: Cross-model replication. Elevated recov￾erability under indirect suppression generalizes across architectures, suggesting that suppression-related se￾mantic persistence is not model-specific. by direct lexical repetition within the prompt itself. Semantic drift also varied across concepts. Con￾cepts like “knife,” “password,” and “self-harm” ex￾hibited stronger drift under suppression than other concepts (A… view at source ↗
Figure 6
Figure 6. Figure 6: Layer-region suppression salience across model families. All models exhibited significantly lower suppression salience in middle layers than in early or late layers, revealing a non-monotonic suppression profile across transformer architectures [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Condition ordering across model families. Mean probe scores under concept-absent, indirect sup￾pression, and direct mention conditions across models. All pairwise contrasts were significant at p < .001. 11 [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
read the original abstract

Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. Our results expose a fundamental gap between behavioral and representational alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper investigates whether instruction-based suppression of prohibited concepts in transformer language models eliminates their internal representations or merely blocks lexical output. Through representational probing, attention pattern analysis, and semantic leakage experiments on downstream generations, the authors claim that prohibited concepts remain highly recoverable from hidden states, continue to shape attention routing, and measurably influence generated text despite successful lexical avoidance. These effects are reported to persist across pooling strategies, indirect semantic controls, and multiple model families, indicating a gap between behavioral suppression and representational alignment.

Significance. If the central empirical claims hold after methodological clarification, the work would be significant for AI alignment research by demonstrating that common suppression techniques are superficial at the representational level. The multi-model evaluation and use of indirect controls represent strengths that could help distinguish suppression-specific effects from general model knowledge. However, the absence of quantitative effect sizes, error bars, or explicit validation metrics in the provided abstract limits immediate assessment of practical impact.

major comments (3)
  1. [§3] §3 (Representational Probing): The probing classifiers appear to be trained on data containing the target concept outside the suppression regime. This setup risks attributing recoverability to general pretraining knowledge rather than persistence under suppression instructions; a within-regime training control or explicit comparison to non-suppressed baselines is needed to support the claim that suppression fails to reduce internal representations.
  2. [§4.1] §4.1 (Indirect Semantic Controls): No quantitative validation (e.g., cosine similarity thresholds, embedding distances, or human ratings) is provided to confirm that the indirect controls match all semantic dimensions except the suppression instruction. Without this, observed semantic leakage in generations could arise from residual semantic overlap rather than the attentional white bear effect.
  3. [§5] §5 (Attention Analysis): The criteria for identifying heads that 'continue to influence attention routing' under suppression are unspecified. If heads were selected post-hoc based on the presence of the effect, this introduces selection bias that undermines the claim of continued influence independent of experimental design choices.
minor comments (2)
  1. The abstract states effects 'persist across pooling strategies' but does not list the specific strategies or report per-strategy statistics; adding a table with these breakdowns would improve clarity.
  2. Figure captions for attention visualizations should explicitly state the suppression prompt template and model variant used in each panel.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and indicating where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: [§3] §3 (Representational Probing): The probing classifiers appear to be trained on data containing the target concept outside the suppression regime. This setup risks attributing recoverability to general pretraining knowledge rather than persistence under suppression instructions; a within-regime training control or explicit comparison to non-suppressed baselines is needed to support the claim that suppression fails to reduce internal representations.

    Authors: We appreciate this point. Our experiments train probes on hidden states extracted from inputs that include the suppression instruction, and we compare these to probes trained on non-suppressed inputs to show the persistence of the concept under suppression. To further address the concern about general pretraining knowledge, we will incorporate a within-regime training control in the revised manuscript, training probes exclusively on suppressed data and testing on held-out suppressed examples. revision: yes

  2. Referee: [§4.1] §4.1 (Indirect Semantic Controls): No quantitative validation (e.g., cosine similarity thresholds, embedding distances, or human ratings) is provided to confirm that the indirect controls match all semantic dimensions except the suppression instruction. Without this, observed semantic leakage in generations could arise from residual semantic overlap rather than the attentional white bear effect.

    Authors: We agree that additional quantitative validation would enhance the robustness of our indirect semantic controls. In the original manuscript, the controls were constructed to differ only in the suppression aspect based on semantic similarity measures, but we did not report explicit thresholds. We will add cosine similarity thresholds, embedding distance metrics, and details on how the controls were validated in the revised version. revision: yes

  3. Referee: [§5] §5 (Attention Analysis): The criteria for identifying heads that 'continue to influence attention routing' under suppression are unspecified. If heads were selected post-hoc based on the presence of the effect, this introduces selection bias that undermines the claim of continued influence independent of experimental design choices.

    Authors: The head selection was performed using a pre-specified criterion based on attention weight differences across conditions, applied consistently before examining the effect on generations. This was not post-hoc. We will explicitly state the selection criteria and thresholds in the methods section of the revised manuscript to clarify this process and mitigate concerns about selection bias. revision: yes

Circularity Check

0 steps flagged

Empirical observational study with no derivations or self-referential reductions

full rationale

The paper conducts representational probing, attention analysis, and semantic leakage experiments across models. No equations, fitted parameters called predictions, or derivation chains are present that reduce claims to inputs by construction. Central findings rest on direct experimental measurements rather than self-definitional or self-citation load-bearing steps. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5609 in / 1104 out tokens · 37652 ms · 2026-06-29T12:40:55.522390+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

14 extracted references · 12 canonical work pages · 8 internal anchors

  1. [1]

    Constitutional AI: Harmlessness from AI Feedback

    Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073. Yonatan Belinkov

  2. [2]

    Discovering Latent Knowledge in Language Models Without Supervision

    Discovering latent knowledge in lan- guage models without supervision.arXiv preprint arXiv:2212.03827. Daniel C Dennett. 1984.Cognitive wheels: the frame problem of AI. Minds, Machines and Evolution. Cam- bridge University Press Cambridge, UK:. Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai,...

  3. [3]

    Https://transformer- circuits.pub/2021/framework/index.html

    A mathemati- cal framework for transformer circuits.Trans- former Circuits Thread. Https://transformer- circuits.pub/2021/framework/index.html. Allyson Ettinger

  4. [4]

    InProceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235

    Dissecting recall of factual associa- tions in auto-regressive language models. InProceed- ings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others

  5. [5]

    The Llama 3 Herd of Models

    The llama 3 herd of models.arXiv preprint arXiv:2407.21783. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guil- laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée L...

  6. [6]

    Mistral 7B

    Mistral 7b.Preprint, arXiv:2310.06825. Nora Kassner and Hinrich Schütze

  7. [7]

    Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

    Redirected, not removed: Task- dependent stereotyping reveals the limits of LLM alignments.arXiv preprint arXiv:2604.02669. Jindong Li, Yali Fu, Li Fan, Jiahong Liu, Yao Shu, Chengwei Qin, Menglin Yang, Irwin King, and Rex Ying

  8. [8]

    R Thomas McCoy, Ellie Pavlick, and Tal Linzen

    Implicit reasoning in large language models: A comprehensive survey.arXiv preprint arXiv:2509.02350. R Thomas McCoy, Ellie Pavlick, and Tal Linzen

  9. [9]

    Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C Love, Erin Grant, Iris Groen, Jascha Achterberg, and 1 others

    Large language model align- ment: A survey.arXiv preprint arXiv:2309.15025. Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C Love, Erin Grant, Iris Groen, Jascha Achterberg, and 1 others

  10. [10]

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, and 1 others

    Getting aligned on representational alignment.arXiv preprint arXiv:2310.13018. Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, and 1 others

  11. [11]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295. Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid

  12. [12]

    Steering Language Models With Activation Engineering

    Steering language mod- els with activation engineering.arXiv preprint arXiv:2308.10248. 9 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin

  13. [13]

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson

    Eras- ing concepts, steering generations: A comprehen- sive survey of concept suppression.arXiv preprint arXiv:2505.19398. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson

  14. [14]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Univer- sal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043. Appendices Appendix A describes the concept library used in our experiments. Appendix B presents representa- tive prompt templates and conditions. Appendix C reports additional descriptive statistics correspond- ing to the analyses discussed in the...