Laplacian Heads Improve Transformers by Smoothing Token Representations

Vardan Papyan; Yuchong Zhang

arxiv: 2602.09297 · v3 · pith:SAH3XESMnew · submitted 2026-02-10 · 💻 cs.LG

Pith reviewed 2026-05-21 14:02 UTC · model grok-4.3

classification 💻 cs.LG

keywords: transformers, Laplacian heads, attention mechanisms, token smoothing, neural collapse, self-supervised learning, graph diffusion

The pith

Replacing some attention heads with Laplacian heads improves transformers by smoothing token representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that inserting Laplacian heads, which replace the attention matrix P with I - P in the update rule, improves transformer models on supervised learning, language modeling, and self-supervised learning tasks. The stated motivation is twofold: attention heads update the mean of the token representations while Laplacian heads directly control within-sequence variance, and a Laplacian head corresponds to one step of heat diffusion on the token graph. Sympathetic readers would care because the change is simple yet yields better performance while producing smoother representations with faster-decaying spectra. The work shows that in supervised learning Laplacian heads collapse token representations within a sequence and align the sequence means with Neural Collapse geometry, in language modeling they increase the separability of token representations that share the same next-token prediction, and in self-supervised learning they yield principal components better suited for segmentation.

Core claim

By modifying the transformer update to $X \leftarrow X + \sum_{i \in \mathcal{A}} P^{(i)}XW_{V_i}W_{o_i} + \sum_{i \in \mathcal{L}} (I - P^{(i)})XW_{V_i}W_{o_i}$, where $\mathcal{A}$ indexes the attention heads and $\mathcal{L}$ the Laplacian heads, the model achieves improved performance across tasks. Attention heads update the mean of the token representations, while Laplacian heads directly control within-sequence variance; the Laplacian update can also be viewed as heat diffusion on a graph whose nodes are tokens and whose edge weights are the attention weights. This leads to token representations with faster spectral decay, within-sequence collapse in supervised learning with sequence means that match Neural Collapse geometry, increased separability of token representations that share the same next-token prediction in language modeling, and principal components better suited for segmentation in self-supervised learning.
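
The update rule in the core claim can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single sequence, omits causal masking and layer normalization, and uses random weights. The names `mixed_head_update`, `random_heads`, and `laplacian_ids` are hypothetical, and the choice of which heads go into `laplacian_ids` is arbitrary here.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def mixed_head_update(X, heads, laplacian_ids):
    """One residual update mixing attention heads and Laplacian heads.

    X             : (n_tokens, d_model) token representations.
    heads         : list of dicts with weights "Wq", "Wk", "Wv", "Wo".
    laplacian_ids : set of head indices that use I - P instead of P
                    (an arbitrary choice in this sketch).
    """
    n_tokens = X.shape[0]
    identity = np.eye(n_tokens)
    update = np.zeros_like(X)
    for i, h in enumerate(heads):
        d_head = h["Wq"].shape[1]
        scores = (X @ h["Wq"]) @ (X @ h["Wk"]).T / np.sqrt(d_head)
        P = softmax(scores, axis=-1)  # row-stochastic attention matrix
        mixing = identity - P if i in laplacian_ids else P
        update += mixing @ X @ h["Wv"] @ h["Wo"]
    return X + update


def random_heads(n_heads, d_model, d_head, rng):
    """Hypothetical helper: random weights, for illustration only."""
    shapes = {"Wq": (d_model, d_head), "Wk": (d_model, d_head),
              "Wv": (d_model, d_head), "Wo": (d_head, d_model)}
    return [{name: rng.normal(scale=0.1, size=shape)
             for name, shape in shapes.items()}
            for _ in range(n_heads)]


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
heads = random_heads(n_heads=4, d_model=8, d_head=2, rng=rng)
X_new = mixed_head_update(X, heads, laplacian_ids={2, 3})
```

One property that follows directly from the construction: if every token in the sequence is identical, each Laplacian head contributes exactly zero, because the rows of I - P sum to zero.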

What carries the argument

The Laplacian head, which uses the graph Laplacian $I - P$ (the identity minus the softmax attention matrix $P$) in place of $P$ within the multi-head attention residual update.

Load-bearing premise

The performance gains from Laplacian heads do not depend on the specific selection of which heads to replace and come without sacrificing the long-range dependency modeling that standard attention provides.

What would settle it

If experiments show that adding Laplacian heads fails to increase next-token prediction separability in language models or does not lead to faster spectral decay in token representations, the proposed benefits would be falsified.

Original abstract

Transformers update token representations through multi-head attention and residual connections as $X \leftarrow X + \sum_{i} P^{(i)}XW_{V_i}W_{o_i}$, where $P^{(i)}$ is the softmax attention matrix in head $i$. We propose replacing a subset of $P^{(i)}$'s with the Laplacian $I - P^{(i)}$, giving $X \leftarrow X + \sum_{i \in \mathcal{A}} P^{(i)}XW_{V_i}W_{o_i} + \sum_{i \in \mathcal{L}} (I - P^{(i)})XW_{V_i}W_{o_i}$. Our proposal has two motivations. First, it allows attention heads to update the mean of token representations, while Laplacian heads can directly control within-sequence variance. Second, if tokens are viewed as nodes in a graph with edge weights $P^{(i)}$, then $I - P^{(i)}$ is the corresponding graph Laplacian, and the update can be interpreted as one step of heat diffusion on the graph. We show that this simple modification improves performance across supervised learning, language modeling, and self-supervised learning tasks. To investigate why, we examine the token representations learned with and without Laplacian heads. In supervised learning, Laplacian heads collapse token representations within the same sequence and align the sequence means with the geometry of Neural Collapse. In language modeling, they increase the separability of token representations that share the same next-token prediction. In self-supervised learning, they produce token representations whose principal components are better suited for segmentation. Across modalities, they also lead to faster-decaying spectra, indicating stronger token smoothing. Overall, our findings challenge the prevailing view that token oversmoothing is inherently harmful, showing instead that certain forms of smoothing can be beneficial.
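
The two motivations in the abstract can be made concrete with a short derivation. It relies only on the fact that softmax makes each $P$ row-stochastic. The decomposition $X = \mathbf{1}\mu^\top + \tilde{X}$ and the special case $W_V W_o = -tI$ with step size $t$ are illustrative assumptions introduced here, not settings stated in the abstract.

```latex
\begin{align*}
% Row-stochasticity: every row of P sums to one.
P\mathbf{1} = \mathbf{1}
  \;&\Longrightarrow\; (I - P)\mathbf{1} = \mathbf{0}, \\
% Split X into a component shared by all tokens and a deviation.
X = \mathbf{1}\mu^\top + \tilde{X}
  \;&\Longrightarrow\; (I - P)X = (I - P)\tilde{X}, \\
% Single Laplacian head with W_V W_o = -tI (illustrative special case).
X \leftarrow X - t\,(I - P)X \;&=\; (1 - t)\,X + t\,PX .
\end{align*}
```

The second line says a Laplacian head sees only the deviations $\tilde{X}$ and ignores any component shared by all tokens. The third line is one explicit Euler step of $\dot{X} = -(I - P)X$, which for $0 \le t \le 1$ replaces each token by a convex combination of itself and its attention-weighted average.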

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Laplacian heads show gains across tasks via controlled smoothing, but the benefits may hinge on unablated choices of which heads to replace.

The main takeaway from this paper is that replacing some of the attention heads with Laplacian versions improves performance on supervised, language modeling, and self-supervised tasks, which suggests that certain kinds of smoothing can be beneficial for token representations. The authors motivate the change by noting that standard heads can shift the mean while Laplacian heads control variance, and they give a graph diffusion reading. The empirical side covers multiple settings and includes checks on spectra, Neural Collapse alignment, and separability.

The substitution itself, together with those interpretations, is the novel part here. What the authors do well is test the idea broadly enough to show it is not limited to one domain.

The soft spots center on the head selection process. The paper doesn't detail how the subset or its fraction is chosen, and as the stress test highlights, this matters because the advantage might only appear when Laplacian heads are used in specific positions while standard ones handle long-range dependencies elsewhere. Without those ablations, the claim that Laplacian heads generally improve things via smoothing is a bit loose. The lack of quantitative effect sizes and statistics in the abstract also keeps the strength of the evidence moderate.

This is the kind of work that would appeal to people experimenting with transformer variants or studying representation properties in deep models. It has enough substance that a reader could build on the idea or try the modification themselves. I would recommend sending it out for peer review. The proposal is simple and the results point to something worth investigating further, so referees can help sort out the experimental gaps.

Referee Report

3 major / 1 minor

Summary. The paper proposes replacing a subset of attention heads in Transformers with Laplacian heads that substitute the graph Laplacian (I - P) for the standard attention matrix P in the residual update. Motivated by mean/variance control and heat diffusion on token graphs, the modification is shown to yield performance gains on supervised, language modeling, and self-supervised tasks. Mechanistic analyses indicate faster spectral decay, alignment with Neural Collapse geometry for sequence means, improved next-token separability, and better-suited principal components for segmentation.

Significance. If the empirical gains and mechanistic observations hold under rigorous controls, the work would be significant for reframing oversmoothing as potentially beneficial rather than inherently harmful in Transformers. The cross-task results and spectral/geometric analyses offer a concrete alternative to standard attention that could influence representation learning research, though the current evidence base is primarily observational.

major comments (3)

Abstract and method description: the update $X \leftarrow X + \sum_{i \in \mathcal{A}} P^{(i)}XW_{V_i}W_{o_i} + \sum_{i \in \mathcal{L}} (I - P^{(i)})XW_{V_i}W_{o_i}$ leaves the choice of subset $\mathcal{L}$ (which heads to replace and at what fraction/positions) unspecified. Without ablations demonstrating robustness across different selections or showing that gains persist when Laplacian heads are placed in later layers, the central claim that Laplacian heads provide a general smoothing benefit risks being an artifact of preserving long-range capacity in specific positions.
Experimental results: the abstract and results claim consistent gains across tasks, yet no quantitative effect sizes, confidence intervals, or statistical tests are reported, and no ablation table examines the number or layer placement of Laplacian heads. This directly weakens the cross-task generalization and the interpretation that the benefit arises from the Laplacian operator itself.
Analysis of token representations: claims that Laplacian heads produce faster-decaying spectra, align sequence means with Neural Collapse geometry, or improve next-token separability rest on post-hoc spectral and geometric observations without quantitative metrics, baseline comparisons, or controls for the head-selection procedure.
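
One way a quantitative spectral-decay metric could look is sketched below. This is an assumption offered for illustration, not the metric used in the paper: `top_k_energy` is a hypothetical name, and it reports the fraction of squared singular-value mass of the token matrix captured by the top `k` components. The toy comparison applies a uniform row-stochastic matrix, which equals a full diffusion step $X - (I - P)X = PX$.

```python
import numpy as np


def top_k_energy(X, k=1):
    """Fraction of squared singular-value mass in the top-k components.

    Values near 1 mean the spectrum decays quickly (tokens are smooth);
    values near k / rank mean the spectrum is flat.
    """
    singular_values = np.linalg.svd(X, compute_uv=False)  # descending order
    energy = singular_values ** 2
    return float(energy[:k].sum() / energy.sum())


# Toy token matrix with a perfectly flat spectrum: four orthogonal tokens.
X = np.eye(4)

# Uniform attention: every token attends equally to all four tokens.
P_uniform = np.full((4, 4), 0.25)

before = top_k_energy(X, k=1)              # flat spectrum
after = top_k_energy(P_uniform @ X, k=1)   # all tokens equal the mean
```

In this toy case the flat spectrum gives a top-1 fraction of 0.25, and after uniform averaging every token equals the sequence mean, so the matrix has rank one and the fraction rises to 1. Reporting such a number for matched models with and without Laplacian heads would be one form of the baseline comparison requested above.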

minor comments (1)

The notation distinguishing sets A and L could be introduced with an explicit example or diagram in the method section to improve clarity for readers unfamiliar with the construction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our work. We address each of the major comments below and indicate the revisions we plan to make to strengthen the manuscript.

Referee: Abstract and method description: the update $X \leftarrow X + \sum_{i \in \mathcal{A}} P^{(i)}XW_{V_i}W_{o_i} + \sum_{i \in \mathcal{L}} (I - P^{(i)})XW_{V_i}W_{o_i}$ leaves the choice of subset $\mathcal{L}$ (which heads to replace and at what fraction/positions) unspecified. Without ablations demonstrating robustness across different selections or showing that gains persist when Laplacian heads are placed in later layers, the central claim that Laplacian heads provide a general smoothing benefit risks being an artifact of preserving long-range capacity in specific positions.

Authors: We agree that a more detailed specification and analysis of the subset $\mathcal{L}$ is necessary to support the generality of our claims. In the revised manuscript, we will explicitly describe the selection procedure employed in our experiments (replacing a fixed proportion of heads distributed across layers). Furthermore, we will add ablation studies varying the fraction of Laplacian heads and their layer positions, including placements in later layers, to demonstrate robustness. These additions will clarify that the benefits arise from the Laplacian operator rather than specific positional choices.
Revision: yes

Referee: Experimental results: the abstract and results claim consistent gains across tasks, yet no quantitative effect sizes, confidence intervals, or statistical tests are reported, and no ablation table examines the number or layer placement of Laplacian heads. This directly weakens the cross-task generalization and the interpretation that the benefit arises from the Laplacian operator itself.

Authors: We acknowledge that the original presentation lacked quantitative effect sizes, confidence intervals, and formal statistical tests. We will revise the results section to include these metrics for the reported performance improvements across tasks. Additionally, we will incorporate an ablation table that systematically examines the effects of different numbers and layer placements of Laplacian heads. This will provide a more rigorous basis for the cross-task claims and strengthen the attribution to the Laplacian modification.
Revision: yes

Referee: Analysis of token representations: claims that Laplacian heads produce faster-decaying spectra, align sequence means with Neural Collapse geometry, or improve next-token separability rest on post-hoc spectral and geometric observations without quantitative metrics, baseline comparisons, or controls for the head-selection procedure.

Authors: The mechanistic analyses are intended to offer insights into the observed empirical benefits rather than standalone proofs. To address the concern, we will enhance the analysis section by introducing quantitative metrics for spectral decay, Neural Collapse alignment, and next-token separability. We will also include direct comparisons to standard Transformer baselines and specify the controls applied to the head-selection process in these experiments. These revisions will make the observations more quantitative and controlled.
Revision: yes

Circularity Check

0 steps flagged

No circularity: proposal and empirical claims are self-contained

The paper introduces an architectural modification by replacing a subset of attention matrices P with I - P, motivated by direct algebraic properties (mean update and graph Laplacian diffusion) and standard graph theory. All subsequent claims about performance gains and representation properties (faster spectral decay, Neural Collapse alignment, next-token separability) are presented as empirical observations from experiments rather than derived predictions. No equations reduce a result to a fitted parameter by construction, no self-citations serve as load-bearing uniqueness theorems, and no ansatz is smuggled in. The derivation chain relies on explicit definitions and external validation, remaining independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the standard Transformer residual update, the algebraic identity that I-P is the graph Laplacian when P is row-stochastic, and the empirical observation that faster spectral decay correlates with the reported benefits. No new particles or forces are postulated.

free parameters (1)

which heads are replaced
The paper selects a subset of heads to become Laplacian; the choice is a modeling decision that must be tuned or validated.

axioms (1)

domain assumption: Tokens form nodes of a graph whose edge weights are given by the attention matrix P
Invoked to interpret the I-P update as heat diffusion.
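
The graph reading behind this axiom can be checked numerically. The sketch below is an illustration under stated assumptions: the edge weights are random positive numbers normalized per row to stand in for softmax attention, and `heat_diffusion_step` with its step size `t` is a hypothetical name and parameter, not something defined in the paper.

```python
import numpy as np


def heat_diffusion_step(X, P, t=0.5):
    """Explicit Euler step X - t (I - P) X, equal to (1 - t) X + t P X."""
    n_tokens = P.shape[0]
    laplacian = np.eye(n_tokens) - P
    return X - t * (laplacian @ X)


rng = np.random.default_rng(0)

# Hypothetical positive edge weights, normalized so each row sums to one,
# which mimics the row-stochastic structure of a softmax attention matrix.
A = rng.random((5, 5))
P = A / A.sum(axis=1, keepdims=True)
X = rng.normal(size=(5, 3))

# Rows of the Laplacian I - P sum to zero.
row_sums = (np.eye(5) - P).sum(axis=1)

# For 0 <= t <= 1 each new token is a convex combination of old tokens,
# so the per-coordinate spread across tokens cannot grow.
X_next = heat_diffusion_step(X, P, t=0.5)
spread_before = X.max(axis=0) - X.min(axis=0)
spread_after = X_next.max(axis=0) - X_next.min(axis=0)
```

The convex-combination argument guarantees that the spread does not grow for any row-stochastic P with nonnegative entries and any step size between 0 and 1. It says nothing about learned $W_V W_o$, which can scale or flip the sign of the Laplacian term.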

pith-pipeline@v0.9.0 · 5858 in / 1205 out tokens · 57714 ms · 2026-05-21T14:02:51.819560+00:00 · methodology

Review history: 2 revisions

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Repeated application of the normalized graph Laplacian ... drives token representations toward their locally averaged state, reducing within-sequence variance.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.