
arxiv: 2602.09297 · v3 · pith:SAH3XESMnew · submitted 2026-02-10 · 💻 cs.LG

Laplacian Heads Improve Transformers by Smoothing Token Representations

Pith reviewed 2026-05-21 14:02 UTC · model grok-4.3

classification 💻 cs.LG
keywords transformers · Laplacian heads · attention mechanisms · token smoothing · neural collapse · self-supervised learning · graph diffusion

The pith

Replacing some attention heads with Laplacian heads improves transformers by smoothing token representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that inserting Laplacian heads, which replace the attention matrix P with I - P in the update rule, improves transformer models on supervised learning, language modeling, and self-supervised learning tasks. Two motivations are given: attention heads update the mean of the token representations while Laplacian heads directly control within-sequence variance, and the Laplacian update corresponds to one step of heat diffusion on the token graph. Sympathetic readers would care because the change is simple yet yields better performance while producing smoother representations with faster-decaying spectra. The work shows that in supervised learning the modification aligns sequence means with Neural Collapse geometry, in language modeling it increases the separability of token representations that share the same next-token prediction, and in self-supervised learning it yields principal components better suited for segmentation.

Core claim

By modifying the transformer update to $X \leftarrow X + \sum_{i \in \mathcal{A}} P^{(i)}XW_{V_i}W_{o_i} + \sum_{i \in \mathcal{L}} (I - P^{(i)})XW_{V_i}W_{o_i}$, where $\mathcal{A}$ indexes the attention heads and $\mathcal{L}$ the Laplacian heads, the model achieves improved performance across tasks. The split lets attention heads update the mean of the token representations while Laplacian heads directly control within-sequence variance. The Laplacian term can also be viewed as one step of heat diffusion on a graph whose nodes are tokens and whose edge weights are the attention weights. The reported consequences are token representations with faster spectral decay, within-sequence collapse in supervised learning with sequence means that match Neural Collapse geometry, increased separability of token representations that share the same next-token prediction in language modeling, and principal components better suited for segmentation in self-supervised learning.
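The update rule in the core claim can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`softmax_rows`, `attention_matrix`, `mixed_update`, `random_heads`), the scaled dot-product form of the scores, the absence of a causal mask, and the random weights are all assumptions made here for the example.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax, so every row of the result sums to 1."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def attention_matrix(X, Wq, Wk):
    """Softmax attention matrix P for one head (assumed scaled dot-product, no mask)."""
    d_head = Wq.shape[1]
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_head)
    return softmax_rows(scores)

def mixed_update(X, heads, laplacian_ids):
    """X <- X + sum_{i in A} P_i X Wv_i Wo_i + sum_{i in L} (I - P_i) X Wv_i Wo_i.

    heads: list of (Wq, Wk, Wv, Wo) tuples; laplacian_ids: set of head indices in L.
    """
    n = X.shape[0]
    out = X.copy()
    for i, (Wq, Wk, Wv, Wo) in enumerate(heads):
        P = attention_matrix(X, Wq, Wk)
        M = np.eye(n) - P if i in laplacian_ids else P  # Laplacian head vs. attention head
        out = out + M @ X @ Wv @ Wo
    return out

def random_heads(num_heads, d_model, d_head, seed=0):
    """Hypothetical random weights, only used to exercise the update."""
    rng = np.random.default_rng(seed)
    shapes = [(d_model, d_head)] * 3 + [(d_head, d_model)]
    return [tuple(rng.normal(scale=0.1, size=s) for s in shapes)
            for _ in range(num_heads)]
```

For example, `mixed_update(X, heads, {1})` treats head 1 as a Laplacian head and every other head as a standard attention head. Which indices belong in the set is exactly the free choice the review flags below.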

What carries the argument

The Laplacian head: a head that uses the graph Laplacian I - P, where P is the softmax attention matrix, in place of P within the multi-head attention residual update.

Load-bearing premise

The performance gains from Laplacian heads do not depend on the specific selection of which heads to replace and come without sacrificing the long-range dependency modeling that standard attention provides.

What would settle it

If experiments show that adding Laplacian heads fails to increase next-token prediction separability in language models or does not lead to faster spectral decay in token representations, the proposed benefits would be falsified.
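One way such a check on spectral decay could be run is sketched below. The paper's exact metric is not stated on this page, so the choices here are assumptions: the spectrum is taken as the singular values of the column-centered token matrix, and `effective_rank` (the exponential of the entropy of the normalized spectrum) is used as a single-number summary, with a lower value indicating faster decay. Both function names are hypothetical.

```python
import numpy as np

def normalized_spectrum(X):
    """Singular values of the column-centered token matrix, scaled to sum to 1.

    X has shape (num_tokens, dim). Assumes the tokens are not all identical,
    otherwise the singular values sum to zero.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Xc, compute_uv=False)
    return s / s.sum()

def effective_rank(X):
    """exp(entropy) of the normalized spectrum; smaller means faster decay."""
    p = normalized_spectrum(X)
    p = p[p > 0]  # drop exact zeros before taking the log
    return float(np.exp(-(p * np.log(p)).sum()))
```

Comparing `effective_rank` on token representations from a baseline model and from a model with Laplacian heads, at matched layers and inputs, would give one quantitative version of the "faster-decaying spectra" claim.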

read the original abstract

Transformers update token representations through multi-head attention and residual connections as $X \leftarrow X + \sum_{i} P^{(i)}XW_{V_i}W_{o_i}$, where $P^{(i)}$ is the softmax attention matrix in head $i$. We propose replacing a subset of $P^{(i)}$'s with the Laplacian $I - P^{(i)}$, giving $X \leftarrow X + \sum_{i \in \mathcal{A}} P^{(i)}XW_{V_i}W_{o_i} + \sum_{i \in \mathcal{L}} (I - P^{(i)})XW_{V_i}W_{o_i}$. Our proposal has two motivations. First, it allows attention heads to update the mean of token representations, while Laplacian heads can directly control within-sequence variance. Second, if tokens are viewed as nodes in a graph with edge weights $P^{(i)}$, then $I - P^{(i)}$ is the corresponding graph Laplacian, and the update can be interpreted as one step of heat diffusion on the graph. We show that this simple modification improves performance across supervised learning, language modeling, and self-supervised learning tasks. To investigate why, we examine the token representations learned with and without Laplacian heads. In supervised learning, Laplacian heads collapse token representations within the same sequence and align the sequence means with the geometry of Neural Collapse. In language modeling, they increase the separability of token representations that share the same next-token prediction. In self-supervised learning, they produce token representations whose principal components are better suited for segmentation. Across modalities, they also lead to faster-decaying spectra, indicating stronger token smoothing. Overall, our findings challenge the prevailing view that token oversmoothing is inherently harmful, showing instead that certain forms of smoothing can be beneficial.
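The first motivation in the abstract can be unpacked with a short calculation. This is a sketch under stated assumptions, not the authors' derivation: $P$ is treated as fixed and row-stochastic, so $P\mathbf{1} = \mathbf{1}$ (which holds for a softmax over keys), $n$ is the number of tokens, and the symbols $\mu$ and $\tilde{X}$ are introduced here for illustration.

$$
\begin{aligned}
X &= \mathbf{1}\mu^\top + \tilde{X}, \qquad \mu = \tfrac{1}{n} X^\top \mathbf{1}, \\
P X &= \mathbf{1}\mu^\top + P\tilde{X}, \\
(I - P) X &= (I - P)\tilde{X}, \qquad \text{since } (I - P)\mathbf{1} = 0.
\end{aligned}
$$

Under these assumptions an attention head passes the sequence mean $\mu$ through to its output, while a Laplacian head removes it and acts only on the deviations $\tilde{X}$ of tokens from that mean. In a real model $P$ itself depends on $X$, so this separation is a reading of the update for a given $P$ rather than a statement about the full nonlinear map.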

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes replacing a subset of attention heads in Transformers with Laplacian heads that substitute the graph Laplacian (I - P) for the standard attention matrix P in the residual update. Motivated by mean/variance control and heat diffusion on token graphs, the modification is shown to yield performance gains on supervised, language modeling, and self-supervised tasks. Mechanistic analyses indicate faster spectral decay, alignment with Neural Collapse geometry for sequence means, improved next-token separability, and better-suited principal components for segmentation.

Significance. If the empirical gains and mechanistic observations hold under rigorous controls, the work would be significant for reframing oversmoothing as potentially beneficial rather than inherently harmful in Transformers. The cross-task results and spectral/geometric analyses offer a concrete alternative to standard attention that could influence representation learning research, though the current evidence base is primarily observational.

major comments (3)
  1. Abstract and method description: the update X ← X + ∑_{i∈A} P X W_V W_o + ∑_{i∈L} (I-P) X W_V W_o leaves the choice of subset L (which heads to replace and at what fraction/positions) unspecified. Without ablations demonstrating robustness across different selections or showing that gains persist when Laplacian heads are placed in later layers, the central claim that Laplacian heads provide a general smoothing benefit risks being an artifact of preserving long-range capacity in specific positions.
  2. Experimental results: the abstract and results claim consistent gains across tasks, yet no quantitative effect sizes, confidence intervals, or statistical tests are reported, and no ablation table examines the number or layer placement of Laplacian heads. This directly weakens the cross-task generalization and the interpretation that the benefit arises from the Laplacian operator itself.
  3. Analysis of token representations: claims that Laplacian heads produce faster-decaying spectra, align sequence means with Neural Collapse geometry, or improve next-token separability rest on post-hoc spectral and geometric observations without quantitative metrics, baseline comparisons, or controls for the head-selection procedure.
minor comments (1)
  1. The notation distinguishing sets A and L could be introduced with an explicit example or diagram in the method section to improve clarity for readers unfamiliar with the construction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our work. We address each of the major comments below and indicate the revisions we plan to make to strengthen the manuscript.

read point-by-point responses
  1. Referee: Abstract and method description: the update X ← X + ∑_{i∈A} P X W_V W_o + ∑_{i∈L} (I-P) X W_V W_o leaves the choice of subset L (which heads to replace and at what fraction/positions) unspecified. Without ablations demonstrating robustness across different selections or showing that gains persist when Laplacian heads are placed in later layers, the central claim that Laplacian heads provide a general smoothing benefit risks being an artifact of preserving long-range capacity in specific positions.

    Authors: We agree that a more detailed specification and analysis of the subset L is necessary to support the generality of our claims. In the revised manuscript, we will explicitly describe the selection procedure employed in our experiments (replacing a fixed proportion of heads distributed across layers). Furthermore, we will add ablation studies varying the fraction of Laplacian heads and their layer positions, including placements in later layers, to demonstrate robustness. These additions will clarify that the benefits arise from the Laplacian operator rather than specific positional choices. revision: yes

  2. Referee: Experimental results: the abstract and results claim consistent gains across tasks, yet no quantitative effect sizes, confidence intervals, or statistical tests are reported, and no ablation table examines the number or layer placement of Laplacian heads. This directly weakens the cross-task generalization and the interpretation that the benefit arises from the Laplacian operator itself.

    Authors: We acknowledge that the original presentation lacked quantitative effect sizes, confidence intervals, and formal statistical tests. We will revise the results section to include these metrics for the reported performance improvements across tasks. Additionally, we will incorporate an ablation table that systematically examines the effects of different numbers and layer placements of Laplacian heads. This will provide a more rigorous basis for the cross-task claims and strengthen the attribution to the Laplacian modification. revision: yes

  3. Referee: Analysis of token representations: claims that Laplacian heads produce faster-decaying spectra, align sequence means with Neural Collapse geometry, or improve next-token separability rest on post-hoc spectral and geometric observations without quantitative metrics, baseline comparisons, or controls for the head-selection procedure.

    Authors: The mechanistic analyses are intended to offer insights into the observed empirical benefits rather than standalone proofs. To address the concern, we will enhance the analysis section by introducing quantitative metrics for spectral decay, Neural Collapse alignment, and next-token separability. We will also include direct comparisons to standard Transformer baselines and specify the controls applied to the head-selection process in these experiments. These revisions will make the observations more quantitative and controlled. revision: yes

Circularity Check

0 steps flagged

No circularity: proposal and empirical claims are self-contained

full rationale

The paper introduces an architectural modification by replacing a subset of attention matrices P with I - P, motivated by direct algebraic properties (mean update and graph Laplacian diffusion) and standard graph theory. All subsequent claims about performance gains and representation properties (faster spectral decay, Neural Collapse alignment, next-token separability) are presented as empirical observations from experiments rather than derived predictions. No equations reduce a result to a fitted parameter by construction, no self-citations serve as load-bearing uniqueness theorems, and no ansatz is smuggled in. The derivation chain relies on explicit definitions and external validation, remaining independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the standard Transformer residual update, the algebraic identity that I-P is the graph Laplacian when P is row-stochastic, and the empirical observation that faster spectral decay correlates with the reported benefits. No new particles or forces are postulated.

free parameters (1)
  • which heads are replaced
    The paper selects a subset of heads to become Laplacian; the choice is a modeling decision that must be tuned or validated.
axioms (1)
  • domain assumption: tokens form nodes of a graph whose edge weights are given by the attention matrix P
    Invoked to interpret the I-P update as heat diffusion.
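The heat-diffusion reading can be illustrated numerically. The sketch below makes several assumptions that go beyond the text: $P$ is held fixed across steps (a transformer recomputes it per layer), the value and output projections are collapsed to $W_V W_o = -\tau I$ so that the Laplacian term becomes an explicit Euler step of $dX/dt = -(I - P)X$, and the names `diffusion_step` and `token_spread` are hypothetical.

```python
import numpy as np

def row_softmax(S):
    """Row-wise softmax, giving a row-stochastic matrix with positive entries."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def diffusion_step(X, P, tau=0.5):
    """One explicit Euler step of dX/dt = -(I - P) X: X <- X - tau (I - P) X."""
    n = P.shape[0]
    return X - tau * (np.eye(n) - P) @ X

def token_spread(X):
    """Largest per-feature range (max minus min) across tokens."""
    return float((X.max(axis=0) - X.min(axis=0)).max())

rng = np.random.default_rng(0)
P = row_softmax(0.5 * rng.normal(size=(6, 6)))  # stand-in attention matrix
X0 = rng.normal(size=(6, 4))                    # 6 tokens, 4 features

X = X0
for _ in range(200):
    X = diffusion_step(X, P)
```

For $0 \le \tau \le 1$ each step replaces every token with a convex combination of the tokens, so `token_spread` cannot increase, and repeated steps with a fixed positive $P$ pull the tokens toward a common value. This is the smoothing behavior the axiom is invoked to explain.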

pith-pipeline@v0.9.0 · 5858 in / 1205 out tokens · 57714 ms · 2026-05-21T14:02:51.819560+00:00 · methodology

